Introduction
This project is part of the learning-by-building exercise for the regression model section. The main objective is to apply the regression methods we have learned in class. The specific goal is to predict the heating load of residential buildings based on the following building characteristics: relative compactness, surface area, wall area, roof area, height, orientation, glazing area, and glazing area distribution. Predicting the heating load is important for saving energy and improving system operating efficiency.
Data Set Information
The data was created by Angeliki Xifara (Civil/Structural Engineer) and processed by Athanasios Tsanas (Oxford Centre for Industrial and Applied Mathematics, University of Oxford, UK). It was generated from simulations of 12 different building shapes in Ecotect. The buildings differ in compactness, surface area, and roof area, among other variables. The data set contains eight predictor variables and two outcomes, denoted Y1 and Y2. In this project, we predict only one outcome, the heating load (Y1). The data can be found here.
Description for each parameter:
- X1 Relative Compactness
- X2 Surface Area
- X3 Wall Area
- X4 Roof Area
- X5 Overall Height
- X6 Orientation
- X7 Glazing Area
- X8 Glazing Area Distribution
- Y1 Heating Load
- Y2 Cooling Load
Data Preparation
Attach Libraries
Attached below are all the libraries used in this project.
library(dplyr)
library(ggplot2)
library(tidyverse)
library(tidyr)
library(GGally)
library(corrplot)
library(hrbrthemes)
library(wesanderson)
library(MASS)
library(car)
library(lmtest)
library(MLmetrics)
library(performance)
library(quantreg)
library(caret)
library(elasticnet)
library(knitr)
library(matrixStats)
library(plotly)
library(glue)
Read Data Frame
Read the data and assign it to an object called energy.
energy <- read.csv("ENB2012_data.csv")
Observe Data Frame
Observe the data frame using glimpse().
- The data set has 768 rows and 10 columns.
- All columns are numeric.
glimpse(energy)
## Rows: 768
## Columns: 10
## $ X1 <dbl> 0.98, 0.98, 0.98, 0.98, 0.90, 0.90, 0.90, 0.90, 0.86, 0.86, 0.86, 0…
## $ X2 <dbl> 514.5, 514.5, 514.5, 514.5, 563.5, 563.5, 563.5, 563.5, 588.0, 588.…
## $ X3 <dbl> 294.0, 294.0, 294.0, 294.0, 318.5, 318.5, 318.5, 318.5, 294.0, 294.…
## $ X4 <dbl> 110.25, 110.25, 110.25, 110.25, 122.50, 122.50, 122.50, 122.50, 147…
## $ X5 <dbl> 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.…
## $ X6 <int> 2, 3, 4, 5, 2, 3, 4, 5, 2, 3, 4, 5, 2, 3, 4, 5, 2, 3, 4, 5, 2, 3, 4…
## $ X7 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ X8 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Y1 <dbl> 15.55, 15.55, 15.55, 15.55, 20.84, 21.46, 20.71, 19.68, 19.50, 19.9…
## $ Y2 <dbl> 21.33, 21.33, 21.33, 21.33, 28.28, 25.38, 25.16, 29.60, 27.30, 21.9…
Change column names
The column names are a bit confusing. Let's rename them to their descriptive names.
energy<-
energy %>%
rename(Compactness=X1,
Surface.Area =X2,
Wall.Area =X3,
Roof.Area = X4,
Height = X5,
Orientation = X6,
Glazing.Area =X7,
Glazing.Dist =X8,
Heating.Load =Y1,
Cooling.Load= Y2)
Feature Selection
Since we will only predict Heating.Load, we remove the Cooling.Load column.
energy <- energy %>%
dplyr::select(-Cooling.Load)
Check missing values
Check for missing values with colSums() and is.na(). There are no missing values in this data set.
summary3 <- as.data.frame(colSums(is.na(energy)))
rmarkdown::paged_table(summary3)
Data Frame Exploration
Correlation
Check the correlation between the target variable and the predictor variables.
ggcorr(energy, label = T, label_size = 3, hjust = 1)
Insight :
- Height, Compactness, Glazing.Area, Glazing.Dist and Wall.Area have a positive correlation with Heating.Load.
- Roof.Area and Surface.Area have a negative correlation with Heating.Load.
- Surface.Area and Compactness, Height and Compactness have a significant positive correlation.
- There is a significant negative correlation between Compactness and Roof.Area.
Checking for outliers
We want to know whether we are dealing with outliers. We will create boxplots for all the variables in this dataset.
Reshape the dataset into long format using pivot_longer().
energy.long <- pivot_longer(data = energy,
cols = c(Height, Compactness, Glazing.Area, Glazing.Dist, Wall.Area, Orientation, Surface.Area, Roof.Area, Heating.Load))
Create boxplot for all variables.
energy.long %>%
ggplot(aes(x = value, fill=name)) +
geom_boxplot(color="black")+
facet_wrap(~name, scales = "free")+
theme_bw()+
scale_fill_manual(values = wes_palette(21, name = "Darjeeling1", type = "continuous"))+
labs(title="Boxplot For Each Variables",
y="",
x="Values")+
theme_set(theme_minimal() + theme(text = element_text(family="Arial Narrow"))) +
theme(plot.title = element_text(size= 17, color = 'black'),
axis.title.x = element_text(size=12, color = 'black'),
axis.title.y = element_text(size = 12, color = 'black'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "black"),
legend.position = "")+
theme(strip.background =element_rect(fill="black"))+
theme(strip.text = element_text(colour = 'white'))
Insight:
- No outliers are found in this dataset.
- The target variable (heating load) is distributed between 12 and 32, with a median of approximately 19.
- Orientation and Height look symmetrically distributed. The distributions of Compactness, Heating.Load and Wall.Area are right-skewed, while the other variables are left-skewed.
Linear Regression
In this section, we create linear regression models with a different number of predictors.
- First part: we fit a linear regression with heating load as the target and all the other variables as predictors. We name the model model_energy_all.
- Second part: we fit a linear regression using only the predictors that have a significant p-value in model_energy_all. We call this model model_energy_few.
First Part (All Variables)
Multiple linear regression models the relationship between several predictor variables and a single target variable. We use the whole dataset to estimate the coefficients of the different factors.
- Create a model with the target variable (Heating.Load) and the other columns as predictors using lm().
- Assign it to a new object called model_energy_all.
- Inspect it with summary().
# Create a linear regression model
model_energy_all <- lm(Heating.Load~., energy)
# Summary of the model
summary(model_energy_all)
##
## Call:
## lm(formula = Heating.Load ~ ., data = energy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.8965 -1.3196 -0.0252 1.3532 7.7052
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.014521 19.033607 4.414 1.16e-05 ***
## Compactness -64.773991 10.289445 -6.295 5.19e-10 ***
## Surface.Area -0.087290 0.017075 -5.112 4.04e-07 ***
## Wall.Area 0.060813 0.006648 9.148 < 2e-16 ***
## Roof.Area NA NA NA NA
## Height 4.169939 0.337990 12.337 < 2e-16 ***
## Orientation -0.023328 0.094705 -0.246 0.80550
## Glazing.Area 19.932680 0.813986 24.488 < 2e-16 ***
## Glazing.Dist 0.203772 0.069918 2.914 0.00367 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.934 on 760 degrees of freedom
## Multiple R-squared: 0.9162, Adjusted R-squared: 0.9154
## F-statistic: 1187 on 7 and 760 DF, p-value: < 2.2e-16
Insight:
- The adjusted R-squared is 0.9154, implying that the predictors explain 91.54% of the variation in the target variable; the rest is explained by variables not included in the model.
- Compactness, Surface.Area, Wall.Area, Height, Glazing.Area and Glazing.Dist have significant p-values, meaning these variables tend to have a strong association with the heating load.
Second Part (Few Variables)
In this part, we create a multiple linear regression using only the predictor variables that have a significant p-value, as described in the insight above. Then we assign it to a new object called model_energy_few.
# Create a linear regression model
model_energy_few <- lm(Heating.Load~Compactness+Surface.Area+Wall.Area+Height+Glazing.Area+Glazing.Dist, energy)
# Summary of the model
summary(model_energy_few)
##
## Call:
## lm(formula = Heating.Load ~ Compactness + Surface.Area + Wall.Area +
## Height + Glazing.Area + Glazing.Dist, data = energy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9315 -1.3189 -0.0263 1.3586 7.7169
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.932873 19.018972 4.413 1.17e-05 ***
## Compactness -64.773991 10.283093 -6.299 5.06e-10 ***
## Surface.Area -0.087290 0.017065 -5.115 3.97e-07 ***
## Wall.Area 0.060813 0.006644 9.153 < 2e-16 ***
## Height 4.169939 0.337781 12.345 < 2e-16 ***
## Glazing.Area 19.932680 0.813483 24.503 < 2e-16 ***
## Glazing.Dist 0.203772 0.069875 2.916 0.00365 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.933 on 761 degrees of freedom
## Multiple R-squared: 0.9162, Adjusted R-squared: 0.9155
## F-statistic: 1387 on 6 and 761 DF, p-value: < 2.2e-16
Insight:
- The adjusted R-squared is 0.9155, implying that the predictors explain 91.55% of the variation in the target variable; the rest is explained by variables not included in the model.
- All variables in the model have a significant p-value.
Model Evaluation
Compare the two models based on adjusted R-squared and error. As sketched below, we use the Mean Absolute Error (MAE) because the dataset has no outliers.
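The comparison can be assembled along the following lines (a minimal sketch; the exact table in the original report may have been built differently, and MAE is computed here on the training data with MLmetrics::MAE()):
# Adjusted R-squared and MAE of both models (training data)
data.frame(
model = c("model_energy_all", "model_energy_few"),
adj.r.squared = c(summary(model_energy_all)$adj.r.squared,
summary(model_energy_few)$adj.r.squared),
MAE = c(MLmetrics::MAE(y_pred = predict(model_energy_all, energy), y_true = energy$Heating.Load),
MLmetrics::MAE(y_pred = predict(model_energy_few, energy), y_true = energy$Heating.Load))
)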
Insight: Both models have very similar adjusted R-squared and MAE values. In this case, I would choose the model with fewer predictors, since it is simpler.
Stepwise Regression
In this section, we would like to compare three methods of stepwise regression (backward, forward and both). Stepwise regression is the step-by-step iterative construction of a regression model that involves the selection of independent variables to be used in a final model.
Backward Method
Backward regression involves removing potential variables in succession and testing for statistical significance after each iteration until the lowest AIC is achieved.
# Create backward model
model_energy_backward <- stepAIC(object=model_energy_all,
direction="backward",
trace = 0)
# Summary of model
summary(model_energy_backward)
##
## Call:
## lm(formula = Heating.Load ~ Compactness + Surface.Area + Wall.Area +
## Height + Glazing.Area + Glazing.Dist, data = energy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9315 -1.3189 -0.0263 1.3586 7.7169
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.932873 19.018972 4.413 1.17e-05 ***
## Compactness -64.773991 10.283093 -6.299 5.06e-10 ***
## Surface.Area -0.087290 0.017065 -5.115 3.97e-07 ***
## Wall.Area 0.060813 0.006644 9.153 < 2e-16 ***
## Height 4.169939 0.337781 12.345 < 2e-16 ***
## Glazing.Area 19.932680 0.813483 24.503 < 2e-16 ***
## Glazing.Dist 0.203772 0.069875 2.916 0.00365 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.933 on 761 degrees of freedom
## Multiple R-squared: 0.9162, Adjusted R-squared: 0.9155
## F-statistic: 1387 on 6 and 761 DF, p-value: < 2.2e-16
Forward Method
Forward regression involves adding potential variables in succession and testing for statistical significance after each iteration until the lowest AIC is achieved.
# Create a model without the variable predictor
model_energy_non <- lm(Heating.Load~1, energy)
# Create forward model
model_energy_forward <- stepAIC(object=model_energy_non,
direction="forward",
scope=list(upper=model_energy_all,
lower=model_energy_non), trace=0)
# Summary of model
summary(model_energy_forward)
##
## Call:
## lm(formula = Heating.Load ~ Height + Glazing.Area + Wall.Area +
## Compactness + Surface.Area + Glazing.Dist, data = energy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9315 -1.3189 -0.0263 1.3586 7.7169
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.932873 19.018972 4.413 1.17e-05 ***
## Height 4.169939 0.337781 12.345 < 2e-16 ***
## Glazing.Area 19.932680 0.813483 24.503 < 2e-16 ***
## Wall.Area 0.060813 0.006644 9.153 < 2e-16 ***
## Compactness -64.773991 10.283093 -6.299 5.06e-10 ***
## Surface.Area -0.087290 0.017065 -5.115 3.97e-07 ***
## Glazing.Dist 0.203772 0.069875 2.916 0.00365 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.933 on 761 degrees of freedom
## Multiple R-squared: 0.9162, Adjusted R-squared: 0.9155
## F-statistic: 1387 on 6 and 761 DF, p-value: < 2.2e-16
Both Method
The "both" (bidirectional) stepwise regression involves adding and removing potential variables in succession and testing for statistical significance after each iteration until the lowest AIC is achieved.
# Create a both method model
model_energy_both<- stepAIC(object=model_energy_non,
direction="both",
scope=list(upper=model_energy_all,
lower=model_energy_non),trace=0)
# Summary of model
summary(model_energy_both)
##
## Call:
## lm(formula = Heating.Load ~ Height + Glazing.Area + Wall.Area +
## Compactness + Surface.Area + Glazing.Dist, data = energy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9315 -1.3189 -0.0263 1.3586 7.7169
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.932873 19.018972 4.413 1.17e-05 ***
## Height 4.169939 0.337781 12.345 < 2e-16 ***
## Glazing.Area 19.932680 0.813483 24.503 < 2e-16 ***
## Wall.Area 0.060813 0.006644 9.153 < 2e-16 ***
## Compactness -64.773991 10.283093 -6.299 5.06e-10 ***
## Surface.Area -0.087290 0.017065 -5.115 3.97e-07 ***
## Glazing.Dist 0.203772 0.069875 2.916 0.00365 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.933 on 761 degrees of freedom
## Multiple R-squared: 0.9162, Adjusted R-squared: 0.9155
## F-statistic: 1387 on 6 and 761 DF, p-value: < 2.2e-16
Model Evaluation
compare_performance(model_energy_backward, model_energy_forward, model_energy_both)
## # Comparison of Model Performance Indices
##
## Name | Model | AIC | AIC weights | BIC | BIC weights | R2 | R2 (adj.) | RMSE | Sigma
## -------------------------------------------------------------------------------------------------------------------
## model_energy_backward | lm | 3840.974 | 0.333 | 3878.125 | 0.333 | 0.916 | 0.916 | 2.919 | 2.933
## model_energy_forward | lm | 3840.974 | 0.333 | 3878.125 | 0.333 | 0.916 | 0.916 | 2.919 | 2.933
## model_energy_both | lm | 3840.974 | 0.333 | 3878.125 | 0.333 | 0.916 | 0.916 | 2.919 | 2.933
Insight:
- All models show similar adjusted R-squared, AIC and RMSE values.
- The adjusted R-squared of 91.6% is quite high, implying that the predictor variables explain the heating load (the target variable) well.
- The RMSE is 2.919, which is relatively small compared to the range of the heating load (roughly 0 to 45).
- Since the R-squared is quite high and the RMSE is relatively small, we can consider the model reasonably good. In this case, we will use model_energy_backward as our model. However, we should still test it against the linear model assumptions. Let's move on.
Model Interpretation
The interpretation of model_energy_backward is:
Heating.Load = 83.93 - 64.77 * Compactness - 0.087 * Surface.Area + 0.06 * Wall.Area + 4.17 * Height + 19.93 * Glazing.Area + 0.20 * Glazing.Dist
- A one-unit increase in compactness reduces the heating load by 64.77, holding the other predictors constant.
- A one-unit increase in surface area reduces the heating load by 0.087.
- A one-unit increase in wall area increases the heating load by 0.06.
- A one-unit increase in height increases the heating load by 4.17.
- A one-unit increase in glazing area increases the heating load by 19.93.
- A one-unit increase in glazing distribution increases the heating load by 0.20.
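As a quick sanity check, the equation above can be reproduced directly from the fitted coefficients; the building values below are purely hypothetical and only illustrate how the terms combine.
# Coefficients of the backward model
coef(model_energy_backward)
# A hypothetical building (values chosen only for illustration)
new_building <- data.frame(Compactness = 0.9, Surface.Area = 563.5, Wall.Area = 318.5,
Height = 7, Glazing.Area = 0.25, Glazing.Dist = 3)
# Manual prediction: intercept plus coefficient * value for each predictor
sum(coef(model_energy_backward) * c(1, unlist(new_building)))
# Should match predict() on the same data
predict(model_energy_backward, newdata = new_building)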
Assumption
The following are assumptions of linear regression:
- There must be a linear relation between target and predictor variables (linearity)
- Absence of heteroscedasticity (Homoscedasticity)
- Error terms should be normally distributed with mean 0 and constant variance (Normality of Residual)
- Absence of multicollinearity and auto-correlation (Multicollinearity)
Linearity
The linearity assumption states that the relationship between the predictor variables and the target variable is linear. We use cor.test() to check the p-value for every predictor in model_energy_backward. The p-value should be less than alpha (p-value < 0.05) so that we reject \(H_0\). Linearity hypothesis test:
\[ H_0: correlation\ is\ not\ significant\\ H_1: correlation\ is\ significant \]
cor.test(energy$Compactness, energy$Heating.Load)
##
## Pearson's product-moment correlation
##
## data: energy$Compactness and energy$Heating.Load
## t = 22.001, df = 766, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5769254 0.6637948
## sample estimates:
## cor
## 0.6222722
cor.test(energy$Surface.Area, energy$Heating.Load)
##
## Pearson's product-moment correlation
##
## data: energy$Surface.Area and energy$Heating.Load
## t = -24.192, df = 766, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6964395 -0.6160586
## sample estimates:
## cor
## -0.6581202
cor.test(energy$Wall.Area, energy$Heating.Load)
##
## Pearson's product-moment correlation
##
## data: energy$Wall.Area and energy$Heating.Load
## t = 14.168, df = 766, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3977488 0.5099758
## sample estimates:
## cor
## 0.4556712
cor.test(energy$Height, energy$Heating.Load)
##
## Pearson's product-moment correlation
##
## data: energy$Height and energy$Heating.Load
## t = 53.857, df = 766, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8736589 0.9033352
## sample estimates:
## cor
## 0.8894307
cor.test(energy$Glazing.Area, energy$Heating.Load)
##
## Pearson's product-moment correlation
##
## data: energy$Glazing.Area and energy$Heating.Load
## t = 7.756, df = 766, p-value = 2.796e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2029714 0.3342054
## sample estimates:
## cor
## 0.269841
cor.test(energy$Glazing.Dist, energy$Heating.Load)
##
## Pearson's product-moment correlation
##
## data: energy$Glazing.Dist and energy$Heating.Load
## t = 2.4273, df = 766, p-value = 0.01544
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.0167267 0.1571406
## sample estimates:
## cor
## 0.08736759
Insight: Since the p-value for every variable in the backward model is less than alpha (0.05), we can assume there is a linear relationship between the predictors and the outcome variable.
Normality of Residual
Residuals of the model should be normally distributed and centred around zero. To check this assumption, we can use a histogram of the residuals (sketched after the hypotheses below) and a statistical test with shapiro.test(). The expectation is a p-value higher than alpha (p-value > 0.05), so that we fail to reject \(H_0\).
\[ H_0: error\ is\ normally\ distributed\\ H_1: error\ is\ not\ normally\ distributed\\ \]
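A quick histogram of the residuals, mentioned above but not shown in the original output, can be drawn with base R (a minimal sketch):
# Histogram of the residuals of the backward model
hist(model_energy_backward$residuals, breaks = 30,
main = "Residuals of model_energy_backward", xlab = "Residual")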
shapiro.test(model_energy_backward$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_energy_backward$residuals
## W = 0.9542, p-value = 1.006e-14
Insight: Since the p-value is lower than 0.05, we conclude that the residuals of the model are not normally distributed. This means the model violates the residual normality assumption.
Homoscedasticity of Residual
Based on Investopedia, homoskedasticity (also spelled "homoscedasticity") is a condition in which the variance of the residuals of a regression model is constant: the error term does not vary much as the value of the predictor variable changes. To check homoscedasticity, we can use a scatter plot of the model residuals against the fitted values, together with the Breusch-Pagan statistical test. The expectation is a p-value higher than alpha (p-value > 0.05), so that we fail to reject \(H_0\).
\[ H_0: Error\ variance\ is\ constant\ (Homoscedasticity)\\ H_1: Error\ variance\ is\ not\ constant\ (Heteroscedasticity) \]
plot(x = model_energy_backward$fitted.values, y = model_energy_backward$residuals)
abline(h = 0, col = "red", lty = 2)
bptest(model_energy_backward)
##
## studentized Breusch-Pagan test
##
## data: model_energy_backward
## BP = 308.91, df = 6, p-value < 2.2e-16
Insight: Since the p-value is less than 0.05, there is heteroscedasticity in the residuals of the model. This means the model violates the homoscedasticity assumption.
Multicollinearity
Multicollinearity is the occurrence of a high correlation between two or more predictor variables in a multiple regression model. Linear regression assumes that multicollinearity is absent from the dataset because it makes it difficult to rank variables by importance. It can be detected with the VIF (Variance Inflation Factor): we expect values below 10, indicating no multicollinearity in the model.
vif(model_energy_backward)
## Compactness Surface.Area Wall.Area Height Glazing.Area Glazing.Dist
## 105.524054 201.531134 7.492984 31.205474 1.047508 1.047508
Insight: Three variables (Compactness, Surface.Area and Height) have a VIF greater than 10. This means the model also violates the no-multicollinearity assumption.
Summary and Next Step
Based on these results, model_energy_backward violates three of the four assumptions of linear regression. In the next step, we will try to keep the linear regression model by handling these three violations.
Handling The Violation
Normality of Residual
In this section, we transform the target variable Heating.Load. The transformations are log, z-score and square root. The stepwise backward regression method is then used to obtain a linear regression model. The expectation is a p-value from shapiro.test() higher than alpha (0.05).
Log transformation
# Copy data frame energy and assign as new dataframe called `energy.trans`.
energy.trans <- energy
# Create a new column of heating containing transformation log of heating load.
energy.trans$Heating <- log(energy.trans$Heating.Load)
# Drop target variable
energy.trans <- energy.trans %>%
dplyr::select(-Heating.Load)
# Create a model with all the variables
model_log_all <- lm(Heating~., energy.trans)
# Create model linear regression with backward method.
model_log_backward <- stepAIC(object=model_log_all,
direction="backward",
trace=0)
# Summary of the model
summary(model_log_backward)
##
## Call:
## lm(formula = Heating ~ Surface.Area + Wall.Area + Height + Glazing.Area +
## Glazing.Dist, data = energy.trans)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35668 -0.06061 0.00220 0.06112 0.31164
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4428329 0.1346474 3.289 0.00105 **
## Surface.Area 0.0005337 0.0002073 2.574 0.01024 *
## Wall.Area 0.0018063 0.0002239 8.066 2.81e-15 ***
## Height 0.2547154 0.0106568 23.902 < 2e-16 ***
## Glazing.Area 1.0117131 0.0333801 30.309 < 2e-16 ***
## Glazing.Dist 0.0159934 0.0028672 5.578 3.38e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1203 on 762 degrees of freedom
## Multiple R-squared: 0.9367, Adjusted R-squared: 0.9363
## F-statistic: 2254 on 5 and 762 DF, p-value: < 2.2e-16
Z-score transformation
# Copy data frame energy and assign as new dataframe called `energy.trans1`.
energy.trans1 <- energy
# Create a new column of heating containing transformation z-score of heating load.
energy.trans1$Heating <- scale(energy.trans1$Heating.Load)
# Drop target variable
energy.trans1 <- energy.trans1 %>%
dplyr::select(-Heating.Load)
# Create model linear regression with all predictor.
model_scale_all <- lm(Heating~., energy.trans1)
# Create model linear regression with backward method.
model_scale_backward <- stepAIC(object=model_scale_all,
direction="backward",
trace=0)
# Summary of model
summary(model_scale_backward)
##
## Call:
## lm(formula = Heating ~ Compactness + Surface.Area + Wall.Area +
## Height + Glazing.Area + Glazing.Dist, data = energy.trans1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9843 -0.1307 -0.0026 0.1346 0.7648
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.1074804 1.8848962 3.240 0.00125 **
## Compactness -6.4194980 1.0191173 -6.299 5.06e-10 ***
## Surface.Area -0.0086510 0.0016912 -5.115 3.97e-07 ***
## Wall.Area 0.0060270 0.0006584 9.153 < 2e-16 ***
## Height 0.4132664 0.0334762 12.345 < 2e-16 ***
## Glazing.Area 1.9754503 0.0806212 24.503 < 2e-16 ***
## Glazing.Dist 0.0201950 0.0069250 2.916 0.00365 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2906 on 761 degrees of freedom
## Multiple R-squared: 0.9162, Adjusted R-squared: 0.9155
## F-statistic: 1387 on 6 and 761 DF, p-value: < 2.2e-16
Square Root
# Copy data frame energy and assign as new dataframe called `energy.trans2`.
energy.trans2 <- energy
# Create a new column of heating containing transformation square root of heating load.
energy.trans2$Heating <- sqrt(energy.trans2$Heating.Load)
# Drop the target variable
energy.trans2 <- energy.trans2 %>%
dplyr::select(-Heating.Load)
# Create model linear regression with all predictor
model_square_all <- lm(Heating~., energy.trans2)
# Create model linear regression with backward method.
model_square_backward <- stepAIC(object=model_square_all,
direction="backward",
trace=0)
# Summary of model
summary(model_square_backward)
##
## Call:
## lm(formula = Heating ~ Compactness + Surface.Area + Wall.Area +
## Height + Glazing.Area + Glazing.Dist, data = energy.trans2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.95619 -0.12757 0.00124 0.13235 0.76033
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.0105618 1.7870041 2.244 0.02510 *
## Compactness -2.8944136 0.9661894 -2.996 0.00283 **
## Surface.Area -0.0032031 0.0016034 -1.998 0.04611 *
## Wall.Area 0.0050696 0.0006242 8.121 1.86e-15 ***
## Height 0.5233942 0.0317376 16.491 < 2e-16 ***
## Glazing.Area 2.1824574 0.0764341 28.553 < 2e-16 ***
## Glazing.Dist 0.0277908 0.0065654 4.233 2.59e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2755 on 761 degrees of freedom
## Multiple R-squared: 0.9347, Adjusted R-squared: 0.9342
## F-statistic: 1817 on 6 and 761 DF, p-value: < 2.2e-16
Summary
A summary of the adjusted R-squared, RMSE and shapiro.test p-value for the different models.
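One way to build this summary is sketched below (note that the RMSE of each model is reported on its own, possibly transformed, target scale, so the values are not directly comparable; the table in the original report may have been produced differently):
# Compare adjusted R-squared, RMSE and shapiro.test p-value for each model
models <- list(backward = model_energy_backward,
log = model_log_backward,
zscore = model_scale_backward,
sqrt = model_square_backward)
data.frame(
adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared),
RMSE = sapply(models, function(m) sqrt(mean(residuals(m)^2))),
shapiro.p = sapply(models, function(m) shapiro.test(residuals(m))$p.value)
)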
Insight:
- None of the three transformations improves the p-value of the Shapiro test, so all models still violate residual normality.
- With the log transformation of the target variable, the adjusted R-squared increases from 91.55% to 93.63% and the RMSE decreases from 2.91 to 0.27 compared with the model without transformation (the backward model).
Homoscedasticity
We perform a z-score transformation (and, in the next subsection, a log transformation) on the predictor variables. The stepwise backward regression method is used to obtain a linear regression model. The expectation is a p-value from bptest() higher than alpha (0.05).
Z-score transformation
# Copy data frame energy and assign as new dataframe called `energy.trans3`.
energy.trans3 <- energy
# Create a new column of heating containing transformation z-score of heating load.
energy.trans3 <- energy.trans3 %>%
dplyr::select(-Heating.Load)
# Combine the columns
energy.trans3 <- as.data.frame(cbind(scale(energy.trans3), Heating=energy$Heating.Load))
# Create a linear regression with all predictors
model_hete_all <- lm(Heating~., energy.trans3)
# Create linear regression with backward method
model_hete_backward <- stepAIC(object=model_hete_all,
direction="backward",
trace=0)
# Summary model
summary(model_hete_backward)
##
## Call:
## lm(formula = Heating ~ Compactness + Surface.Area + Wall.Area +
## Height + Glazing.Area + Glazing.Dist, data = energy.trans3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9315 -1.3189 -0.0263 1.3586 7.7169
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.3072 0.1058 210.808 < 2e-16 ***
## Compactness -6.8516 1.0877 -6.299 5.06e-10 ***
## Surface.Area -7.6891 1.5032 -5.115 3.97e-07 ***
## Wall.Area 2.6531 0.2898 9.153 < 2e-16 ***
## Height 7.3021 0.5915 12.345 < 2e-16 ***
## Glazing.Area 2.6554 0.1084 24.503 < 2e-16 ***
## Glazing.Dist 0.3160 0.1084 2.916 0.00365 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.933 on 761 degrees of freedom
## Multiple R-squared: 0.9162, Adjusted R-squared: 0.9155
## F-statistic: 1387 on 6 and 761 DF, p-value: < 2.2e-16
bptest(model_hete_backward)
##
## studentized Breusch-Pagan test
##
## data: model_hete_backward
## BP = 308.91, df = 6, p-value < 2.2e-16
Log transformation
# Copy data frame energy and assign as new dataframe called `energy.trans7`.
energy.trans7 <- energy
# Create a new column of heating containing transformation log of heating load.
energy.trans7 <- energy.trans7 %>%
dplyr::select(-Heating.Load)
# Log transformation to all predictor and drop the target
energy.trans7<- energy.trans7 %>%
mutate(log.compact =log(Compactness)) %>%
mutate(log.surface=log(Surface.Area)) %>%
mutate(log.wall=log(Wall.Area)) %>%
mutate(log.height=log(Height)) %>%
mutate(log.ori=log(Orientation)) %>%
mutate(log.area=log(Glazing.Area+1)) %>%
mutate(log.dist=log(Glazing.Dist+1)) %>%
dplyr::select(-Compactness,-Surface.Area,-Wall.Area,-Height, -Orientation, -Glazing.Area, -Glazing.Dist)
# Combine column
energy.trans7 <- as.data.frame(cbind(energy.trans7, Heating=energy$Heating.Load))
# Create a linear regression model with all predictors
model_all_trans <- lm(Heating~., energy.trans7)
# Create a linear regression model with backward method
model_backward_trans <- stepAIC(object=model_all_trans,
direction="backward",
trace=0)
# Summary of model
summary(model_backward_trans)
##
## Call:
## lm(formula = Heating ~ Roof.Area + log.compact + log.surface +
## log.height + log.area + log.dist, data = energy.trans7)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.1176 -1.4324 -0.1597 1.2933 7.7824
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.869e+03 1.942e+02 -9.626 < 2e-16 ***
## Roof.Area -1.144e-01 1.094e-02 -10.456 < 2e-16 ***
## log.compact 2.501e+02 2.978e+01 8.397 2.23e-16 ***
## log.surface 2.972e+02 3.118e+01 9.533 < 2e-16 ***
## log.height 2.668e+01 1.251e+00 21.322 < 2e-16 ***
## log.area 2.394e+01 9.692e-01 24.696 < 2e-16 ***
## log.dist 8.965e-01 2.146e-01 4.177 3.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.781 on 761 degrees of freedom
## Multiple R-squared: 0.9246, Adjusted R-squared: 0.924
## F-statistic: 1556 on 6 and 761 DF, p-value: < 2.2e-16
bptest(model_backward_trans)
##
## studentized Breusch-Pagan test
##
## data: model_backward_trans
## BP = 305.72, df = 6, p-value < 2.2e-16
Summary
A summary of the adjusted R-squared, RMSE and bptest p-value for the different models.
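A similar sketch can be used here, this time collecting the bptest() p-values (again an assumption about how the original table was built):
# Compare adjusted R-squared, RMSE and bptest p-value for each model
models_h <- list(backward = model_energy_backward,
scaled.predictors = model_hete_backward,
log.predictors = model_backward_trans)
data.frame(
adj.r.squared = sapply(models_h, function(m) summary(m)$adj.r.squared),
RMSE = sapply(models_h, function(m) sqrt(mean(residuals(m)^2))),
bptest.p = sapply(models_h, function(m) bptest(m)$p.value)
)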
Insight:
- The transformations of the predictor variables do not improve the p-value of bptest, implying the transformed models still violate homoscedasticity.
- The log transformation of the predictors improves the adjusted R-squared and RMSE compared with the backward model. However, the log transformation of the target variable still gives the best model, with the highest adjusted R-squared and the lowest error.
Multicollinearity
In this section, we drop the variables identified as collinear (i.e. those with a high VIF). The expectation is that the remaining variables all have a VIF below 10.
# Copy data frame energy and assign as new dataframe called `energy.trans4`.
energy.trans4 <- energy
# Drop few predictors
energy.trans4 <- energy.trans4 %>%
dplyr::select(-Compactness, -Surface.Area, -Height)
# Create a linear regression model with all predictor
model_multi_all <- lm(Heating.Load~., energy.trans4)
# Create a linear regression model with backward
model_multi_backward <- stepAIC(object=model_multi_all,
direction="backward",
trace=0)
# Summary of model
summary(model_multi_backward)
##
## Call:
## lm(formula = Heating.Load ~ Wall.Area + Roof.Area + Glazing.Area +
## Glazing.Dist, data = energy.trans4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.0601 -2.0663 -0.4254 1.2488 11.3291
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.084407 1.354184 23.693 <2e-16 ***
## Wall.Area 0.051526 0.003254 15.835 <2e-16 ***
## Roof.Area -0.177986 0.003143 -56.630 <2e-16 ***
## Glazing.Area 19.932680 1.042939 19.112 <2e-16 ***
## Glazing.Dist 0.203772 0.089584 2.275 0.0232 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.76 on 763 degrees of freedom
## Multiple R-squared: 0.8619, Adjusted R-squared: 0.8612
## F-statistic: 1190 on 4 and 763 DF, p-value: < 2.2e-16
vif(model_multi_backward)
## Wall.Area Roof.Area Glazing.Area Glazing.Dist
## 1.093433 1.093433 1.047508 1.047508
Insight: After dropping the three variables, all VIF values are below 10. However, the p-values from the Shapiro test and bptest still show violations of normality and homoscedasticity. Note: I also tried a log transformation of the target using only the variables with a VIF below 10, but the result is similar: the model still fails the residual normality test.
Summary And Next Step
- Transformations of the target and predictor variables were applied to handle the violations of residual normality, homoscedasticity and multicollinearity. However, the transformations only resolve the multicollinearity violation, which implies that linear regression is not an 'appropriate' method for predicting our target variable (heating load).
- The linear regression models from all stepwise methods have similar adjusted R-squared values and the lowest RMSE.
- In the next section, we will try to find the most appropriate model to predict the heating load by comparing mean error and R-squared values.
Model Comparison
In this section, we compare different models (linear regression, ridge regression, decision tree, and random forest) to find the one with the lowest RMSE. The models are built with the train() function from the caret library.
Cross Validation
Split Train-Test Data
We split the dataset into train and test sets using sample(). The train dataset contains 80% of the energy data frame. The train data is used to fit the models, while the rest is used to test them.
# Lock random samples
RNGkind(sample.kind = "Rounding")
set.seed(100)
# Create train dataset with 80% of total row in dataframe and the rest is used to test the model.
insample <- sample(nrow(energy), nrow(energy)*0.8)
# Create energy_train as our train data
energy_train <- energy[insample,]
# Create energy_test as our test data
energy_test <- energy[-insample,]
K-Fold Cross-Validation
For all models we use K-fold cross-validation, which splits the dataset into K folds, each of which is used as a validation set at some point. To use it, we specify trainControl() with method = "repeatedcv". number = 5 means there are five folds (splits of the data). We assign it to a new object called control.
# Lock random samples
RNGkind(sample.kind = "Rounding")
set.seed(100)
control <- trainControl(method="repeatedcv", number=5, repeats=3)
Linear Regression
Modelling
We create a linear regression by using function train() with method = lm
# Lock random samples
RNGkind(sample.kind = "Rounding")
set.seed(100)
# Create model a linear regression with all predictors
model_lr <- train(Heating.Load ~.,
data = energy_train,
method = "lm",
trControl = control)
# Summary of model
model_lr
## Linear Regression
##
## 614 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 493, 492, 490, 491, 490, 490, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2.983904 0.9117096 2.105365
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Insight :
- The R-squared value on the training data is 91.2%, implying that the predictors explain 91.2% of the variation in the heating load (the target variable).
- The RMSE on the training data is 2.98, which is relatively small compared to the range of the heating load (roughly 0 to 45).
Prediction and Results
# Prediction of model
pred_lr <- predict(object=model_lr, newdata=energy_test)
# RMSE value of model prediction
rmse_lr <- MLmetrics::RMSE(y_pred=pred_lr, y_true=energy_test$Heating.Load)
# R-square value of model prediction
R_lr <- R2_Score(y_pred = pred_lr, y_true = energy_test$Heating.Load)
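The values quoted in the insight below can be displayed, for example, with glue() (already attached above):
# Display the test-set metrics
glue("RMSE (test): {round(rmse_lr, 2)}")
glue("R-squared (test): {round(R_lr, 4)}")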
Insight :
- The R-squared value on the test data is 92.90%, implying that the predictors explain 92.90% of the variation in the heating load.
- The RMSE on the test data is 2.78, which is relatively small compared to the range of the heating load (roughly 0 to 45).
Ridge Regression
Ridge regression is a technique for estimating the coefficients of multiple regression models in situations where the predictors are highly correlated. Since our variables suffer from multicollinearity, we can try this model.
Modelling
We create a ridge regression by using function train() with method ridge.
# Lock random samples
RNGkind(sample.kind = "Rounding")
set.seed(100)
# Create model a linear regression
model_rg <- train(Heating.Load ~.,
data = energy_train,
method = "ridge",
trControl = control)
# View of model
model_rg
## Ridge Regression
##
## 614 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 493, 492, 490, 491, 490, 490, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0e+00 2.98 0.912 2.11
## 1e-04 2.98 0.912 2.11
## 1e-01 3.25 0.897 2.35
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 1e-04.
Insight :
- The lambda parameter controls the amount of regularization applied to the model. The best lambda, which defines the best model, is 0.0001.
- The R-squared value on the training data is 91.2%, implying that the predictors explain 91.2% of the variation in the heating load.
- The RMSE on the training data is 2.98, which is relatively small compared to the range of the heating load (roughly 0 to 45).
Prediction and Results
# Prediction of model
pred_rg <- predict(object=model_rg, newdata=energy_test)
# RMSE value of model prediction
rmse_rg <- MLmetrics::RMSE(y_pred=pred_rg, y_true=energy_test$Heating.Load)
# R-square value of model prediction
R_rg <- R2_Score(y_pred = pred_rg, y_true = energy_test$Heating.Load)
Insight :
- The R-squared value on the test data is 92.90%, implying that the predictors explain 92.90% of the variation in the heating load.
- The RMSE on the test data is 2.78, which is relatively small compared to the range of the heating load (roughly 0 to 45).
Decision Tree
The decision tree is one of the most popular machine learning algorithms due to its simplicity and intelligibility. Decision trees can be used for prediction when the target variable is a class (classification) or a real number (regression).
Modelling
We create a decision tree by using function train() with method=rpart
# Lock random samples
RNGkind(sample.kind = "Rounding")
set.seed(100)
# Create model a linear regression
model_dt <- train(Heating.Load ~.,
data = energy_train,
method = "rpart",
trControl = control)
# View of model
model_dt
## CART
##
## 614 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 493, 492, 490, 491, 490, 490, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.0272 3.39 0.884 2.64
## 0.0853 4.30 0.814 3.29
## 0.7884 6.85 0.773 5.81
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.0272.
Insight :
- The complexity parameter (cp) controls the size of the decision tree: if an additional split does not improve the fit by at least cp, the tree stops growing. The optimum cp, which gives the best model, is 0.0272.
- The R-squared value on the training data is 88.40%, implying that the predictors explain 88.40% of the variation in the heating load.
- The RMSE on the training data is 3.39, which is relatively small compared to the range of the heating load (roughly 0 to 45).
Prediction and Results
# Prediction of model
pred_dt <- predict(object=model_dt, newdata=energy_test)
# RMSE value of model prediction
rmse_dt <- MLmetrics::RMSE(y_pred=pred_dt, y_true=energy_test$Heating.Load)
# R-square value of model prediction
R_dt <- R2_Score(y_pred = pred_dt, y_true = energy_test$Heating.Load)
Insight :
- The R-squared value on the test data is 88.20%, implying that the predictors explain 88.20% of the variation in the heating load.
- The RMSE on the test data is 3.59, which is relatively small compared to the range of the heating load (roughly 0 to 45).
Random Forest
Random forest consists of many decision trees that operate as an ensemble. Each tree produces a prediction; for classification the majority vote becomes the model prediction, while for regression the tree predictions are averaged. The idea behind random forest is bagging (bootstrap aggregation) combined with feature randomness when constructing each tree, which creates an uncorrelated forest of trees.
We create a random forest using train() with method = "rf" and save the model to an RDS file.
set.seed(100, sample.kind = "Rounding")
# Create a random forest model
#model_rf_energy <- train(Heating.Load ~ .,
# data = energy_train,
# method = "rf",
# trControl = control)
# Save the model into an RDS file
#saveRDS(model_rf_energy, "model_rf.RDS")
After running the model and saving it to model_rf.RDS, we can read the file using readRDS() and inspect the model.
# Read the saved RDS file and assign it to a new object called `model_rf`
model_rf <- readRDS("model_rf.RDS")
# View the model
model_rf
## Random Forest
##
## 614 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 493, 492, 490, 491, 490, 490, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.137 0.988 0.878
## 5 0.530 0.997 0.357
## 8 0.542 0.997 0.357
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
Insight :
- Based on the output, the model with mtry = 5 has the lowest RMSE when evaluated on the cross-validation resamples.
- The R-squared value on the training data is 99.70%, implying that the predictors explain 99.70% of the variation in the heating load.
- The RMSE on the training data is 0.53, which is very small compared to the range of the heating load (roughly 0 to 45).
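As an optional extra (not part of the original output), caret's varImp() can show which predictors drive the random forest:
# Variable importance of the random forest model
varImp(model_rf)
# Plot the importance scores
plot(varImp(model_rf))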
Prediction and Results
The prediction can be generated by using function predict() with energy_test as our newdata.
# Model prediction
pred_rf <-predict(object=model_rf, newdata=energy_test)
# RMSE of model prediction
rmse_rf <- MLmetrics::RMSE(y_pred = pred_rf, y_true=energy_test$Heating.Load)
# R-square value of model prediction
R_rf <- R2_Score(y_pred = pred_rf, y_true = energy_test$Heating.Load)
Insight :
- The R-squared value on the test data is 99.70%, implying that the predictors explain 99.70% of the variation in the heating load.
- The RMSE on the test data is 0.547, which is very small compared to the range of the heating load (roughly 0 to 45).
Comparison
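The comparison table referenced below can be assembled from the test-set metrics computed above (a minimal sketch; the original report may have formatted it differently):
# Test-set RMSE and R-squared for the four models
data.frame(
model = c("Linear Regression", "Ridge Regression", "Decision Tree", "Random Forest"),
RMSE = c(rmse_lr, rmse_rg, rmse_dt, rmse_rf),
R.squared = c(R_lr, R_rg, R_dt, R_rf)
)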
Insight: The table suggests that random forest is the most appropriate model to predict the heating load in this dataset: it produces the lowest mean error (RMSE) and the highest R-squared value.
Bonus : Quantile Regression
Introduction
Quantile regression analysis is an extension of linear regression, used when we are dealing with outliers or an absence of homoscedasticity in the dataset. Linear regression estimates the conditional mean of the target variable for given predictor values. Since the mean does not describe the whole distribution, modelling the mean alone does not give a complete picture of the relationship between target and predictors. In this case, we can use quantile regression to predict a given percentile (quantile) of the target for given predictor values. Thus, in this project, we can examine how the relationship between the heating load (target variable) and the predictors changes depending on which quantile we look at.
Equation of quantile regression:
\[ y_i = x_i^{\top}\beta_q + e_i \]
where \(\beta_q\) is the vector of unknown coefficients associated with the \(q\)-th quantile.
Perform Modelling
The modelling process follows the Econometrics Academy website.
We perform quantile regression with library(quantreg) and the rq() function. We now create quantile regression models with tau at 0.25 and 0.75, using only the predictor variables that have a significant correlation with the target variable.
# Create quantile regression model with tau=0.25 and assign as `model_reg_0.25`
library(quantreg)
model_reg_0.25 <- rq(Heating.Load~Height+Roof.Area+Wall.Area+Compactness, data=energy, tau=0.25)
# Observe summary of the model
summary(model_reg_0.25)
##
## Call: rq(formula = Heating.Load ~ Height + Roof.Area + Wall.Area +
## Compactness, tau = 0.25, data = energy)
##
## tau: [1] 0.25
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 134.78579 50.78331 145.54142
## Height 2.73083 0.98883 3.94604
## Roof.Area -0.25845 -0.27705 -0.13835
## Wall.Area -0.05840 -0.06687 0.00175
## Compactness -85.31579 -91.36102 -29.71930
# Create quantile regression model with tau=0.75 and assign as `model_reg_0.75`
model_reg_0.75 <- rq(Heating.Load~Height+Roof.Area+Wall.Area+Compactness, data=energy, tau=0.75)
# Observe summary of the model
summary(model_reg_0.75)
##
## Call: rq(formula = Heating.Load ~ Height + Roof.Area + Wall.Area +
## Compactness, tau = 0.75, data = energy)
##
## tau: [1] 0.75
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 65.05500 33.81359 126.16754
## Height 5.40893 4.13621 6.20316
## Roof.Area -0.12832 -0.21164 -0.06562
## Wall.Area -0.01423 -0.05336 -0.00198
## Compactness -53.75000 -91.62690 -39.92697
A summary of the quantile regression models with tau at 0.25 and 0.75 and the linear regression model (model_ad_backward).
Interpretation:
- At quantile 0.25, a one-unit increase in height increases the heating load by 2.73, while at quantile 0.75 it increases the heating load by 5.41. In other words, the effect of height is larger at higher heating loads (higher quantiles).
- At quantile 0.25, a one-unit increase in roof area reduces the heating load by 0.26, while at quantile 0.75 it reduces the heating load by 0.13. In other words, the negative effect of roof area is weaker at higher heating loads (higher quantiles).
- At quantile 0.25, a one-unit increase in wall area reduces the heating load by 0.058, while at quantile 0.75 it reduces the heating load by 0.014. In other words, the negative effect of wall area is weaker at higher heating loads (higher quantiles).
- At quantile 0.25, a one-unit increase in compactness reduces the heating load by 85.32, while at quantile 0.75 it reduces the heating load by 53.75. In other words, the negative effect of compactness is weaker at higher heating loads (higher quantiles).
Assumption
Here we check whether the models with different tau values have different coefficients. The expectation is a p-value less than alpha (p-value < 0.05), so that we reject \(H_0\). We use anova() to perform the test. The hypothesis test:
\[ H_0: the\ coefficients\ are\ not\ different\\ H_1: the\ coefficients\ are\ different \]
anova(model_reg_0.25, model_reg_0.75)
## Quantile Regression Analysis of Deviance Table
##
## Model: Heating.Load ~ Height + Roof.Area + Wall.Area + Compactness
## Joint Test of Equality of Slopes: tau in { 0.25 0.75 }
##
## Df Resid Df F value Pr(>F)
## 1 4 1532 43.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Insight: The p-value is significant, implying that the slopes (coefficients) differ between the two models.
Summary
Let’s create the plot to observe the different effects along the distribution of the target variables.
# Create a quantile regression model.
model_reg <- rq(Heating.Load~Height+Roof.Area+Wall.Area+Compactness, data=energy, tau=seq(0.1, 0.9, by=0.1))
# Plot the model
plot(summary(model_reg))
- The black line shows the estimated quantile regression coefficient, plotted as a line varying across the quantiles with its confidence interval (the light grey area).
- The red line is the estimated ordinary least squares (OLS) coefficient with its confidence interval.
- If the quantile regression coefficient is outside the OLS confidence interval, then there is a significant difference between OLS and quantile coefficient.
- The quantile coefficients for height are slightly different from the OLS coefficient, especially at the lower quantiles (0.1 and 0.5) and the higher quantiles (0.7 and 0.9).
- For the roof area, the differences only appear at the higher quantiles (0.8 and 0.9).
- In the other plots (intercept, wall area and compactness), the linear regression coefficient appears sufficient to describe the relationship between these variables and the target; the quantile coefficient estimates are not statistically different from the OLS estimate.
Conclusion
- Linear regression method has been performed to predict the heating load of the building.
- Height, wall area, surface area, compactness, glazing area, and glazing distribution have a significant influence on the target variable.
- The prediction of the heating load with linear regression shows that the stepwise model is at least as good as the plain lm() model, having the highest adjusted R-squared and the lowest error (RMSE).
- Linear regression is not the most appropriate method for this dataset because the model violates three assumptions (multicollinearity, normality of residuals and homoscedasticity).
- Transformations of the target and the predictors were applied; however, they do not remove the violations of the linear regression assumptions.
- We compared a total of four regression models: linear regression, ridge regression, decision tree and random forest. Random forest is the most appropriate model for predicting the heating load, with an RMSE of 0.547.
- The quantile regression model was used to examine how the relationship between the target and the predictors changes depending on which quantile we look at.
- Only the quantile coefficients for height and roof area differ from the estimated OLS coefficients; for the other variables, the linear regression coefficient is sufficient to describe the relationship between target and predictors.