Multiple Regression for Prediction of Energy Efficiency in Building Design

Introduction

As energy costs rise, interest has grown in using machine learning models to assess various aspects of energy efficiency in building design (Gupta et al., 2023). For companies seeking to expand operations, accurate predictions of energy efficiency improve budget planning by identifying design strategies that reduce building maintenance costs. In the residential sector, energy-efficient building design offers a competitive advantage because energy-efficient buildings reduce overall costs.

Salleh et al. (2023) investigated energy efficiency by using machine learning models to predict the indoor temperature of building types in Malaysia. The researchers used multiple regression models to analyze data comparing an energy-efficient clay brick residential building with a traditional concrete residential building in both hot and cold seasons. With the multiple regression model, they found that windows, walls, and building materials were the factors most predictive of indoor temperature (Salleh et al., 2023).

Simulated data and computer-generated building design variations can help determine energy efficiency and predict the future cost of maintaining comfortable indoor temperatures for people living or working in buildings. Tsanas and Xifara (2012) proposed a method of energy efficiency data analysis in building design that served as the model for this project. They used simulation to compare building design parameters as input variables and to identify the most important predictors of building efficiency. Using Ecotect, the researchers developed 12 building forms, each consisting of 18 elements, with identical building materials, volume, location, and use (residential). The simulated buildings varied in surface area, dimensions, and glazing area, for a total of 768 simulated buildings. Histograms were used during data exploration to examine the distributions, which revealed that the variables were not normally distributed. The researchers used both a regression model (iteratively reweighted least squares, IRLS) and Random Forest (RF) models to predict energy efficiency and found that RF outperformed IRLS. They attributed the improved performance to the ability of RF to model complex relationships, handle multicollinearity effectively, and accommodate non-normal distributions (Tsanas and Xifara, 2012).

The purpose of this project is to develop a multiple regression model that predicts energy efficiency using the open-source Energy Efficiency benchmark dataset available from the UCI repository (Tsanas and Xifara, 2012). Model performance will be optimized by addressing multicollinearity and by using performance metrics to tune parameters.

Set Up

install.packages("gtsummary", repos = "https://cloud.r-project.org")
install.packages("Metrics", repos = "https://cloud.r-project.org")
#Load packages
library(readxl)
## Warning: package 'readxl' was built under R version 4.5.1
library(tidyverse)
## Warning: package 'readr' was built under R version 4.5.1
## Warning: package 'purrr' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(readr)
library(boot)
library(ggplot2)
library(moments)
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.5.1
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(rsample)
## Warning: package 'rsample' was built under R version 4.5.2
library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.5.2
## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
## ✔ broom        1.0.11     ✔ tailor       0.1.0 
## ✔ dials        1.4.2      ✔ tune         2.0.1 
## ✔ infer        1.1.0      ✔ workflows    1.3.0 
## ✔ modeldata    1.5.1      ✔ workflowsets 1.1.1 
## ✔ parsnip      1.4.0      ✔ yardstick    1.3.2 
## ✔ recipes      1.3.1
## Warning: package 'broom' was built under R version 4.5.2
## Warning: package 'dials' was built under R version 4.5.2
## Warning: package 'modeldata' was built under R version 4.5.2
## Warning: package 'parsnip' was built under R version 4.5.2
## Warning: package 'tailor' was built under R version 4.5.2
## Warning: package 'tune' was built under R version 4.5.2
## Warning: package 'workflows' was built under R version 4.5.2
## Warning: package 'workflowsets' was built under R version 4.5.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard()        masks purrr::discard()
## ✖ dplyr::filter()          masks stats::filter()
## ✖ recipes::fixed()         masks stringr::fixed()
## ✖ kableExtra::group_rows() masks dplyr::group_rows()
## ✖ dplyr::lag()             masks stats::lag()
## ✖ yardstick::spec()        masks readr::spec()
## ✖ recipes::step()          masks stats::step()
library(gtsummary)
## Warning: package 'gtsummary' was built under R version 4.5.2
library(broom)
library(caret)
## Warning: package 'caret' was built under R version 4.5.1
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## 
## The following object is masked from 'package:boot':
## 
##     melanoma
## 
## 
## Attaching package: 'caret'
## 
## The following objects are masked from 'package:yardstick':
## 
##     precision, recall, sensitivity, specificity
## 
## The following object is masked from 'package:rsample':
## 
##     calibration
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(pROC)
## Warning: package 'pROC' was built under R version 4.5.1
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(Metrics)
## Warning: package 'Metrics' was built under R version 4.5.2
## 
## Attaching package: 'Metrics'
## 
## The following object is masked from 'package:pROC':
## 
##     auc
## 
## The following objects are masked from 'package:caret':
## 
##     precision, recall
## 
## The following objects are masked from 'package:yardstick':
## 
##     accuracy, mae, mape, mase, precision, recall, rmse, smape
library(carData)
## Warning: package 'carData' was built under R version 4.5.1
library(car)
## Warning: package 'car' was built under R version 4.5.1
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:boot':
## 
##     logit
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
#Set working directory and upload dataset
getwd()
## [1] "C:/Users/benke/Downloads/New folder (3)"
setwd("C:/Users/benke/Downloads/New folder (3)")
energy <- read_xlsx("ENB2012_data.xlsx")
head(energy)
## # A tibble: 6 × 10
##      X1    X2    X3    X4    X5    X6    X7    X8    Y1    Y2
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  0.98  514.  294   110.     7     2     0     0  15.6  21.3
## 2  0.98  514.  294   110.     7     3     0     0  15.6  21.3
## 3  0.98  514.  294   110.     7     4     0     0  15.6  21.3
## 4  0.98  514.  294   110.     7     5     0     0  15.6  21.3
## 5  0.9   564.  318.  122.     7     2     0     0  20.8  28.3
## 6  0.9   564.  318.  122.     7     3     0     0  21.5  25.4

Understanding the Data

#Rename variables for analysis
energy <- energy %>%
  rename(
    Relative_Compactness = X1,
    Surface_Area = X2,
    Wall_Area = X3,
    Roof_Area = X4,
    Overall_Height = X5,
    Orientation = X6,
    Glazing_Area = X7,
    Glazing_Area_Distribution = X8,
    Heating_Load = Y1,
    Cooling_Load = Y2
  )
head(energy)
## # A tibble: 6 × 10
##   Relative_Compactness Surface_Area Wall_Area Roof_Area Overall_Height
##                  <dbl>        <dbl>     <dbl>     <dbl>          <dbl>
## 1                 0.98         514.      294       110.              7
## 2                 0.98         514.      294       110.              7
## 3                 0.98         514.      294       110.              7
## 4                 0.98         514.      294       110.              7
## 5                 0.9          564.      318.      122.              7
## 6                 0.9          564.      318.      122.              7
## # ℹ 5 more variables: Orientation <dbl>, Glazing_Area <dbl>,
## #   Glazing_Area_Distribution <dbl>, Heating_Load <dbl>, Cooling_Load <dbl>
#View characteristics of the dataset
str(energy)
## tibble [768 × 10] (S3: tbl_df/tbl/data.frame)
##  $ Relative_Compactness     : num [1:768] 0.98 0.98 0.98 0.98 0.9 0.9 0.9 0.9 0.86 0.86 ...
##  $ Surface_Area             : num [1:768] 514 514 514 514 564 ...
##  $ Wall_Area                : num [1:768] 294 294 294 294 318 ...
##  $ Roof_Area                : num [1:768] 110 110 110 110 122 ...
##  $ Overall_Height           : num [1:768] 7 7 7 7 7 7 7 7 7 7 ...
##  $ Orientation              : num [1:768] 2 3 4 5 2 3 4 5 2 3 ...
##  $ Glazing_Area             : num [1:768] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Glazing_Area_Distribution: num [1:768] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Heating_Load             : num [1:768] 15.6 15.6 15.6 15.6 20.8 ...
##  $ Cooling_Load             : num [1:768] 21.3 21.3 21.3 21.3 28.3 ...
glimpse(energy)
## Rows: 768
## Columns: 10
## $ Relative_Compactness      <dbl> 0.98, 0.98, 0.98, 0.98, 0.90, 0.90, 0.90, 0.…
## $ Surface_Area              <dbl> 514.5, 514.5, 514.5, 514.5, 563.5, 563.5, 56…
## $ Wall_Area                 <dbl> 294.0, 294.0, 294.0, 294.0, 318.5, 318.5, 31…
## $ Roof_Area                 <dbl> 110.25, 110.25, 110.25, 110.25, 122.50, 122.…
## $ Overall_Height            <dbl> 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0,…
## $ Orientation               <dbl> 2, 3, 4, 5, 2, 3, 4, 5, 2, 3, 4, 5, 2, 3, 4,…
## $ Glazing_Area              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Glazing_Area_Distribution <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Heating_Load              <dbl> 15.55, 15.55, 15.55, 15.55, 20.84, 21.46, 20…
## $ Cooling_Load              <dbl> 21.33, 21.33, 21.33, 21.33, 28.28, 25.38, 25…
#View summary statistics of the dataset
summary(energy)
##  Relative_Compactness  Surface_Area     Wall_Area       Roof_Area    
##  Min.   :0.6200       Min.   :514.5   Min.   :245.0   Min.   :110.2  
##  1st Qu.:0.6825       1st Qu.:606.4   1st Qu.:294.0   1st Qu.:140.9  
##  Median :0.7500       Median :673.8   Median :318.5   Median :183.8  
##  Mean   :0.7642       Mean   :671.7   Mean   :318.5   Mean   :176.6  
##  3rd Qu.:0.8300       3rd Qu.:741.1   3rd Qu.:343.0   3rd Qu.:220.5  
##  Max.   :0.9800       Max.   :808.5   Max.   :416.5   Max.   :220.5  
##  Overall_Height  Orientation    Glazing_Area    Glazing_Area_Distribution
##  Min.   :3.50   Min.   :2.00   Min.   :0.0000   Min.   :0.000            
##  1st Qu.:3.50   1st Qu.:2.75   1st Qu.:0.1000   1st Qu.:1.750            
##  Median :5.25   Median :3.50   Median :0.2500   Median :3.000            
##  Mean   :5.25   Mean   :3.50   Mean   :0.2344   Mean   :2.812            
##  3rd Qu.:7.00   3rd Qu.:4.25   3rd Qu.:0.4000   3rd Qu.:4.000            
##  Max.   :7.00   Max.   :5.00   Max.   :0.4000   Max.   :5.000            
##   Heating_Load    Cooling_Load  
##  Min.   : 6.01   Min.   :10.90  
##  1st Qu.:12.99   1st Qu.:15.62  
##  Median :18.95   Median :22.08  
##  Mean   :22.31   Mean   :24.59  
##  3rd Qu.:31.67   3rd Qu.:33.13  
##  Max.   :43.10   Max.   :48.03

Data exploration and visualization were performed to increase familiarity with the variables, data types, and shape of the distributions. The dataset comprises eight predictor variables (Relative Compactness, Surface Area, Wall Area, Roof Area, Overall Height, Orientation, Glazing Area, and Glazing Area Distribution) and two response variables (Heating Load and Cooling Load), with 768 observations. All variables are stored as numeric (double) values, although several, such as Orientation and Glazing Area Distribution, take only a small set of discrete values. No missing values were detected. The summary table showed minimum and maximum values relatively close to the first and third quartiles; therefore, the values were judged to be within the expected range, with no outliers requiring further management.
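
The outlier judgment above can be made concrete with the common 1.5 × IQR rule. This is only a sketch: the rule is an assumption introduced here (the analysis itself compared the extremes with the quartiles by inspection), the helper name iqr_fences is hypothetical, and the numbers are the Heating Load values printed by summary(energy).

```r
# Hypothetical helper: lower and upper fences under the k * IQR rule
iqr_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Heating Load quartiles and extremes as printed by summary(energy)
fences  <- iqr_fences(q1 = 12.99, q3 = 31.67)
extreme <- c(min = 6.01, max = 43.10)

# TRUE would mean at least one extreme falls outside the fences
any_outliers <- any(extreme < fences["lower"] | extreme > fences["upper"])
fences
any_outliers
```

Under this rule both extremes of Heating Load fall inside the fences, which agrees with the conclusion that no outlier handling is needed for that variable. The same check can be repeated for the other columns.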

Histograms were created to view the distribution of each predictor variable. The histograms show a lack of normal distribution for four of the predictor variables used in the final model.

#Check dataset for missing values
na_energy <- sum(is.na(energy))
na_energy
## [1] 0
#Histogram visualization of each variable
#Relative Compactness
ggplot(energy, aes(x = Relative_Compactness)) +
  geom_histogram(binwidth = .1, color = "blue", fill = "darkred") +
  labs(title = "Histogram for Relative Compactness", y = "Frequency") +
  theme_gray()

#Histogram for Surface Area
ggplot(energy, aes(x = Surface_Area)) +
  geom_histogram(binwidth = 25, color = "blue", fill = "darkred") +
  labs(title = "Histogram for Surface Area", y = "Frequency") +
  theme_gray()

#Histogram for wall area
ggplot(energy, aes(x = Wall_Area)) +
  geom_histogram(binwidth = 25, color = "blue", fill = "darkred") +
  labs(title = "Histogram for Wall Area", y = "Frequency") +
  theme_gray()

#Histogram for roof area
ggplot(energy, aes(x = Roof_Area)) +
  geom_histogram(binwidth = 25, color = "blue", fill = "darkred") +
  labs(title = "Histogram for Roof Area", y = "Frequency") +
  theme_gray()

#Histogram for overall height
ggplot(energy, aes(x = Overall_Height)) +
  geom_histogram(binwidth = 1, color = "blue", fill = "darkred") +
  labs(title = "Histogram for Overall Height", y = "Frequency") +
  theme_gray()

#Histogram for orientation
ggplot(energy, aes(x = Orientation)) +
  geom_histogram(binwidth = .25, color = "blue", fill = "darkred") +
  labs(title = "Histogram for Orientation", y = "Frequency") +
  theme_gray()

#Histogram for glazing area
ggplot(energy, aes(x = Glazing_Area)) +
  geom_histogram(binwidth = .05, color = "blue", fill = "darkred") +
  labs(title = "Histogram for Glazing Area", y = "Frequency") +
  theme_gray()

#Histogram for glazing area distribution
ggplot(energy, aes(x = Glazing_Area_Distribution)) +
  geom_histogram(binwidth = .5, color = "blue", fill = "darkred") +
  labs(title = "Histogram for Glazing Area Distribution", y = "Frequency") +
  theme_gray()

Correlations

A pairs plot was also created to view relationships between variables. Additionally, because the literature on the relationship between predictor variables and building efficiency relies on the Spearman correlation (Tsanas and Xifara, 2012; Gupta et al., 2023), Spearman coefficients were computed alongside the Pearson coefficients. The Spearman rank correlation uses the ranks of the n pairs rather than the raw data to determine the correlation between variables and is useful when distributions lack normality.
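
The rank-based definition can be checked directly: the Spearman coefficient equals the Pearson coefficient computed on ranks. A minimal sketch using made-up toy vectors (not values from the dataset) with a monotone but nonlinear relationship and one tie:

```r
# Toy vectors (illustrative only): y grows monotonically but not linearly with x
x <- c(2, 4, 4, 7, 9, 12)
y <- c(1, 8, 8, 30, 90, 400)

# Spearman computed by hand: Pearson correlation of the (average) ranks
spearman_by_hand <- cor(rank(x), rank(y), method = "pearson")

# Built-in Spearman and Pearson for comparison
spearman_builtin <- cor(x, y, method = "spearman")
pearson_raw      <- cor(x, y, method = "pearson")

c(by_hand = spearman_by_hand, builtin = spearman_builtin, pearson = pearson_raw)
```

Because the ranks of x and y agree exactly, both Spearman values equal 1, while the Pearson coefficient on the raw values is lower because the relationship is not linear.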

#Create a pairs plot
pairs(energy)

The strength and direction of the relationships between variables were assessed through inspection of the correlation coefficients and a heat map matrix. A strong positive correlation can be seen between the response variables and Overall Height (Pearson r of about 0.89 for Heating Load and 0.90 for Cooling Load). Strong inverse correlations are evident between the response variables and Roof Area (about -0.86 for both). The heat map provides a visual representation of the correlations. In addition to the insights gained from the correlation table, it shows moderate positive relationships of Wall Area and Relative Compactness with both Heating Load and Cooling Load, and a moderate inverse relationship between Surface Area and both loads.
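
Since the project also aims to address multicollinearity, it can help to list the predictor pairs whose correlation exceeds a cutoff. A minimal sketch, assuming an illustrative cutoff of 0.8 (not a threshold stated in this analysis) and a hypothetical helper strong_pairs(); the small matrix re-enters Pearson coefficients from this section's correlation output, rounded to four decimals.

```r
# Hypothetical helper: predictor pairs with |r| at or above a threshold
strong_pairs <- function(cor_mat, threshold = 0.8) {
  # Look only at the upper triangle so each pair is reported once
  idx <- which(upper.tri(cor_mat) & abs(cor_mat) >= threshold, arr.ind = TRUE)
  data.frame(
    var1 = rownames(cor_mat)[idx[, "row"]],
    var2 = colnames(cor_mat)[idx[, "col"]],
    r    = cor_mat[idx],
    stringsAsFactors = FALSE
  )
}

# Five predictors with Pearson coefficients copied (rounded) from the output
vars <- c("Relative_Compactness", "Surface_Area", "Wall_Area",
          "Roof_Area", "Overall_Height")
cor_mat <- diag(5)
dimnames(cor_mat) <- list(vars, vars)
# Upper triangle is filled in column-major order
cor_mat[upper.tri(cor_mat)] <- c(-0.9919,
                                 -0.2038,  0.1955,
                                 -0.8688,  0.8807, -0.2923,
                                  0.8277, -0.8581,  0.2810, -0.9725)
# Mirror the upper triangle into the lower triangle
cor_mat[lower.tri(cor_mat)] <- t(cor_mat)[lower.tri(cor_mat)]

pairs_08 <- strong_pairs(cor_mat, threshold = 0.8)
pairs_09 <- strong_pairs(cor_mat, threshold = 0.9)
pairs_08
```

On the full matrix, the same helper could be called on the predictor block only, for example strong_pairs(cor_energy[1:8, 1:8]).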

#View correlations
cor_energy <- cor(energy)
cor_energy
##                           Relative_Compactness  Surface_Area  Wall_Area
## Relative_Compactness              1.000000e+00 -9.919015e-01 -0.2037817
## Surface_Area                     -9.919015e-01  1.000000e+00  0.1955016
## Wall_Area                        -2.037817e-01  1.955016e-01  1.0000000
## Roof_Area                        -8.688234e-01  8.807195e-01 -0.2923165
## Overall_Height                    8.277473e-01 -8.581477e-01  0.2809757
## Orientation                       0.000000e+00  0.000000e+00  0.0000000
## Glazing_Area                      7.617400e-20  4.664140e-20  0.0000000
## Glazing_Area_Distribution         0.000000e+00  0.000000e+00  0.0000000
## Heating_Load                      6.222719e-01 -6.581199e-01  0.4556714
## Cooling_Load                      6.343391e-01 -6.729989e-01  0.4271170
##                               Roof_Area Overall_Height  Orientation
## Relative_Compactness      -8.688234e-01      0.8277473  0.000000000
## Surface_Area               8.807195e-01     -0.8581477  0.000000000
## Wall_Area                 -2.923165e-01      0.2809757  0.000000000
## Roof_Area                  1.000000e+00     -0.9725122  0.000000000
## Overall_Height            -9.725122e-01      1.0000000  0.000000000
## Orientation                0.000000e+00      0.0000000  1.000000000
## Glazing_Area              -1.197187e-19      0.0000000  0.000000000
## Glazing_Area_Distribution  0.000000e+00      0.0000000  0.000000000
## Heating_Load              -8.618281e-01      0.8894305 -0.002586763
## Cooling_Load              -8.625466e-01      0.8957852  0.014289598
##                            Glazing_Area Glazing_Area_Distribution Heating_Load
## Relative_Compactness       7.617400e-20                0.00000000  0.622271936
## Surface_Area               4.664140e-20                0.00000000 -0.658119917
## Wall_Area                  0.000000e+00                0.00000000  0.455671365
## Roof_Area                 -1.197187e-19                0.00000000 -0.861828052
## Overall_Height             0.000000e+00                0.00000000  0.889430464
## Orientation                0.000000e+00                0.00000000 -0.002586763
## Glazing_Area               1.000000e+00                0.21296422  0.269841685
## Glazing_Area_Distribution  2.129642e-01                1.00000000  0.087368460
## Heating_Load               2.698417e-01                0.08736846  1.000000000
## Cooling_Load               2.075050e-01                0.05052512  0.975861739
##                           Cooling_Load
## Relative_Compactness        0.63433907
## Surface_Area               -0.67299893
## Wall_Area                   0.42711700
## Roof_Area                  -0.86254660
## Overall_Height              0.89578517
## Orientation                 0.01428960
## Glazing_Area                0.20750499
## Glazing_Area_Distribution   0.05052512
## Heating_Load                0.97586174
## Cooling_Load                1.00000000
#Compare with Spearman correlation
energy_sp_cor <- cor(energy, method = "spearman")
energy_sp_cor
##                           Relative_Compactness Surface_Area  Wall_Area
## Relative_Compactness                 1.0000000   -1.0000000 -0.2558053
## Surface_Area                        -1.0000000    1.0000000  0.2558053
## Wall_Area                           -0.2558053    0.2558053  1.0000000
## Roof_Area                           -0.8708862    0.8708862 -0.1934568
## Overall_Height                       0.8690482   -0.8690482  0.2207634
## Orientation                          0.0000000    0.0000000  0.0000000
## Glazing_Area                         0.0000000    0.0000000  0.0000000
## Glazing_Area_Distribution            0.0000000    0.0000000  0.0000000
## Heating_Load                         0.6221347   -0.6221347  0.4714576
## Cooling_Load                         0.6510195   -0.6510195  0.4159908
##                            Roof_Area Overall_Height  Orientation Glazing_Area
## Relative_Compactness      -0.8708862      0.8690482  0.000000000    0.0000000
## Surface_Area               0.8708862     -0.8690482  0.000000000    0.0000000
## Wall_Area                 -0.1934568      0.2207634  0.000000000    0.0000000
## Roof_Area                  1.0000000     -0.9370426  0.000000000    0.0000000
## Overall_Height            -0.9370426      1.0000000  0.000000000    0.0000000
## Orientation                0.0000000      0.0000000  1.000000000    0.0000000
## Glazing_Area               0.0000000      0.0000000  0.000000000    1.0000000
## Glazing_Area_Distribution  0.0000000      0.0000000  0.000000000    0.1875916
## Heating_Load              -0.8040270      0.8612826 -0.004168324    0.3228603
## Cooling_Load              -0.8031746      0.8648761  0.017605701    0.2889045
##                           Glazing_Area_Distribution Heating_Load Cooling_Load
## Relative_Compactness                     0.00000000  0.622134693    0.6510195
## Surface_Area                             0.00000000 -0.622134693   -0.6510195
## Wall_Area                                0.00000000  0.471457647    0.4159908
## Roof_Area                                0.00000000 -0.804026995   -0.8031746
## Overall_Height                           0.00000000  0.861282571    0.8648761
## Orientation                              0.00000000 -0.004168324    0.0176057
## Glazing_Area                             0.18759162  0.322860318    0.2889045
## Glazing_Area_Distribution                1.00000000  0.068343463    0.0464770
## Heating_Load                             0.06834346  1.000000000    0.9726893
## Cooling_Load                             0.04647700  0.972689315    1.0000000
#Create a table of Spearman Correlations for report
energy_sp_cor %>%
  kbl(caption = "Spearman Correlation") %>%
  kable_classic()
Spearman Correlation
Variable                   Relative_Compactness  Surface_Area   Wall_Area   Roof_Area  Overall_Height  Orientation  Glazing_Area  Glazing_Area_Distribution  Heating_Load  Cooling_Load
Relative_Compactness                  1.0000000    -1.0000000  -0.2558053  -0.8708862       0.8690482    0.0000000     0.0000000                  0.0000000     0.6221347     0.6510195
Surface_Area                         -1.0000000     1.0000000   0.2558053   0.8708862      -0.8690482    0.0000000     0.0000000                  0.0000000    -0.6221347    -0.6510195
Wall_Area                            -0.2558053     0.2558053   1.0000000  -0.1934568       0.2207634    0.0000000     0.0000000                  0.0000000     0.4714576     0.4159908
Roof_Area                            -0.8708862     0.8708862  -0.1934568   1.0000000      -0.9370426    0.0000000     0.0000000                  0.0000000    -0.8040270    -0.8031746
Overall_Height                        0.8690482    -0.8690482   0.2207634  -0.9370426       1.0000000    0.0000000     0.0000000                  0.0000000     0.8612826     0.8648761
Orientation                           0.0000000     0.0000000   0.0000000   0.0000000       0.0000000    1.0000000     0.0000000                  0.0000000    -0.0041683     0.0176057
Glazing_Area                          0.0000000     0.0000000   0.0000000   0.0000000       0.0000000    0.0000000     1.0000000                  0.1875916     0.3228603     0.2889045
Glazing_Area_Distribution             0.0000000     0.0000000   0.0000000   0.0000000       0.0000000    0.0000000     0.1875916                  1.0000000     0.0683435     0.0464770
Heating_Load                          0.6221347    -0.6221347   0.4714576  -0.8040270       0.8612826   -0.0041683     0.3228603                  0.0683435     1.0000000     0.9726893
Cooling_Load                          0.6510195    -0.6510195   0.4159908  -0.8031746       0.8648761    0.0176057     0.2889045                  0.0464770     0.9726893     1.0000000
#Create correlation heat map with Pearson correlation coefficients
ggcorrplot::ggcorrplot(cor_energy, method = "circle",
                       type = "lower",
                       lab = TRUE,
                       lab_size = 3,
                       outline.color = "white",
                       ggtheme = theme_gray)

#Create correlation heat map with Spearman correlation coefficients
ggcorrplot::ggcorrplot(energy_sp_cor, method = "circle",
                       type = "lower",
                       lab = TRUE,
                       lab_size = 3,
                       outline.color = "white",
                       ggtheme = theme_gray)

Data Visualization of Relationships Between Variables

Scatter plots with fitted regression lines were created for the predictors most strongly correlated with each response variable (Heating Load and Cooling Load) to visualize the relationships between the predictor and response variables. Figure 7 and Figure 8 demonstrate the inverse relationship between Surface Area and Heating Load and the positive relationship between Wall Area and Heating Load.

#Create scatter plots for highly correlated variables
ggplot(energy, aes(x = Overall_Height, y = Heating_Load)) +
  geom_point(color = "blue", size = 3) +
  geom_smooth(method = lm, se = FALSE, color = "darkred") +
  labs(title = "Overall Height versus Heating Load",
       x = "Overall Height",
       y = "Heating Load") +
  theme_gray()
## `geom_smooth()` using formula = 'y ~ x'

ggplot(energy, aes(x = Roof_Area, y = Heating_Load)) +
  geom_point(color = "blue", size = 3) +
  geom_smooth(method = lm, se = FALSE, color = "darkred") +
  labs(title = "Roof Area versus Heating Load",
       x = "Roof Area",
       y = "Heating Load") +
  theme_gray()
## `geom_smooth()` using formula = 'y ~ x'

ggplot(energy, aes(x = Relative_Compactness, y = Heating_Load)) +
  geom_point(color = "blue", size = 3) +
  geom_smooth(method = lm, se = FALSE, color = "darkred") +
  labs(title = "Relative Compactness versus Heating Load",
       x = "Relative Compactness",
       y = "Heating Load") +
  theme_gray()
## `geom_smooth()` using formula = 'y ~ x'

ggplot(energy, aes(x = Surface_Area, y = Heating_Load)) +
  geom_point(color = "blue", size = 3) +
  geom_smooth(method = lm, se = FALSE, color = "darkred") +
  labs(title = "Surface Area versus Heating Load",
       x = "Surface Area",
       y = "Heating Load") +
  theme_gray()
## `geom_smooth()` using formula = 'y ~ x'

ggplot(energy, aes(x = Wall_Area, y = Heating_Load)) +
  geom_point(color = "blue", size = 3) +
  geom_smooth(method = lm, se = FALSE, color = "darkred") +
  labs(title = "Wall Area versus Heating Load",
       x = "Wall Area",
       y = "Heating Load") +
  theme_gray()
## `geom_smooth()` using formula = 'y ~ x'

ggplot(energy, aes(x = Overall_Height, y = Cooling_Load)) +
  geom_point(color = "darkred", size = 3) +
  geom_smooth(method = lm, se = FALSE, color = "blue") +
  labs(title = "Overall Height versus Cooling Load",
       x = "Overall Height",
       y = "Cooling Load") +
  theme_gray()
## `geom_smooth()` using formula = 'y ~ x'

ggplot(energy, aes(x = Roof_Area, y = Cooling_Load)) +
  geom_point(color = "darkred", size = 3) +
  geom_smooth(method = lm, se = FALSE, color = "blue") +
  labs(title = "Roof Area versus Cooling Load",
       x = "Roof Area",
       y = "Cooling Load") +
  theme_gray()
## `geom_smooth()` using formula = 'y ~ x'

ggplot(energy, aes(x = Relative_Compactness, y = Cooling_Load)) +
  geom_point(color = "darkred", size = 3) +
  geom_smooth(method = lm, se = FALSE, color = "blue") +
  labs(title = "Relative Compactness versus Cooling Load",
       x = "Relative Compactness",
       y = "Cooling Load") +
  theme_gray()
## `geom_smooth()` using formula = 'y ~ x'

ggplot(energy, aes(x = Surface_Area, y = Cooling_Load)) +
  geom_point(color = "darkred", size = 3) +
  geom_smooth(method = lm, se = FALSE, color = "blue") +
  labs(title = "Surface Area versus Cooling Load",
       x = "Surface Area",
       y = "Cooling Load") +
  theme_gray()
## `geom_smooth()` using formula = 'y ~ x'

ggplot(energy, aes(x = Wall_Area, y = Cooling_Load)) +
  geom_point(color = "darkred", size = 3) +
  geom_smooth(method = lm, se = FALSE, color = "blue") +
  labs(title = "Wall Area versus Cooling Load",
       x = "Wall Area",
       y = "Cooling Load") +
  theme_gray()
## `geom_smooth()` using formula = 'y ~ x'

Preprocessing

Descriptive statistics were calculated to examine the shape of each distribution and to determine whether further preprocessing was required before model development. Table 2 presents the descriptive statistics. High variance was observed for Surface Area, Wall Area, and Roof Area. Additionally, value ranges differed across variables. Because of the high variance and the difference in value ranges across variables, min-max scaling was used to transform the data before modeling.
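
The Skewness and Kurtosis columns of the descriptive statistics are moment-based. As a sketch, assuming the definitions used by the moments package (population central moments, with kurtosis on Pearson's scale, where a normal distribution scores 3 rather than 0), the two statistics can be reproduced in base R. The helper name moment_stats is hypothetical, and the example vector is a made-up two-level variable split evenly between 3.5 and 7, which is how Overall Height appears to be distributed given its mean and median of 5.25.

```r
# Hypothetical helper: moment-based skewness and (Pearson) kurtosis
moment_stats <- function(x) {
  x  <- x[!is.na(x)]
  d  <- x - mean(x)      # deviations from the mean
  m2 <- mean(d^2)        # second central moment (population variance)
  m3 <- mean(d^3)        # third central moment
  m4 <- mean(d^4)        # fourth central moment
  c(skewness = m3 / m2^1.5, kurtosis = m4 / m2^2)
}

# Illustrative two-level variable, half at 3.5 and half at 7
two_level <- rep(c(3.5, 7), each = 384)
stats_two_level <- moment_stats(two_level)
stats_two_level
```

A symmetric two-level variable gives a skewness of 0 and a kurtosis of 1, the smallest value kurtosis can take on this scale. Values below 3 therefore indicate flatter, lighter-tailed distributions than the normal.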

#View statistics of dataset
#Create function to compute statistics 
compute_stats <- function(column, name) {
  if (is.numeric(column) || is.integer(column)) {
    data.frame(
      Variable = name,
      Mean = round(mean(column, na.rm = TRUE), 2),
      Median = round(median(column, na.rm = TRUE), 2),
      St.Deviation = round(sd(column, na.rm = TRUE), 2),
      Variance = round(var(column, na.rm = TRUE), 2),
      Range = round(diff(range(column, na.rm = TRUE)), 2),
      IQR = round(IQR(column, na.rm = TRUE), 2),
      Skewness = round(skewness(column, na.rm = TRUE), 2),
      Kurtosis = round(kurtosis(column, na.rm = TRUE), 2),
      stringsAsFactors = FALSE
    )
  } else {
    NULL
  }
}

descriptive_stats <- do.call(
  rbind,
  lapply(names(energy), function(col) compute_stats(energy[[col]], col))
)

descriptive_stats
##                     Variable   Mean Median St.Deviation Variance  Range    IQR
## 1       Relative_Compactness   0.76   0.75         0.11     0.01   0.36   0.15
## 2               Surface_Area 671.71 673.75        88.09  7759.16 294.00 134.75
## 3                  Wall_Area 318.50 318.50        43.63  1903.27 171.50  49.00
## 4                  Roof_Area 176.60 183.75        45.17  2039.96 110.25  79.62
## 5             Overall_Height   5.25   5.25         1.75     3.07   3.50   3.50
## 6                Orientation   3.50   3.50         1.12     1.25   3.00   1.50
## 7               Glazing_Area   0.23   0.25         0.13     0.02   0.40   0.30
## 8  Glazing_Area_Distribution   2.81   3.00         1.55     2.41   5.00   2.25
## 9               Heating_Load  22.31  18.95        10.09   101.81  37.09  18.68
## 10              Cooling_Load  24.59  22.08         9.51    90.50  37.13  17.51
##    Skewness Kurtosis
## 1      0.49     2.29
## 2     -0.12     1.94
## 3      0.53     3.11
## 4     -0.16     1.23
## 5      0.00     1.00
## 6      0.00     1.64
## 7     -0.06     1.67
## 8     -0.09     1.85
## 9      0.36     1.75
## 10     0.40     1.85
descriptive_stats %>%
  kbl(caption = "Descriptive Statistics before Min-Max Scaling") %>%
  kable_classic()
Descriptive Statistics before Min-Max Scaling
Variable                     Mean  Median  St.Deviation  Variance   Range     IQR  Skewness  Kurtosis
Relative_Compactness         0.76    0.75          0.11      0.01    0.36    0.15      0.49      2.29
Surface_Area               671.71  673.75         88.09   7759.16  294.00  134.75     -0.12      1.94
Wall_Area                  318.50  318.50         43.63   1903.27  171.50   49.00      0.53      3.11
Roof_Area                  176.60  183.75         45.17   2039.96  110.25   79.62     -0.16      1.23
Overall_Height               5.25    5.25          1.75      3.07    3.50    3.50      0.00      1.00
Orientation                  3.50    3.50          1.12      1.25    3.00    1.50      0.00      1.64
Glazing_Area                 0.23    0.25          0.13      0.02    0.40    0.30     -0.06      1.67
Glazing_Area_Distribution    2.81    3.00          1.55      2.41    5.00    2.25     -0.09      1.85
Heating_Load                22.31   18.95         10.09    101.81   37.09   18.68      0.36      1.75
Cooling_Load                24.59   22.08          9.51     90.50   37.13   17.51      0.40      1.85

The dataset was split into training and testing sets for validation, with 80% of the data used for training and the remaining 20% used for testing. Min-max scaling was then applied to the training and testing datasets separately.

#Split dataset for training and testing since normalization and scaling will be required
set.seed(123)
energy_split <- initial_split(energy, prop = .80)

train_energy <- training(energy_split)
test_energy <- testing(energy_split)

head(train_energy)
## # A tibble: 6 × 10
##   Relative_Compactness Surface_Area Wall_Area Roof_Area Overall_Height
##                  <dbl>        <dbl>     <dbl>     <dbl>          <dbl>
## 1                 0.71         710.      270.      220.            3.5
## 2                 0.71         710.      270.      220.            3.5
## 3                 0.69         735       294       220.            3.5
## 4                 0.62         808.      368.      220.            3.5
## 5                 0.98         514.      294       110.            7  
## 6                 0.76         662.      416.      122.            7  
## # ℹ 5 more variables: Orientation <dbl>, Glazing_Area <dbl>,
## #   Glazing_Area_Distribution <dbl>, Heating_Load <dbl>, Cooling_Load <dbl>
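An 80/20 split like the one above can also be sketched in base R. This is a minimal illustration on a made-up data frame (toy_energy and its values are placeholders, not the project data), and its random assignment will not reproduce the rows selected by initial_split().

```r
# Toy data frame standing in for the energy data (values are made up)
set.seed(123)
toy_energy <- data.frame(
  Overall_Height = rep(c(3.5, 7), each = 50),
  Heating_Load = runif(100, min = 6, max = 43)
)

# Draw 80% of the row indices at random for training
n <- nrow(toy_energy)
train_idx <- sample(seq_len(n), size = floor(0.80 * n))

# The remaining rows form the test set
toy_train <- toy_energy[train_idx, ]
toy_test <- toy_energy[-train_idx, ]

dim(toy_train)
dim(toy_test)
```

Sampling indices once and using the negated index for the test set guarantees that no row appears in both sets.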
#Apply Min-max scaling to train dataset
#Create the function for Min-max scaling
min_max_norm <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

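#Complete scaling. The whole data frame is passed in one call, so min() and max() are taken across all columns together rather than per column
scl_train_energy <- min_max_norm(train_energy)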
head(scl_train_energy)
##   Relative_Compactness Surface_Area Wall_Area Roof_Area Overall_Height
## 1         0.0008781694    0.8787879 0.3333333 0.2727273    0.004329004
## 2         0.0008781694    0.8787879 0.3333333 0.2727273    0.004329004
## 3         0.0008534323    0.9090909 0.3636364 0.2727273    0.004329004
## 4         0.0007668522    1.0000000 0.4545455 0.2727273    0.004329004
## 5         0.0012121212    0.6363636 0.3636364 0.1363636    0.008658009
## 6         0.0009400124    0.8181818 0.5151515 0.1515152    0.008658009
##   Orientation Glazing_Area Glazing_Area_Distribution Heating_Load Cooling_Load
## 1 0.004947434 0.0003092146               0.003710575   0.01496599   0.01925788
## 2 0.004947434 0.0003092146               0.004947434   0.01505257   0.01880025
## 3 0.004947434 0.0001236858               0.003710575   0.01372913   0.01814471
## 4 0.003710575 0.0003092146               0.006184292   0.01730365   0.01807050
## 5 0.004947434 0.0001236858               0.004947434   0.02973408   0.03238095
## 6 0.003710575 0.0001236858               0.002473717   0.04007421   0.04213976
#Scale test dataset with the same function, which uses the test set's own minimum and maximum
scl_test_energy <- min_max_norm(test_energy)
head(scl_test_energy)
##   Relative_Compactness Surface_Area Wall_Area Roof_Area Overall_Height
## 1         0.0012121212    0.6363636 0.3636364 0.1363636    0.008658009
## 2         0.0012121212    0.6363636 0.3636364 0.1363636    0.008658009
## 3         0.0010636982    0.7272727 0.3636364 0.1818182    0.008658009
## 4         0.0009771181    0.7878788 0.4242424 0.1818182    0.008658009
## 5         0.0009400124    0.8181818 0.5151515 0.1515152    0.008658009
## 6         0.0009152752    0.8484848 0.3030303 0.2727273    0.004329004
##   Orientation Glazing_Area Glazing_Area_Distribution Heating_Load Cooling_Load
## 1 0.002473717            0                         0  0.019233148   0.02638219
## 2 0.004947434            0                         0  0.019233148   0.02638219
## 3 0.002473717            0                         0  0.024118738   0.03376623
## 4 0.002473717            0                         0  0.035275201   0.04666667
## 5 0.003710575            0                         0  0.029598021   0.03670996
## 6 0.004947434            0                         0  0.007433519   0.01353123
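In the scaled output above, Surface_Area reaches 1.0 while Relative_Compactness stays near 0.001, which is what a single minimum and maximum shared by every column produces. A per-column variant, with the minimum and maximum learned from the training rows and reused for the test rows, could look like the following sketch. The helper names fit_min_max and apply_min_max are hypothetical, and the toy values are illustrative only.

```r
# Learn the per-column minimum and maximum from the training data only
fit_min_max <- function(df) {
  list(min = sapply(df, min), max = sapply(df, max))
}

# Apply previously learned parameters to any data frame with the same columns
apply_min_max <- function(df, params) {
  scaled <- sweep(as.matrix(df), 2, params$min, "-")
  scaled <- sweep(scaled, 2, params$max - params$min, "/")
  as.data.frame(scaled)
}

# Toy stand-ins for train_energy and test_energy (values are illustrative)
toy_train <- data.frame(
  Surface_Area = c(514.5, 661.5, 808.5),
  Overall_Height = c(3.5, 7, 3.5)
)
toy_test <- data.frame(Surface_Area = 588, Overall_Height = 7)

params <- fit_min_max(toy_train)
scl_toy_train <- apply_min_max(toy_train, params)
scl_toy_test <- apply_min_max(toy_test, params)

scl_toy_train
scl_toy_test
```

With this approach every training column spans 0 to 1, and the test rows are placed on the same scale as the training rows, so coefficients fitted on one set can be applied to the other.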
#Compute Spearman correlation with min-max scaled values in train dataset
scl_train_energy_sp_cor <- cor(scl_train_energy, method = "spearman")
scl_train_energy_sp_cor
##                           Relative_Compactness Surface_Area    Wall_Area
## Relative_Compactness                1.00000000  -1.00000000 -0.246176051
## Surface_Area                       -1.00000000   1.00000000  0.246176051
## Wall_Area                          -0.24617605   0.24617605  1.000000000
## Roof_Area                          -0.86733696   0.86733696 -0.209243544
## Overall_Height                      0.86899298  -0.86899298  0.235499636
## Orientation                         0.01875646  -0.01875646 -0.018446387
## Glazing_Area                        0.01617129  -0.01617129 -0.010571870
## Glazing_Area_Distribution          -0.03119438   0.03119438 -0.006798257
## Heating_Load                        0.61916118  -0.61916118  0.485809857
## Cooling_Load                        0.65166422  -0.65166422  0.423784914
##                              Roof_Area Overall_Height  Orientation Glazing_Area
## Relative_Compactness      -0.867336963    0.868992984  0.018756456   0.01617129
## Surface_Area               0.867336963   -0.868992984 -0.018756456  -0.01617129
## Wall_Area                 -0.209243544    0.235499636 -0.018446387  -0.01057187
## Roof_Area                  1.000000000   -0.939375027 -0.007250441  -0.01481988
## Overall_Height            -0.939375027    1.000000000  0.005694866   0.01210663
## Orientation               -0.007250441    0.005694866  1.000000000  -0.01620949
## Glazing_Area              -0.014819876    0.012106633 -0.016209492   1.00000000
## Glazing_Area_Distribution  0.027661046   -0.038278784 -0.005434524   0.16956722
## Heating_Load              -0.805011249    0.861893643 -0.011497569   0.31953009
## Cooling_Load              -0.804094282    0.865072194  0.011868623   0.29276062
##                           Glazing_Area_Distribution Heating_Load Cooling_Load
## Relative_Compactness                   -0.031194383   0.61916118   0.65166422
## Surface_Area                            0.031194383  -0.61916118  -0.65166422
## Wall_Area                              -0.006798257   0.48580986   0.42378491
## Roof_Area                               0.027661046  -0.80501125  -0.80409428
## Overall_Height                         -0.038278784   0.86189364   0.86507219
## Orientation                            -0.005434524  -0.01149757   0.01186862
## Glazing_Area                            0.169567215   0.31953009   0.29276062
## Glazing_Area_Distribution               1.000000000   0.03095670   0.02185045
## Heating_Load                            0.030956698   1.00000000   0.97465614
## Cooling_Load                            0.021850446   0.97465614   1.00000000
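Spearman correlation depends only on the ranks of the observations, so a rescaling that subtracts a constant and divides by a positive constant leaves it unchanged. The sketch below checks this on made-up vectors; min_max, x, and y are illustrative names and values, not taken from the project data.

```r
# Per-vector min-max scaling used only for this illustration
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Made-up predictor and response values
x <- c(0.62, 0.71, 0.76, 0.86, 0.98)
y <- c(12.9, 15.0, 40.1, 29.7, 24.2)

# Spearman correlation before and after scaling
raw_rho <- cor(x, y, method = "spearman")
scaled_rho <- cor(min_max(x), min_max(y), method = "spearman")

all.equal(raw_rho, scaled_rho)
```

Under this property, the correlation matrix computed on the scaled training data should match the one that would be obtained from the unscaled training data.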
#Review descriptive statistics of scaled values
descriptive_stats_2 <- do.call(
  rbind,
  lapply(names(energy), function(col) compute_stats(scl_train_energy[[col]], col))
)

descriptive_stats_2
##                     Variable Mean Median St.Deviation Variance Range  IQR
## 1       Relative_Compactness 0.00   0.00         0.00     0.00  0.00 0.00
## 2               Surface_Area 0.83   0.85         0.11     0.01  0.36 0.18
## 3                  Wall_Area 0.39   0.39         0.05     0.00  0.21 0.06
## 4                  Roof_Area 0.22   0.27         0.06     0.00  0.14 0.09
## 5             Overall_Height 0.01   0.00         0.00     0.00  0.00 0.00
## 6                Orientation 0.00   0.00         0.00     0.00  0.00 0.00
## 7               Glazing_Area 0.00   0.00         0.00     0.00  0.00 0.00
## 8  Glazing_Area_Distribution 0.00   0.00         0.00     0.00  0.01 0.00
## 9               Heating_Load 0.03   0.02         0.01     0.00  0.05 0.02
## 10              Cooling_Load 0.03   0.03         0.01     0.00  0.05 0.02
##    Skewness Kurtosis
## 1      0.52     2.36
## 2     -0.15     1.98
## 3      0.49     3.03
## 4     -0.19     1.23
## 5      0.03     1.00
## 6      0.01     1.65
## 7     -0.03     1.66
## 8     -0.14     1.87
## 9      0.37     1.74
## 10     0.41     1.86
descriptive_stats_2 %>%
  kbl(caption = "Descriptive Statistics After Min-Max Scaling") %>%
  kable_classic()
Descriptive Statistics After Min-Max Scaling
Variable Mean Median St.Deviation Variance Range IQR Skewness Kurtosis
Relative_Compactness 0.00 0.00 0.00 0.00 0.00 0.00 0.52 2.36
Surface_Area 0.83 0.85 0.11 0.01 0.36 0.18 -0.15 1.98
Wall_Area 0.39 0.39 0.05 0.00 0.21 0.06 0.49 3.03
Roof_Area 0.22 0.27 0.06 0.00 0.14 0.09 -0.19 1.23
Overall_Height 0.01 0.00 0.00 0.00 0.00 0.00 0.03 1.00
Orientation 0.00 0.00 0.00 0.00 0.00 0.00 0.01 1.65
Glazing_Area 0.00 0.00 0.00 0.00 0.00 0.00 -0.03 1.66
Glazing_Area_Distribution 0.00 0.00 0.00 0.00 0.01 0.00 -0.14 1.87
Heating_Load 0.03 0.02 0.01 0.00 0.05 0.02 0.37 1.74
Cooling_Load 0.03 0.03 0.01 0.00 0.05 0.02 0.41 1.86

Initial Model

The first model included all predictor variables for the response variable, Heating Load. In introducing this dataset, Tsanas and Xifara (2012) explained that all variables carry essential information. Because the literature retained every variable, all variables were included in the first model.

#The first model used all variables as discussed in reference literature
model1 <- lm(Heating_Load ~ Relative_Compactness + Surface_Area + Wall_Area + Roof_Area + Overall_Height + Orientation + Glazing_Area + Glazing_Area_Distribution, data = scl_train_energy)
summary(model1)
## 
## Call:
## lm(formula = Heating_Load ~ Relative_Compactness + Surface_Area + 
##     Wall_Area + Roof_Area + Overall_Height + Orientation + Glazing_Area + 
##     Glazing_Area_Distribution, data = scl_train_energy)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0125472 -0.0017543 -0.0000734  0.0016961  0.0092304 
## 
## Coefficients: (1 not defined because of singularities)
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 0.104433   0.026480   3.944 8.96e-05 ***
## Relative_Compactness      -65.831245  11.588400  -5.681 2.08e-08 ***
## Surface_Area               -0.086061   0.019207  -4.481 8.89e-06 ***
## Wall_Area                   0.058097   0.007524   7.722 4.76e-14 ***
## Roof_Area                         NA         NA      NA       NA    
## Overall_Height              4.323712   0.382928  11.291  < 2e-16 ***
## Orientation                -0.077189   0.107636  -0.717  0.47357    
## Glazing_Area               19.075102   0.919999  20.734  < 2e-16 ***
## Glazing_Area_Distribution   0.229850   0.079387   2.895  0.00392 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.003671 on 606 degrees of freedom
## Multiple R-squared:  0.916,  Adjusted R-squared:  0.915 
## F-statistic: 943.5 on 7 and 606 DF,  p-value: < 2.2e-16
anova(model1)
## Analysis of Variance Table
## 
## Response: Heating_Load
##                            Df   Sum Sq  Mean Sq   F value    Pr(>F)    
## Relative_Compactness        1 0.037522 0.037522 2783.6074 < 2.2e-16 ***
## Surface_Area                1 0.010380 0.010380  770.0220 < 2.2e-16 ***
## Wall_Area                   1 0.032993 0.032993 2447.6144 < 2.2e-16 ***
## Overall_Height              1 0.001642 0.001642  121.7934 < 2.2e-16 ***
## Orientation                 1 0.000016 0.000016    1.1656  0.280745    
## Glazing_Area                1 0.006364 0.006364  472.1384 < 2.2e-16 ***
## Glazing_Area_Distribution   1 0.000113 0.000113    8.3828  0.003924 ** 
## Residuals                 606 0.008169 0.000013                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

On initial inspection, the first model appeared to perform well, with an R-squared of 0.916 and an adjusted R-squared of 0.915, indicating that approximately 92% of the variance in Heating Load was explained by the variables included in the model. The p-value for the overall F-test indicates a statistically significant relationship between the predictors and the response variable. The coefficient for Roof Area was not estimated (NA) because of a singularity, which means Roof Area is an exact linear combination of other predictors in the model.
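The relationship between R-squared and adjusted R-squared can be checked by hand. The sketch below uses the standard adjustment formula with the Model 1 R-squared reported in this section (0.9159585); the function name adj_r_squared is hypothetical, and the training row count of 614 is inferred from the 606 residual degrees of freedom plus 7 estimated slopes and the intercept.

```r
# Adjusted R-squared penalises R-squared for the number of predictors p
adj_r_squared <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Model 1: 606 residual df + 7 estimated slopes + 1 intercept = 614 rows
model1_adj <- adj_r_squared(r2 = 0.9159585, n = 614, p = 7)
round(model1_adj, 4)
```

The result agrees with the adjusted R-squared of 0.915 reported for Model 1. With several hundred rows and only seven estimated slopes, the penalty is small.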

#Create table of model 1 results
model1_tbl <- tbl_regression(model1, intercept = TRUE)
model1_tbl
Characteristic              Beta     95% CI          p-value
(Intercept)                 0.10     0.05, 0.16      <0.001
Relative_Compactness        -66      -89, -43        <0.001
Surface_Area                -0.09    -0.12, -0.05    <0.001
Wall_Area                   0.06     0.04, 0.07      <0.001
Roof_Area
Overall_Height              4.3      3.6, 5.1        <0.001
Orientation                 -0.08    -0.29, 0.13     0.5
Glazing_Area                19       17, 21          <0.001
Glazing_Area_Distribution   0.23     0.07, 0.39      0.004
Abbreviation: CI = Confidence Interval
tidy_model1 <- tidy(model1)
print(tidy_model1)
## # A tibble: 9 × 5
##   term                      estimate std.error statistic   p.value
##   <chr>                        <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)                 0.104    0.0265      3.94   8.96e- 5
## 2 Relative_Compactness      -65.8     11.6        -5.68   2.08e- 8
## 3 Surface_Area               -0.0861   0.0192     -4.48   8.89e- 6
## 4 Wall_Area                   0.0581   0.00752     7.72   4.76e-14
## 5 Roof_Area                  NA       NA          NA     NA       
## 6 Overall_Height              4.32     0.383      11.3    5.78e-27
## 7 Orientation                -0.0772   0.108      -0.717  4.74e- 1
## 8 Glazing_Area               19.1      0.920      20.7    1.41e-72
## 9 Glazing_Area_Distribution   0.230    0.0794      2.90   3.92e- 3
model1_stats <- glance(model1) %>%
  transmute(
    'R-squared' = r.squared,
    'Adj. R-squared' = adj.r.squared,
    'F-statistic' = statistic,
    'F p-value' = p.value,
    'Model df' = df,
    'Residual df' = df.residual
  )

model1_stats
## # A tibble: 1 × 6
##   `R-squared` `Adj. R-squared` `F-statistic` `F p-value` `Model df`
##         <dbl>            <dbl>         <dbl>       <dbl>      <dbl>
## 1       0.916            0.915          944.   5.18e-321          7
## # ℹ 1 more variable: `Residual df` <int>
model1_stats %>%
  kbl(caption = "Model 1 Results") %>%
  kable_classic()
Model 1 Results
R-squared Adj. R-squared F-statistic F p-value Model df Residual df
0.9159585 0.9149877 943.532 0 7 606

Evaluate accuracy of model 1

The first model was then applied to the testing data, and the mean absolute error (MAE) was calculated to assess model performance.

#Make predictions on the testing data
energy_predict <- predict(model1, newdata = scl_test_energy)

#Evaluate performance metrics
mae_heating <- mae(actual = scl_test_energy$Heating_Load, predicted = energy_predict)
mae_heating
## [1] 0.002422581
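#energy_predict holds Heating_Load predictions from model1, so this value compares them with observed Cooling_Load rather than scoring a Cooling_Load model
mae_cooling <- mae(actual = scl_test_energy$Cooling_Load, predicted = energy_predict)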
mae_cooling
## [1] 0.003542343
mae_heating %>%
  kbl(caption = "MAE for Heating Load Prediction for Model 1") %>%
  kable_classic()
MAE for Heating Load Prediction for Model 1
x
0.0024226
mae_cooling %>%
  kbl(caption = "MAE in Prediction Cooling Load for Model 1") %>%
  kable_classic()
MAE in Prediction Cooling Load for Model 1
x
0.0035423
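The MAE itself is the mean of the absolute differences between observed and predicted values. The sketch below computes it in base R on made-up numbers; mae_manual is a hypothetical name, and the mae() function used above is assumed to return the same quantity.

```r
# Mean absolute error: average size of the prediction errors
mae_manual <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  mean(abs(actual - predicted))
}

# Made-up scaled observations and predictions
actual <- c(0.020, 0.035, 0.015, 0.030)
predicted <- c(0.022, 0.031, 0.018, 0.030)

toy_mae <- mae_manual(actual, predicted)
toy_mae
```

Because the models were fit on scaled values, the MAE values reported in this section are expressed in scaled units rather than in the original units of Heating Load and Cooling Load.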

Model Optimization

Although the R-squared statistics and MAE values indicated high performance for Model 1, an inspection for multicollinearity revealed that using all available variables led to high variance inflation factors (VIF) for Relative Compactness (102.3), Surface Area (195.8), and Overall Height (31.3). The VIF for Wall Area was moderately high at 7.723, above the commonly recommended threshold of 5 but below 10. Additionally, the residual analysis raises concerns about heteroscedasticity and non-normality of the residuals across the diagnostic plots.
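The variance inflation factor for a predictor is 1 / (1 - R-squared), where the R-squared comes from regressing that predictor on the remaining predictors. The sketch below computes it by hand on simulated data. The name manual_vif and the toy columns x1 to x3 are hypothetical, and the vif() function used in this report is assumed to return the same quantity for models with numeric predictors only.

```r
# Simulated predictors: x2 is nearly a copy of x1, x3 is unrelated
set.seed(42)
x1 <- rnorm(200)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(200, sd = 0.1),
  x3 = rnorm(200)
)

# Regress one predictor on the others and convert R-squared to a VIF
manual_vif <- function(data, predictor) {
  others <- setdiff(names(data), predictor)
  aux_fit <- lm(reformulate(others, response = predictor), data = data)
  1 / (1 - summary(aux_fit)$r.squared)
}

toy_vif <- sapply(names(toy), function(v) manual_vif(toy, v))
toy_vif
```

The two nearly duplicated columns receive large values while the unrelated column stays close to 1, which mirrors the pattern seen for the geometry variables compared with Orientation and the glazing variables.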

Model optimization was completed in a series of steps. For Model 2, the variables that did not show a strong correlation with Heating Load in the correlation matrix were removed (Orientation, Glazing Area, and Glazing Area Distribution). All summary results were lower than those of Model 1, with an R-squared of 0.849 and an adjusted R-squared of 0.848. Model 3 differed from Model 2 by adding the Orientation variable and removing Roof Area; Orientation was added because of reports in the literature that building orientation has high predictive value for energy efficiency (Salleh et al., 2023). This adjustment resulted in a reduced F-statistic of 685.4 (compared to 857.1 with Model 2) and essentially unchanged R-squared values (R-squared of 0.849 and adjusted R-squared of 0.848). Model 4 omitted all variables previously identified with high VIF values, which resulted in the lowest-performing model, with an R-squared of 0.0701, an adjusted R-squared of 0.0655, and an F-statistic of 15.33.

#The second model removed variables without a strong correlation with Heating Load: Orientation, Glazing Area, and Glazing Area Distribution
model2 <- lm(Heating_Load ~ Relative_Compactness + Surface_Area + Wall_Area + Roof_Area + Overall_Height, data = scl_train_energy)
summary(model2)
## 
## Call:
## lm(formula = Heating_Load ~ Relative_Compactness + Surface_Area + 
##     Wall_Area + Roof_Area + Overall_Height, data = scl_train_energy)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0189678 -0.0027887  0.0001727  0.0029858  0.0129976 
## 
## Coefficients: (1 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            0.11506    0.03538   3.252 0.001209 ** 
## Relative_Compactness -67.28411   15.48552  -4.345 1.63e-05 ***
## Surface_Area          -0.08999    0.02567  -3.506 0.000488 ***
## Wall_Area              0.05968    0.01005   5.936 4.90e-09 ***
## Roof_Area                   NA         NA      NA       NA    
## Overall_Height         4.22503    0.51164   8.258 9.24e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.004907 on 609 degrees of freedom
## Multiple R-squared:  0.8492, Adjusted R-squared:  0.8482 
## F-statistic: 857.1 on 4 and 609 DF,  p-value: < 2.2e-16
anova(model2)
## Analysis of Variance Table
## 
## Response: Heating_Load
##                       Df   Sum Sq  Mean Sq  F value    Pr(>F)    
## Relative_Compactness   1 0.037522 0.037522 1558.553 < 2.2e-16 ***
## Surface_Area           1 0.010380 0.010380  431.138 < 2.2e-16 ***
## Wall_Area              1 0.032993 0.032993 1370.429 < 2.2e-16 ***
## Overall_Height         1 0.001642 0.001642   68.193 9.245e-16 ***
## Residuals            609 0.014662 0.000024                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model2_stats <- glance(model2) %>%
  transmute(
    'R-squared' = r.squared,
    'Adj. R-squared' = adj.r.squared,
    'F-statistic' = statistic,
    'F p-value' = p.value,
    'Model df' = df,
    'Residual df' = df.residual
  )

model2_stats
## # A tibble: 1 × 6
##   `R-squared` `Adj. R-squared` `F-statistic` `F p-value` `Model df`
##         <dbl>            <dbl>         <dbl>       <dbl>      <dbl>
## 1       0.849            0.848          857.   1.88e-248          4
## # ℹ 1 more variable: `Residual df` <int>
#Make predictions on the testing data
energy_predict_2 <- predict(model2, newdata = scl_test_energy)

#Evaluate performance metrics
mae_heating_2 <- mae(actual = scl_test_energy$Heating_Load, predicted = energy_predict_2)
mae_heating_2
## [1] 0.003907062
mae_cooling_2 <- mae(actual = scl_test_energy$Cooling_Load, predicted = energy_predict_2)
mae_cooling_2
## [1] 0.004347915
#The third model added Orientation, which the literature discusses as an important variable for predicting energy efficiency, after performance fell when variables without an apparent correlation were removed. Roof Area was removed
model3 <- lm(Heating_Load ~ Relative_Compactness + Surface_Area + Wall_Area + Overall_Height + Orientation, data = scl_train_energy)
summary(model3)
## 
## Call:
## lm(formula = Heating_Load ~ Relative_Compactness + Surface_Area + 
##     Wall_Area + Overall_Height + Orientation, data = scl_train_energy)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0187505 -0.0026544  0.0001093  0.0029550  0.0129299 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            0.11536    0.03539   3.259  0.00118 ** 
## Relative_Compactness -67.16632   15.49063  -4.336 1.70e-05 ***
## Surface_Area          -0.08985    0.02567  -3.500  0.00050 ***
## Wall_Area              0.05961    0.01006   5.927 5.16e-09 ***
## Overall_Height         4.22596    0.51178   8.257 9.31e-16 ***
## Orientation           -0.11619    0.14387  -0.808  0.41963    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.004908 on 608 degrees of freedom
## Multiple R-squared:  0.8493, Adjusted R-squared:  0.8481 
## F-statistic: 685.4 on 5 and 608 DF,  p-value: < 2.2e-16
anova(model3)
## Analysis of Variance Table
## 
## Response: Heating_Load
##                       Df   Sum Sq  Mean Sq   F value    Pr(>F)    
## Relative_Compactness   1 0.037522 0.037522 1557.6625 < 2.2e-16 ***
## Surface_Area           1 0.010380 0.010380  430.8921 < 2.2e-16 ***
## Wall_Area              1 0.032993 0.032993 1369.6461 < 2.2e-16 ***
## Overall_Height         1 0.001642 0.001642   68.1537 9.437e-16 ***
## Orientation            1 0.000016 0.000016    0.6522    0.4196    
## Residuals            608 0.014646 0.000024                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model3_stats <- glance(model3) %>%
  transmute(
    'R-squared' = r.squared,
    'Adj. R-squared' = adj.r.squared,
    'F-statistic' = statistic,
    'F p-value' = p.value,
    'Model df' = df,
    'Residual df' = df.residual
  )

model3_stats
## # A tibble: 1 × 6
##   `R-squared` `Adj. R-squared` `F-statistic` `F p-value` `Model df`
##         <dbl>            <dbl>         <dbl>       <dbl>      <dbl>
## 1       0.849            0.848          685.   4.24e-247          5
## # ℹ 1 more variable: `Residual df` <int>
#Make predictions on the testing data
energy_predict_3 <- predict(model3, newdata = scl_test_energy)

#Evaluate performance metrics
mae_heating_3 <- mae(actual = scl_test_energy$Heating_Load, predicted = energy_predict_3)
mae_heating_3
## [1] 0.00391559
mae_cooling_3 <- mae(actual = scl_test_energy$Cooling_Load, predicted = energy_predict_3)
mae_cooling_3
## [1] 0.004362644
#Residual diagnostic plots for model 1
par(mfrow = c(2,2))
plot(model1)

#Residual diagnostic plots for model 2
par(mfrow = c(2,2))
plot(model2)

#Residual diagnostic plots for model 3
par(mfrow = c(2,2))
plot(model3)

alias(model1)
## Model :
## Heating_Load ~ Relative_Compactness + Surface_Area + Wall_Area + 
##     Roof_Area + Overall_Height + Orientation + Glazing_Area + 
##     Glazing_Area_Distribution
## 
## Complete :
##           (Intercept) Relative_Compactness Surface_Area Wall_Area
## Roof_Area    0           0                  1/2         -1/2     
##           Overall_Height Orientation Glazing_Area Glazing_Area_Distribution
## Roof_Area    0              0           0            0
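The alias() output above shows that Roof_Area equals 1/2 of Surface_Area minus 1/2 of Wall_Area, which is why lm() reports its coefficient as NA. The sketch below reproduces that behaviour on a small made-up data frame in which Roof_Area is constructed from the same relationship; all values, including the Heating_Load column, are illustrative.

```r
# Made-up rows; Roof_Area is built as an exact combination of two other columns
toy <- data.frame(
  Surface_Area = c(514.5, 661.5, 735, 808.5, 710.5, 563.5),
  Wall_Area = c(294, 416.5, 294, 367.5, 269.5, 318.5),
  Heating_Load = c(15.5, 28.5, 11.2, 12.9, 10.7, 24.6)
)
toy$Roof_Area <- (toy$Surface_Area - toy$Wall_Area) / 2

# lm() cannot separate the effect of Roof_Area from the other two areas
fit <- lm(Heating_Load ~ Surface_Area + Wall_Area + Roof_Area, data = toy)
coef(fit)
```

Because the dependency is exact, dropping Roof_Area from the formula leaves the fitted values unchanged.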
#Fourth model with the high-VIF variables removed to reduce multicollinearity. The overall F-test remains significant, but R-squared falls to about 0.07, so it is not considered a reliable model
model4 <- lm(Heating_Load ~ Orientation + Glazing_Area + Glazing_Area_Distribution, data = scl_train_energy)
summary(model4)
## 
## Call:
## lm(formula = Heating_Load ~ Orientation + Glazing_Area + Glazing_Area_Distribution, 
##     data = scl_train_energy)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.016483 -0.011358 -0.005197  0.009050  0.021860 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                0.022158   0.001991  11.128  < 2e-16 ***
## Orientation               -0.087430   0.356715  -0.245    0.806    
## Glazing_Area              20.294085   3.049271   6.655  6.3e-11 ***
## Glazing_Area_Distribution -0.020309   0.262903  -0.077    0.938    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01217 on 610 degrees of freedom
## Multiple R-squared:  0.07011,    Adjusted R-squared:  0.06554 
## F-statistic: 15.33 on 3 and 610 DF,  p-value: 1.256e-09
anova(model4)
## Analysis of Variance Table
## 
## Response: Heating_Load
##                            Df   Sum Sq   Mean Sq F value    Pr(>F)    
## Orientation                 1 0.000019 0.0000185  0.1249    0.7239    
## Glazing_Area                1 0.006795 0.0067951 45.8598 2.993e-11 ***
## Glazing_Area_Distribution   1 0.000001 0.0000009  0.0060    0.9384    
## Residuals                 610 0.090384 0.0001482                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#VIF for all variables except Roof Area to inspect for multicollinearity
model1_vif <- vif(lm(Heating_Load ~ Relative_Compactness + Surface_Area + Wall_Area + Overall_Height + Orientation + Glazing_Area + Glazing_Area_Distribution, data = scl_train_energy))
model1_vif
##      Relative_Compactness              Surface_Area                 Wall_Area 
##                102.307465                195.849849                  7.722856 
##            Overall_Height               Orientation              Glazing_Area 
##                 31.287014                  1.001083                  1.040980 
## Glazing_Area_Distribution 
##                  1.042483
model1_vif %>%
  kbl(caption = "VIF for Model 1") %>%
  kable_classic()
VIF for Model 1
x
Relative_Compactness 102.307465
Surface_Area 195.849849
Wall_Area 7.722856
Overall_Height 31.287014
Orientation 1.001083
Glazing_Area 1.040980
Glazing_Area_Distribution 1.042483
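#VIF check for the low-VIF predictors; this formula also includes Surface_Area, which is not a term in model4
model4_vif <- vif(lm(Heating_Load ~ Surface_Area + Orientation + Glazing_Area + Glazing_Area_Distribution, data = scl_train_energy))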
model4_vif
##              Surface_Area               Orientation              Glazing_Area 
##                  1.001694                  1.000606                  1.040917 
## Glazing_Area_Distribution 
##                  1.041211

Model 5 used the combination of features for which every VIF was below 2 while retaining the predictor most strongly correlated with the response variable (Overall Height).

model5_vif <- vif(lm(Heating_Load ~ Overall_Height + Wall_Area + Orientation + Glazing_Area + Glazing_Area_Distribution, data = scl_train_energy))
model5_vif
##            Overall_Height                 Wall_Area               Orientation 
##                  1.098881                  1.097451                  1.000927 
##              Glazing_Area Glazing_Area_Distribution 
##                  1.040881                  1.041922
model5_vif %>%
  kbl(caption = "VIF for Model 5") %>%
  kable_classic()
VIF for Model 5
x
Overall_Height 1.098881
Wall_Area 1.097452
Orientation 1.000927
Glazing_Area 1.040881
Glazing_Area_Distribution 1.041922
model5 <- lm(Heating_Load ~ Overall_Height + Wall_Area + Orientation + Glazing_Area + Glazing_Area_Distribution, data = scl_train_energy)
summary(model5)
## 
## Call:
## lm(formula = Heating_Load ~ Overall_Height + Wall_Area + Orientation + 
##     Glazing_Area + Glazing_Area_Distribution, data = scl_train_energy)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0118520 -0.0019560 -0.0000219  0.0017496  0.0099520 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -0.029608   0.001285 -23.048  < 2e-16 ***
## Overall_Height             4.802678   0.074084  64.828  < 2e-16 ***
## Wall_Area                  0.051208   0.002928  17.490  < 2e-16 ***
## Orientation               -0.085678   0.111106  -0.771  0.44093    
## Glazing_Area              19.098421   0.949685  20.110  < 2e-16 ***
## Glazing_Area_Distribution  0.220975   0.081930   2.697  0.00719 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.00379 on 608 degrees of freedom
## Multiple R-squared:  0.9101, Adjusted R-squared:  0.9094 
## F-statistic:  1232 on 5 and 608 DF,  p-value: < 2.2e-16
anova(model5)
## Analysis of Variance Table
## 
## Response: Heating_Load
##                            Df   Sum Sq  Mean Sq   F value    Pr(>F)    
## Overall_Height              1 0.077663 0.077663 5406.4055 < 2.2e-16 ***
## Wall_Area                   1 0.004311 0.004311  300.1218 < 2.2e-16 ***
## Orientation                 1 0.000018 0.000018    1.2564  0.262771    
## Glazing_Area                1 0.006368 0.006368  443.2663 < 2.2e-16 ***
## Glazing_Area_Distribution   1 0.000104 0.000104    7.2744  0.007188 ** 
## Residuals                 608 0.008734 0.000014                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model 5 offered the best balance of the models considered, with low multicollinearity, R-squared and adjusted R-squared values above 0.9 (0.910 and 0.909), and the highest F-statistic (1232). The residual degrees of freedom are high (608) relative to the five predictors, indicating a low risk of over-fitting. Model 5 also showed an improvement in the distribution of residuals. Although the residual analysis still raises a concern about heteroscedasticity, an improvement is detectable in the Residuals versus Leverage plot.
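Numerical checks can complement the diagnostic plots. The sketch below, run on simulated data with deliberately non-constant error variance, applies a Shapiro-Wilk test to the residuals and a rank correlation between fitted values and absolute residuals as a rough indicator of heteroscedasticity. These checks are illustrative and were not part of the original analysis; applying them to model5 in place of fit is an assumption about how they might be used.

```r
# Simulated data whose error spread grows with the predictor
set.seed(7)
x <- runif(150, 0, 1)
y <- 2 + 3 * x + rnorm(150, sd = 0.2 + 0.8 * x)
fit <- lm(y ~ x)

# Shapiro-Wilk test of residual normality
normality <- shapiro.test(residuals(fit))

# Rank correlation between fitted values and absolute residuals;
# a clearly positive value suggests the spread increases with the fit
spread_trend <- cor.test(fitted(fit), abs(residuals(fit)), method = "spearman")

normality$p.value
spread_trend$estimate
```

A small Shapiro-Wilk p-value points to non-normal residuals, and a positive rank correlation points to error variance that grows with the fitted values.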

Model 5.2 was also developed to assess performance for the second response variable, Cooling Load. Results were comparable, with a slightly lower R-squared of 0.878 and an adjusted R-squared of 0.877. Further information on Model 5.2 is available in the R Markdown document that accompanies this report.

model5_stats <- glance(model5) %>%
  transmute(
    'R-squared' = r.squared,
    'Adj. R-squared' = adj.r.squared,
    'F-statistic' = statistic,
    'F p-value' = p.value,
    'Model df' = df,
    'Residual df' = df.residual
  )

model5_stats
## # A tibble: 1 × 6
##   `R-squared` `Adj. R-squared` `F-statistic` `F p-value` `Model df`
##         <dbl>            <dbl>         <dbl>       <dbl>      <dbl>
## 1       0.910            0.909         1232.   2.64e-315          5
## # ℹ 1 more variable: `Residual df` <int>
#Make predictions on the testing data
energy_predict_5 <- predict(model5, newdata = scl_test_energy)

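#Evaluate performance metrics. Both calls below pass energy_predict_3 (the Model 3 predictions), so the printed values repeat the Model 3 results; scoring Model 5 requires predicted = energy_predict_5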
mae_heating_5 <- mae(actual = scl_test_energy$Heating_Load, predicted = energy_predict_3)
mae_heating_5
## [1] 0.00391559
mae_cooling_5 <- mae(actual = scl_test_energy$Cooling_Load, predicted = energy_predict_3)
mae_cooling_5
## [1] 0.004362644
#Test residuals
par(mfrow = c(2,2))
plot(model5)

model5_stats %>%
  kbl(caption = "Results from Model 5") %>%
  kable_classic()
Results from Model 5
R-squared Adj. R-squared F-statistic F p-value Model df Residual df
0.9101432 0.9094043 1231.665 0 5 608
mae_heating_5 %>%
  kbl(caption = "MAE for Prediction of Heating Load for Model 5") %>%
  kable_classic()
MAE for Prediction of Heating Load for Model 5
x
0.0039156
mae_cooling_5 %>%
  kbl(caption = "MAE for Prediction of Cooling Load for Model 5") %>%
  kable_classic()
MAE for Prediction of Cooling Load for Model 5
x
0.0043626
# Model 5.2 was created to predict second response variable, Cooling Load
model5.2 <- lm(Cooling_Load ~ Overall_Height + Wall_Area + Orientation + Glazing_Area + Glazing_Area_Distribution, data = scl_train_energy)
summary(model5.2)
## 
## Call:
## lm(formula = Cooling_Load ~ Overall_Height + Wall_Area + Orientation + 
##     Glazing_Area + Glazing_Area_Distribution, data = scl_train_energy)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.010002 -0.002635 -0.000186  0.001988  0.014699 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -0.020385   0.001419 -14.363   <2e-16 ***
## Overall_Height             4.633530   0.081851  56.609   <2e-16 ***
## Wall_Area                  0.040850   0.003235  12.628   <2e-16 ***
## Orientation                0.070579   0.122755   0.575    0.566    
## Glazing_Area              14.051090   1.049252  13.392   <2e-16 ***
## Glazing_Area_Distribution  0.103826   0.090520   1.147    0.252    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.004187 on 608 degrees of freedom
## Multiple R-squared:  0.8778, Adjusted R-squared:  0.8768 
## F-statistic: 873.7 on 5 and 608 DF,  p-value: < 2.2e-16
anova(model5.2)
## Analysis of Variance Table
## 
## Response: Cooling_Load
##                            Df   Sum Sq  Mean Sq   F value Pr(>F)    
## Overall_Height              1 0.070458 0.070458 4018.1383 <2e-16 ***
## Wall_Area                   1 0.002735 0.002735  155.9711 <2e-16 ***
## Orientation                 1 0.000002 0.000002    0.1195 0.7297    
## Glazing_Area                1 0.003383 0.003383  192.9186 <2e-16 ***
## Glazing_Area_Distribution   1 0.000023 0.000023    1.3156 0.2518    
## Residuals                 608 0.010661 0.000018                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
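The Multiple R-squared reported by summary(model5.2) can be recovered from the Sum Sq column of the ANOVA table, because R-squared equals one minus the residual sum of squares divided by the total sum of squares. The sketch below re-enters the printed (rounded) values by hand, so small rounding differences are expected. The names sum_sq, ss_total, ss_residual, and share are illustrative and not part of the original analysis.

```r
# Sum Sq column copied by hand from the anova(model5.2) table (rounded values)
sum_sq <- c(
  Overall_Height            = 0.070458,
  Wall_Area                 = 0.002735,
  Orientation               = 0.000002,
  Glazing_Area              = 0.003383,
  Glazing_Area_Distribution = 0.000023,
  Residuals                 = 0.010661
)

ss_total    <- sum(sum_sq)              # total sum of squares
ss_residual <- sum_sq[["Residuals"]]    # variation left unexplained
r_squared   <- 1 - ss_residual / ss_total

# Share of total variation attributed to each term (sequential sums of squares)
share <- round(sum_sq / ss_total, 3)

print(round(r_squared, 4))
print(share)
```

With these rounded inputs the result is approximately 0.8778, in line with the summary output. Because anova() on an lm object reports sequential sums of squares, the per-term shares depend on the order in which the predictors enter the formula.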
#View results
model5.2_stats <- glance(model5.2) %>%
  transmute(
    'R-squared' = r.squared,
    'Adj. R-squared' = adj.r.squared,
    'F-statistic' = statistic,
    'F p-value' = p.value,
    'Model df' = df,
    'Residual df' = df.residual
  )

model5.2_stats
## # A tibble: 1 × 6
##   `R-squared` `Adj. R-squared` `F-statistic` `F p-value` `Model df`
##         <dbl>            <dbl>         <dbl>       <dbl>      <dbl>
## 1       0.878            0.877          874.   9.15e-275          5
## # ℹ 1 more variable: `Residual df` <int>
#Make predictions on the testing data
energy_predict_5.2 <- predict(model5.2, newdata = scl_test_energy)

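#Evaluate performance metrics
#Both calls below score energy_predict_3 rather than energy_predict_5.2, so the values match those reported for Model 5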
mae_heating_5.2 <- mae(actual = scl_test_energy$Heating_Load, predicted = energy_predict_3)
mae_heating_5.2
## [1] 0.00391559
mae_cooling_5.2 <- mae(actual = scl_test_energy$Cooling_Load, predicted = energy_predict_3)
mae_cooling_5.2
## [1] 0.004362644
#Test residuals
par(mfrow = c(2,2))
plot(model5.2)

model5.2_stats %>%
  kbl(caption = "Results for Cooling Load for Model 5") %>%
  kable_classic()
Results for Cooling Load for Model 5
R-squared | Adj. R-squared | F-statistic | F p-value | Model df | Residual df
0.8778249 | 0.8768201 | 873.6926 | 0 | 5 | 608
mae_cooling_5.2 %>%
  kbl(caption = "MAE for Cooling Load Prediction for Model 5") %>%
  kable_classic()
MAE for Cooling Load Prediction for Model 5
MAE
0.0043626
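As a minimal sketch of the metric reported above, mean absolute error can be written out in base R as the mean of the absolute differences between observed and predicted values. The helper mae_manual and the toy data below are hypothetical and stand in for the scaled test set; it is assumed, not confirmed, that the mae() function used in the analysis computes the same quantity. The example also illustrates why each response variable is paired with the predictions of the model fitted to that response.

```r
# Hypothetical helper: mean absolute error written out in base R
mae_manual <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  mean(abs(actual - predicted))
}

# Toy data standing in for a scaled test set (not the study's data)
set.seed(42)
toy <- data.frame(x = runif(100))
toy$heating <- 2 * toy$x + rnorm(100, sd = 0.1)
toy$cooling <- 3 * toy$x + rnorm(100, sd = 0.1)

train <- toy[1:80, ]
test  <- toy[81:100, ]

# One model per response variable
fit_heating <- lm(heating ~ x, data = train)
fit_cooling <- lm(cooling ~ x, data = train)

pred_heating <- predict(fit_heating, newdata = test)
pred_cooling <- predict(fit_cooling, newdata = test)

# Each response is scored against the predictions of its own model
mae_heat <- mae_manual(test$heating, pred_heating)
mae_cool <- mae_manual(test$cooling, pred_cooling)

# Scoring cooling against the heating model's predictions mixes two targets
mae_mismatch <- mae_manual(test$cooling, pred_heating)

print(c(heating = mae_heat, cooling = mae_cool, mismatch = mae_mismatch))
```

In this toy setting the mismatched pairing yields a larger error than the matched one, because the heating model was never fitted to the cooling response.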

Discussion

The Energy Efficiency dataset used in this project, developed by Tsanas and Xifara (2012), has been referenced as a benchmark dataset for the exploration of analytic methods (Žegklitz & Pošík, 2017). Both multiple regression and classification models have been used to predict energy efficiency (Tsanas and Xifara, 2012; Yazici et al., 2018).

Results from this project indicate that multiple regression, combined with feature selection, is a viable machine learning model for predicting energy efficiency. An appropriate balance of predictor variables improves model performance. In this study, model complexity and multicollinearity were reduced by retaining only variables with low VIF scores. Including a predictor with a strong relationship to the response variable also contributed to the performance of Model 5, which reached an adjusted R-squared of 0.909; the corresponding Cooling Load model (Model 5.2) reached 0.877.
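The VIF screening described above can be sketched in base R. For predictor j, the variance inflation factor is 1 / (1 - R-squared of j), where that R-squared comes from regressing predictor j on all remaining predictors. The function vif_manual and the simulated predictors x1, x2, and x3 are hypothetical illustrations; they are not the variables or the code used in this study.

```r
# Hypothetical predictors: x2 is built to be strongly related to x1
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
predictors <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all of the other predictors
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    fit <- lm(reformulate(others, response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_manual(predictors)
print(round(vifs, 2))
```

In this simulation x1 and x2 receive large VIF values because each is nearly a linear function of the other, while x3 stays close to 1. Dropping one of the two related predictors and recomputing is one way to arrive at a set of variables with low VIF scores.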

Limitations of this study include the absence of a classification model for comparison with the multiple regression model. The literature indicates that classification models, specifically Random Forest, outperform linear regression in predicting energy efficiency (Tsanas and Xifara, 2012; Yazici et al., 2018). Future analysis with the Energy Efficiency dataset will include cross-validation to reduce the risk of overfitting during model development. Additionally, results from this study will be compared with those of Random Forest classification models.
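As a rough illustration of the cross-validation planned for future work, the sketch below runs 5-fold cross-validation for a linear model in base R and averages the held-out MAE across folds. The simulated data frame sim, its columns height, glazing, and load, and the choice of k = 5 are assumptions made for the example; they do not come from the Energy Efficiency dataset.

```r
# Simulated data standing in for scaled building features (hypothetical)
set.seed(123)
n   <- 150
sim <- data.frame(height = runif(n), glazing = runif(n))
sim$load <- 4 * sim$height + 1.5 * sim$glazing + rnorm(n, sd = 0.2)

# Assign every row to one of k folds at random
k       <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

# Fit on k - 1 folds, then score MAE on the held-out fold
fold_mae <- sapply(seq_len(k), function(i) {
  train <- sim[fold_id != i, ]
  test  <- sim[fold_id == i, ]
  fit   <- lm(load ~ height + glazing, data = train)
  mean(abs(test$load - predict(fit, newdata = test)))
})

cv_mae <- mean(fold_mae)
print(round(fold_mae, 4))
print(round(cv_mae, 4))
```

Large differences between fold-level errors, or a cross-validated error well above the training error, would suggest that a single train/test split gives an optimistic picture of model performance.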

Overall, the multiple regression model developed for this study balanced model complexity with predictive power by using VIF calculations together with the Spearman correlation matrix to identify essential features for inclusion. This study demonstrated that multiple regression models are valuable machine learning models for predicting energy efficiency in building design.

citation("tidyverse")
## To cite package 'tidyverse' in publications use:
## 
##   Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R,
##   Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller
##   E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V,
##   Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to
##   the tidyverse." _Journal of Open Source Software_, *4*(43), 1686.
##   doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {Welcome to the {tidyverse}},
##     author = {Hadley Wickham and Mara Averick and Jennifer Bryan and Winston Chang and Lucy D'Agostino McGowan and Romain François and Garrett Grolemund and Alex Hayes and Lionel Henry and Jim Hester and Max Kuhn and Thomas Lin Pedersen and Evan Miller and Stephan Milton Bache and Kirill Müller and Jeroen Ooms and David Robinson and Dana Paige Seidel and Vitalie Spinu and Kohske Takahashi and Davis Vaughan and Claus Wilke and Kara Woo and Hiroaki Yutani},
##     year = {2019},
##     journal = {Journal of Open Source Software},
##     volume = {4},
##     number = {43},
##     pages = {1686},
##     doi = {10.21105/joss.01686},
##   }
citation("dplyr")
## To cite package 'dplyr' in publications use:
## 
##   Wickham H, François R, Henry L, Müller K, Vaughan D (2023). _dplyr: A
##   Grammar of Data Manipulation_. doi:10.32614/CRAN.package.dplyr
##   <https://doi.org/10.32614/CRAN.package.dplyr>, R package version
##   1.1.4, <https://CRAN.R-project.org/package=dplyr>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {dplyr: A Grammar of Data Manipulation},
##     author = {Hadley Wickham and Romain François and Lionel Henry and Kirill Müller and Davis Vaughan},
##     year = {2023},
##     note = {R package version 1.1.4},
##     url = {https://CRAN.R-project.org/package=dplyr},
##     doi = {10.32614/CRAN.package.dplyr},
##   }
citation("readr")
## To cite package 'readr' in publications use:
## 
##   Wickham H, Hester J, Bryan J (2024). _readr: Read Rectangular Text
##   Data_. doi:10.32614/CRAN.package.readr
##   <https://doi.org/10.32614/CRAN.package.readr>, R package version
##   2.1.5, <https://CRAN.R-project.org/package=readr>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {readr: Read Rectangular Text Data},
##     author = {Hadley Wickham and Jim Hester and Jennifer Bryan},
##     year = {2024},
##     note = {R package version 2.1.5},
##     url = {https://CRAN.R-project.org/package=readr},
##     doi = {10.32614/CRAN.package.readr},
##   }
citation("boot")
## To cite the 'boot' package in publications use:
## 
##   Angelo Canty and Brian Ripley (2024). boot: Bootstrap R (S-Plus)
##   Functions. R package version 1.3-31.
## 
##   Davison, A. C. & Hinkley, D. V. (1997) Bootstrap Methods and Their
##   Applications. Cambridge University Press, Cambridge. ISBN
##   0-521-57391-2
## 
## To see these entries in BibTeX format, use 'print(<citation>,
## bibtex=TRUE)', 'toBibtex(.)', or set
## 'options(citation.bibtex.max=999)'.
citation("ggplot2")
## To cite ggplot2 in publications, please use
## 
##   H. Wickham. ggplot2: Elegant Graphics for Data Analysis.
##   Springer-Verlag New York, 2016.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Book{,
##     author = {Hadley Wickham},
##     title = {ggplot2: Elegant Graphics for Data Analysis},
##     publisher = {Springer-Verlag New York},
##     year = {2016},
##     isbn = {978-3-319-24277-4},
##     url = {https://ggplot2.tidyverse.org},
##   }
citation("moments")
## To cite package 'moments' in publications use:
## 
##   Komsta L, Novomestky F (2022). _moments: Moments, Cumulants,
##   Skewness, Kurtosis and Related Tests_.
##   doi:10.32614/CRAN.package.moments
##   <https://doi.org/10.32614/CRAN.package.moments>, R package version
##   0.14.1, <https://CRAN.R-project.org/package=moments>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {moments: Moments, Cumulants, Skewness, Kurtosis and Related Tests},
##     author = {Lukasz Komsta and Frederick Novomestky},
##     year = {2022},
##     note = {R package version 0.14.1},
##     url = {https://CRAN.R-project.org/package=moments},
##     doi = {10.32614/CRAN.package.moments},
##   }
## 
## ATTENTION: This citation information has been auto-generated from the
## package DESCRIPTION file and may need manual editing, see
## 'help("citation")'.
citation("scales")
## To cite package 'scales' in publications use:
## 
##   Wickham H, Pedersen T, Seidel D (2025). _scales: Scale Functions for
##   Visualization_. doi:10.32614/CRAN.package.scales
##   <https://doi.org/10.32614/CRAN.package.scales>, R package version
##   1.4.0, <https://CRAN.R-project.org/package=scales>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {scales: Scale Functions for Visualization},
##     author = {Hadley Wickham and Thomas Lin Pedersen and Dana Seidel},
##     year = {2025},
##     note = {R package version 1.4.0},
##     url = {https://CRAN.R-project.org/package=scales},
##     doi = {10.32614/CRAN.package.scales},
##   }
citation("kableExtra")
## To cite package 'kableExtra' in publications use:
## 
##   Zhu H (2024). _kableExtra: Construct Complex Table with 'kable' and
##   Pipe Syntax_. doi:10.32614/CRAN.package.kableExtra
##   <https://doi.org/10.32614/CRAN.package.kableExtra>, R package version
##   1.4.0, <https://CRAN.R-project.org/package=kableExtra>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {kableExtra: Construct Complex Table with 'kable' and Pipe Syntax},
##     author = {Hao Zhu},
##     year = {2024},
##     note = {R package version 1.4.0},
##     url = {https://CRAN.R-project.org/package=kableExtra},
##     doi = {10.32614/CRAN.package.kableExtra},
##   }
citation("rsample")
## To cite package 'rsample' in publications use:
## 
##   Frick H, Chow F, Kuhn M, Mahoney M, Silge J, Wickham H (2025).
##   _rsample: General Resampling Infrastructure_.
##   doi:10.32614/CRAN.package.rsample
##   <https://doi.org/10.32614/CRAN.package.rsample>, R package version
##   1.3.1, <https://CRAN.R-project.org/package=rsample>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {rsample: General Resampling Infrastructure},
##     author = {Hannah Frick and Fanny Chow and Max Kuhn and Michael Mahoney and Julia Silge and Hadley Wickham},
##     year = {2025},
##     note = {R package version 1.3.1},
##     url = {https://CRAN.R-project.org/package=rsample},
##     doi = {10.32614/CRAN.package.rsample},
##   }
citation("tidymodels")
## To cite package 'tidymodels' in publications use:
## 
##   Kuhn et al., (2020). Tidymodels: a collection of packages for
##   modeling and machine learning using tidyverse principles.
##   https://www.tidymodels.org
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles.},
##     author = {Max Kuhn and Hadley Wickham},
##     url = {https://www.tidymodels.org},
##     year = {2020},
##   }
citation("parsnip")
## To cite package 'parsnip' in publications use:
## 
##   Kuhn M, Vaughan D (2025). _parsnip: A Common API to Modeling and
##   Analysis Functions_. doi:10.32614/CRAN.package.parsnip
##   <https://doi.org/10.32614/CRAN.package.parsnip>, R package version
##   1.4.0, <https://CRAN.R-project.org/package=parsnip>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {parsnip: A Common API to Modeling and Analysis Functions},
##     author = {Max Kuhn and Davis Vaughan},
##     year = {2025},
##     note = {R package version 1.4.0},
##     url = {https://CRAN.R-project.org/package=parsnip},
##     doi = {10.32614/CRAN.package.parsnip},
##   }
citation("caret")
## To cite caret in publications use:
## 
##   Kuhn, M. (2008). Building Predictive Models in R Using the caret
##   Package. Journal of Statistical Software, 28(5), 1–26.
##   https://doi.org/10.18637/jss.v028.i05
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {Building Predictive Models in R Using the caret Package},
##     volume = {28},
##     url = {https://www.jstatsoft.org/index.php/jss/article/view/v028i05},
##     doi = {10.18637/jss.v028.i05},
##     number = {5},
##     journal = {Journal of Statistical Software},
##     author = {{Kuhn} and {Max}},
##     year = {2008},
##     pages = {1–26},
##   }
citation("yardstick")
## To cite package 'yardstick' in publications use:
## 
##   Kuhn M, Vaughan D, Hvitfeldt E (2025). _yardstick: Tidy
##   Characterizations of Model Performance_.
##   doi:10.32614/CRAN.package.yardstick
##   <https://doi.org/10.32614/CRAN.package.yardstick>, R package version
##   1.3.2, <https://CRAN.R-project.org/package=yardstick>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {yardstick: Tidy Characterizations of Model Performance},
##     author = {Max Kuhn and Davis Vaughan and Emil Hvitfeldt},
##     year = {2025},
##     note = {R package version 1.3.2},
##     url = {https://CRAN.R-project.org/package=yardstick},
##     doi = {10.32614/CRAN.package.yardstick},
##   }
citation("readxl")
## To cite package 'readxl' in publications use:
## 
##   Wickham H, Bryan J (2025). _readxl: Read Excel Files_.
##   doi:10.32614/CRAN.package.readxl
##   <https://doi.org/10.32614/CRAN.package.readxl>, R package version
##   1.4.5, <https://CRAN.R-project.org/package=readxl>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {readxl: Read Excel Files},
##     author = {Hadley Wickham and Jennifer Bryan},
##     year = {2025},
##     note = {R package version 1.4.5},
##     url = {https://CRAN.R-project.org/package=readxl},
##     doi = {10.32614/CRAN.package.readxl},
##   }
citation("gtsummary")
## To cite gtsummary in publications use:
## 
##   Sjoberg DD, Whiting K, Curry M, Lavery JA, Larmarange J. Reproducible
##   summary tables with the gtsummary package. The R Journal
##   2021;13:570–80. https://doi.org/10.32614/RJ-2021-053.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{gtsummary,
##     author = {Daniel D. Sjoberg and Karissa Whiting and Michael Curry and Jessica A. Lavery and Joseph Larmarange},
##     title = {Reproducible Summary Tables with the gtsummary Package},
##     journal = {{The R Journal}},
##     year = {2021},
##     url = {https://doi.org/10.32614/RJ-2021-053},
##     doi = {10.32614/RJ-2021-053},
##     volume = {13},
##     issue = {1},
##     pages = {570-580},
##   }
citation("broom")
## To cite package 'broom' in publications use:
## 
##   Robinson D, Hayes A, Couch S, Hvitfeldt E (2025). _broom: Convert
##   Statistical Objects into Tidy Tibbles_.
##   doi:10.32614/CRAN.package.broom
##   <https://doi.org/10.32614/CRAN.package.broom>, R package version
##   1.0.11, <https://CRAN.R-project.org/package=broom>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {broom: Convert Statistical Objects into Tidy Tibbles},
##     author = {David Robinson and Alex Hayes and Simon Couch and Emil Hvitfeldt},
##     year = {2025},
##     note = {R package version 1.0.11},
##     url = {https://CRAN.R-project.org/package=broom},
##     doi = {10.32614/CRAN.package.broom},
##   }
citation("pROC")
## If you use pROC in published research, please cite the following paper:
## 
##   Xavier Robin, Natacha Turck, Alexandre Hainard, Natalia Tiberti,
##   Frédérique Lisacek, Jean-Charles Sanchez and Markus Müller (2011).
##   pROC: an open-source package for R and S+ to analyze and compare ROC
##   curves. BMC Bioinformatics, 12, p. 77.  DOI: 10.1186/1471-2105-12-77
##   <http://www.biomedcentral.com/1471-2105/12/77/>
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {pROC: an open-source package for R and S+ to analyze and compare ROC curves},
##     author = {Xavier Robin and Natacha Turck and Alexandre Hainard and Natalia Tiberti and Frédérique Lisacek and Jean-Charles Sanchez and Markus Müller},
##     year = {2011},
##     journal = {BMC Bioinformatics},
##     volume = {12},
##     pages = {77},
##   }

Gupta, G., Mathur, S., Mathur, J., & Nayak, B. K. (2023). Comparison of energy-efficiency benchmarking methodologies for residential buildings. Energy and Buildings, 285. https://doi.org/10.1016/j.enbuild.2023.112920

Salleh, S. F., Suleiman, A. A., Daud, H., Othman, M., Sokkalingam, R., & Wagner, K. (2023). Tropically Adapted Passive Building: A Descriptive-Analytical Approach Using Multiple Linear Regression and Probability Models to Predict Indoor Temperature. Sustainability (Switzerland), 15(18). https://doi.org/10.3390/su151813647

Singh, S. (2023, June 1). A Comprehensive Guide to Model Evaluation in R: Step-by-Step with Examples. Medium. https://medium.com/@jscvcds/a-comprehensive-guide-to-model-evaluation-in-r-step-by-step-with-examples-4652a2b4916f

Tsanas, A., & Xifara, A. (2012). Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools. Energy and Buildings, 49, 560–567. https://doi.org/10.1016/j.enbuild.2012.03.003

Tsanas, A., & Xifara, A. (2012). Energy Efficiency [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C51307

Yazici, M. T., Basurra, S., & Gaber, M. M. (2018). Edge machine learning: Enabling smart internet of things applications. Big Data and Cognitive Computing, 2(3), 1–17. https://doi.org/10.3390/bdcc2030026

Žegklitz, J., & Pošík, P. (2017). Symbolic Regression Algorithms with Built-in Linear Regression. http://arxiv.org/abs/1701.03641