This project applies multiple regression analysis to predict the energy efficiency of residential buildings. A dataset from the UCI Machine Learning Repository (2012) is used to evaluate how various physical characteristics, such as geometric drivers, glazing, and orientation, influence heating and cooling load requirements. The objective is to construct, assess, and refine multiple regression models that estimate energy demand while balancing model complexity with predictive performance.
Key diagnostic and statistical indicators are emphasized, including degrees of freedom, the F-statistic, adjusted R², and predictor significance levels. Through iterative model optimization, the analysis seeks to enhance predictive accuracy without introducing overfitting, preserving both model simplicity and explanatory value.
The dataset contains 768 simulated building prototypes, each
characterized by eight input variables and two continuous energy
performance metrics, Heating_Load and
Cooling_Load (measured in kWh/m²).
Simulations were generated under standardized conditions and curated by Tsanas and Xifara (2012), offering a structured basis for quantifying how design-related factors affect operational energy use. The dataset provides an introductory platform for exploring energy-efficient architectural strategies through statistical modeling.
As previously stated, this dataset is originally derived from Ecotect simulations and first described by Tsanas and Xifara (2012), the Energy Efficiency dataset is hosted by the UCI Machine-Learning Repository (2012).
The authors generated 768 residential-building prototypes by systematically varying envelope geometry, glazing amount/distribution, and orientation, then recorded each case’s annual Heating Load and Cooling Load (kWh m⁻²).
| Variable | Role | Brief meaning |
|---|---|---|
Relative_Compactness – Roof_Area |
Predictors (continuous) | Geometric drivers of heat exchange (e.g., wall and roof areas). |
Overall_Height |
Predictor (continuous) | Two discrete levels (3.5 m and 7 m). |
Orientation |
Predictor (ordinal / categorical) | Cardinal direction (2 = N, 3 = E, 4 = S, 5 = W). |
Glazing_Area |
Predictor (continuous) | Fraction of the exterior surface that is glazed (0 – 0.40). |
Glazing_Area_Distribution |
Predictor (categorical) | Six patterns describing how glazing is spread across façades. |
Heating_Load, Cooling_Load |
Responses (continuous) | Annual energy needed for heating and cooling, respectively (kWh m⁻²). |
Except for Orientation (a four-level factor) and
Glazing_Area_Distribution (a six-level factor), all
remaining predictors are continuous numeric variables, giving us a mix
of six continuous geometry-related features and two orderly categorical
design choices.
Relative Compactness, Surface Area,
Wall Area, Roof Area,
Overall Height, and Glazing AreaOrientation (4 levels) and
Glazing_Area_Distribution (6 levels)glimpse() Observations<dbl>; we will recast Orientation and
Glazing_Area_Distribution as factors.Relative_Compactness spans roughly 0.62 –
1.00 (first few rows ≈ 0.98).This output confirms that the dataset is clean, complete, and ready
for analysis. With 768 distinct building configurations and no missing
values, we can proceed confidently with regression modeling without
requiring imputation or corrective preprocessing. The variable types and
observed value range, such as Relative Compactness between
0.62 and 1.00, and Heating_Load and
Cooling_Load within expected operating ranges, are
consistent with those reported by Tsanas and Xifara (2012), who
originally curated the dataset through energy simulation.
Among these variables, Relative Compactness (RC) stands
out as a key predictor of energy demand, as it quantifies a building’s
surface-area-to-volume ratio, a primary driver of thermal efficiency.
More compact buildings (higher RC values) typically exhibit lower
heating and cooling loads due to reduced envelope exposure.
Tsanas and Xifara (2012) found RC to be one of the most influential predictors of both Heating Load and Cooling Load using machine learning methods, and our dataset reflects similar patterns. This alignment further supports the reliability of the data import and reinforces RC’s importance in any model aimed at estimating energy performance.
The original file labels variables as X1–X8, y1, y2. Renaming them to
descriptive terms (Relative_Compactness,
Surface_Area, Heating_Load, etc.) makes model
output, plots, and narrative discussion instantly understandable,
reduces coding mistakes, and aligns our analysis with the terminology
used in peer-reviewed studies.
setwd("E:/DataSets/energy+efficiency")
energy_raw <- read_excel("ENB2012_data.xlsx")
# Rename columns for clarity
colnames(energy_raw) <- c("Relative_Compactness", "Surface_Area", "Wall_Area", "Roof_Area",
"Overall_Height", "Orientation", "Glazing_Area", "Glazing_Area_Distribution",
"Heating_Load", "Cooling_Load")
energy_raw %>% glimpse()
## Rows: 768
## Columns: 10
## $ Relative_Compactness <dbl> 0.98, 0.98, 0.98, 0.98, 0.90, 0.90, 0.90, 0.…
## $ Surface_Area <dbl> 514.5, 514.5, 514.5, 514.5, 563.5, 563.5, 56…
## $ Wall_Area <dbl> 294.0, 294.0, 294.0, 294.0, 318.5, 318.5, 31…
## $ Roof_Area <dbl> 110.25, 110.25, 110.25, 110.25, 122.50, 122.…
## $ Overall_Height <dbl> 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0,…
## $ Orientation <dbl> 2, 3, 4, 5, 2, 3, 4, 5, 2, 3, 4, 5, 2, 3, 4,…
## $ Glazing_Area <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Glazing_Area_Distribution <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Heating_Load <dbl> 15.55, 15.55, 15.55, 15.55, 20.84, 21.46, 20…
## $ Cooling_Load <dbl> 21.33, 21.33, 21.33, 21.33, 28.28, 25.38, 25…
This section establishes data integrity prior to model development. The objective is to validate data completeness, ensure correct variable classifications, and flag any structural irregularities that may compromise model interpretability or operational relevance.
No missing values were identified across any of the dataset’s ten
variables. A column-wise check (sapply()) confirmed
complete population of all entries (sum(is.na(x)) == 0).
This result eliminates the need for imputation or data repair procedures
and enables clean, uninterrupted progression to model training. The
absence of missing data ensures that model coefficients will not be
distorted by interpolation artifacts or surrogate splits—an important
consideration for regulated environments or energy impact
evaluations.
Orientation and Glazing_Area_Distribution
represent categorical design features, though originally encoded as
numeric integers. These fields were explicitly recast as factor types to
ensure that regression models interpret them as discrete categories, not
continuous metrics (Fox & Weisberg, 2019). This step enhances
interpretability of coefficients and avoids invalid linear assumptions
(e.g., “East” being twice as strong as “North”). Proper typing ensures
alignment with actual architectural inputs used in early-stage design or
retrofit evaluations.
Boxplots were generated for all continuous variables to assess the presence of outliers or data anomalies. Results confirm the dataset’s structural integrity:
Wall Area,
Surface Area, and
Roof Area display moderate statistical
outliers, which are defined as points beyond 1.5×IQR. But all values
remain physically plausible and consistent with variations across the 12
simulated building configurations.Relative Compactness is bounded
between 0.62 and 1.00 by design and shows no signs of data
distortion.Glazing Area includes values near
zero, accurately representing certain low-transparency façade scenarios.
These are intentional design variations—not data issues.Heating Load and
Cooling Load exhibit variation across a
realistic operating range. Heating Load values span ~6–43 kWh/m²,
consistent with published simulation results from Tsanas and Xifara
(2012).No entries were removed or transformed. All data points fall within ranges expected from controlled simulation inputs and real-world building physics. If any model diagnostics later reveal undue influence from high-leverage points, follow-up procedures (e.g., Cook’s Distance) will be applied.
The dataset passed all quality checks. No missing values or structural errors were detected. Categorical design features have been correctly reclassified, and all numeric fields remain within valid operating bounds. While minor statistical outliers were identified, none required removal. This clean data foundation supports the development of trustworthy and interpretable regression models for energy efficiency forecasting.
# Check for missing values
sapply(energy_raw, function(x) sum(is.na(x)))
## Relative_Compactness Surface_Area Wall_Area
## 0 0 0
## Roof_Area Overall_Height Orientation
## 0 0 0
## Glazing_Area Glazing_Area_Distribution Heating_Load
## 0 0 0
## Cooling_Load
## 0
# Convert ordinal variables to factors
energy <- energy_raw %>%
mutate(across(c(Orientation, Glazing_Area_Distribution), factor))
# Basic outlier inspection (boxplots)
energy %>%
pivot_longer(-c(Orientation, Glazing_Area_Distribution)) %>%
ggplot(aes(x = name, y = value)) +
geom_boxplot() +
coord_flip() +
labs(title = "Boxplots for Continuous Variables")
Understanding the interrelationships among predictor variables and response variables is essential for developing stable and interpretable regression models. This section evaluates the strength and direction of linear associations between continuous variables to inform feature selection, detect multicollinearity risks, and identify key drivers of energy load outcomes.
Correlation analysis was performed to evaluate the strength and direction of linear relationships between continuous variables in the dataset. This step helps identify redundant predictors (multicollinearity) and uncovers which building characteristics are most associated with energy performance outcomes (Dormann et al., 2013). Results from this analysis inform variable selection and guide regression model optimization.
A subset of numeric variables was extracted, excluding
ordinal/categorical fields such as Orientation and
Glazing_Area_Distribution. A correlation matrix was
computed using Pearson’s method, and the lower triangle was visualized
as a heatmap using ggcorrplot(). Correlation coefficients
(r) range from -1.0 (perfect inverse relationship) to +1.0
(perfect direct relationship).
Heating_Load and
Cooling_Load are highly correlated
(r = 0.98), suggesting shared drivers such as
Relative_Compactness,
Surface_Area, and
Glazing_Area, indicating potential for
model reuse or joint-modeling strategies.
Relative_Compactness is strongly
negatively correlated with
Surface_Area (r = -0.99)
and Roof_Area (r =
-0.87), consistent with the definition of compactness as a
ratio of volume to exposed surface.
Overall_Height exhibits a strong
positive correlation with both Heating_Load (r =
0.89) and Cooling_Load (r =
0.90), indicating that taller building forms may demand more
energy due to increased façade exposure or stratification
effects.
Surface_Area shows strong positive
correlation with Heating_Load (r = 0.88)
and Cooling_Load (r = 0.86)—supporting the
well-established relationship between envelope size and thermal
load.
Wall_Area and
Glazing_Area show relatively weak or
near-zero correlation with energy loads, suggesting they may play less
influential roles in linear models or be context-dependent based on
distribution and orientation.
Roof_Area displays a mild
negative correlation with Heating_Load (r
= -0.66), likely influenced by how certain roof configurations
offset or amplify total exposure.
This correlation analysis highlights key predictors that influence
energy consumption. While Relative_Compactness is
mathematically linked to Surface_Area and
Roof_Area, all three variables will be evaluated in model
development to assess multicollinearity impact (e.g., via VIF). The high
correlation between energy loads also raises the possibility of shared
model structures or transfer learning approaches for future predictive
modeling tasks.
num_vars <- energy %>% select_if(is.numeric)
cor_mat <- cor(num_vars)
ggcorrplot(cor_mat, hc.order = TRUE, type = "lower",
lab = TRUE, lab_size = 2.5, digits = 2,
title = "Correlation Matrix (Continuous Variables)")
A scatterplot matrix was generated to visually explore bivariate relationships between all continuous variables. Unlike the heatmap in Section 3.1, which summarizes correlation strength numerically, the scatterplot matrix reveals nonlinear patterns, distribution shapes, and outlier structures that may influence model behavior. This analysis also supports early detection of multicollinearity, non-constant variance, and potential interactions among predictors.
Relative_Compactness
vs. Surface_Area: A near-perfect
negative linear relationship is observed (Corr: -0.99),
confirming that as compactness increases, total surface area decreases.
This reinforces prior findings and highlights a high
multicollinearity risk if both are included in the same
regression model.
Overall_Height
vs. Roof_Area: A strong nonlinear
inverse relationship is evident (Corr: -0.97). The scatterplot
indicates two height levels (3.5 m and 7 m) associated with discrete
roof configurations. This stepwise pattern suggests structural
collinearity due to the fixed simulation design.
Heating_Load
vs. Cooling_Load: A tightly clustered,
positively sloped linear pattern confirms their very strong correlation
(Corr: 0.98). These targets share multiple physical drivers and may
result in redundant variance explanations in
independent models.
Surface_Area,
Roof_Area, and
Overall_Height all show strong
correlations with energy loads
(Heating_Load,
Cooling_Load) and also among each other,
indicating a triangular multicollinearity zone that
warrants careful feature selection or regularization.
Glazing_Area and
Wall_Area exhibit minimal correlation with
energy outcomes. Scatterplots reveal mostly horizontal spread,
indicating a weaker linear influence though their
impact may still be context-dependent based on
Orientation and
Glazing_Area_Distribution.
Nonlinear Patterns: Several relationships, such
those involving Overall_Height and
Roof_Area, deviate from strict linearity.
U-shaped or stepwise effects suggest the need for interaction
terms or nonlinear modeling approaches (e.g., polynomial,
spline, or tree-based models) in advanced regression.
The scatterplot matrix confirms key predictor relationships identified in the correlation matrix, while adding visual clarity on functional form and variable distribution. The presence of high multicollinearity—particularly among compactness, surface area, and roof geometry—emphasizes the importance of diagnostic testing (e.g., VIF) and dimensionality control during model specification. Certain nonlinearities may also motivate transformations or inclusion of interaction terms to improve model fit (Aiken et al., 1991).
GGally::ggpairs(num_vars,
columns = 1:ncol(num_vars),
aes(alpha = 0.4))
The objective of this section is to fit baseline
ordinary-least-squares (OLS) regressions that quantify how the eight
building-design predictors collectively explain variability in
Heating_Load and
Cooling_Load. These initial models
establish benchmark explanatory power, identify dominant predictors, and
flag any obvious multicollinearity issues before iterative refinement
(Crawley et al., 2008).
| Metric | Value | Interpretation & Implications |
|---|---|---|
| Model call | lm(Heating_Load ~ . - Cooling_Load) |
Heating_Load is regressed on every available predictor
except Cooling_Load. |
| Residual spread | Min = −7.12, 1Q = −1.43, Median = −0.26, 3Q = 1.30, Max = 7.43 | Residuals center near 0 with moderate dispersion (± 7 kWh m⁻²). No extreme skew. |
| Coefficients | Most predictors highly significant (p < 0.001) except the three
Orientation dummy levels; Roof_Area dropped
(aliased). |
• Relative_Compactness carries a large negative
coefficient (−64.77), confirming that more-compact forms require less
heating. • Glazing_Area has a strong positive effect
(+16.85 per 0.1 increment), reflecting heat loss through glass. • Orientation dummies are non-significant → orientation adds
little to heating demand once geometry and glazing are in. • Perfect collinearity between Roof_Area and other geometry variables
forced its removal (singularity). |
| R² | 0.9241 | About 92 % of the variability in
Heating_Load is explained—excellent explanatory power. |
| Adjusted R² | 0.9228 | Penalises for 13 active predictors; still above 0.92, signalling minimal over-fitting at this stage. |
| F-statistic | 706 on 13 and 754 df | Joint test that all coefficients (except intercept) equal 0. |
| Model p-value | < 0.00000000000000022 | Probability of observing F = 706 under a true-null model is virtually zero → the overall regression is highly significant. |
| Residual SE | 2.804 kWh m⁻² | Typical prediction error; small relative to response range (≈ 6–43 kWh m⁻²). |
High explanatory power: Geometry, glazing, and height together account for over 92 % of heating-load variation. The model is already strong before any refinement.
Key predictors
Relative_Compactness, Surface_Area,
Wall_Area, Overall_Height, and
Glazing_Area dominate.Orientation contributes little once other variables are
present.Multicollinearity signal: The aliased
Roof_Area plus large but offsetting geometry coefficients
hint at inter-dependence among size-related variables. Variance
Inflation Factor (VIF) checks (next subsection) are required.
Implications for model tuning
Orientation dummies could be dropped without hurting
fit, reducing complexity.Overall, the initial model is statistically robust but exhibits geometry-based redundancy that must be managed in subsequent optimization steps.
The model explains ≈ 92 % of the variation in
Heating_Load, indicating strong explanatory power. Most
predictors are highly significant (p < 0.001) except
Orientation levels. The variable Roof_Area was
automatically dropped because of perfect multicollinearity (aliased
coefficients).
# Initial regression model excluding Cooling_Load
model_hl_0 <- lm(Heating_Load ~ . - Cooling_Load, data = energy)
summary(model_hl_0)
##
## Call:
## lm(formula = Heating_Load ~ . - Cooling_Load, data = energy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1237 -1.4299 -0.2555 1.2972 7.4342
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 81.161469 18.189282 4.462 9.36e-06 ***
## Relative_Compactness -64.773432 9.832563 -6.588 8.38e-11 ***
## Surface_Area -0.087289 0.016317 -5.350 1.17e-07 ***
## Wall_Area 0.060813 0.006353 9.573 < 2e-16 ***
## Roof_Area NA NA NA NA
## Overall_Height 4.169954 0.322982 12.911 < 2e-16 ***
## Orientation3 0.067812 0.286185 0.237 0.813
## Orientation4 -0.052990 0.286185 -0.185 0.853
## Orientation5 -0.037500 0.286185 -0.131 0.896
## Glazing_Area 16.848333 0.853238 19.746 < 2e-16 ***
## Glazing_Area_Distribution1 4.527653 0.513717 8.814 < 2e-16 ***
## Glazing_Area_Distribution2 4.435986 0.513717 8.635 < 2e-16 ***
## Glazing_Area_Distribution3 4.183000 0.513717 8.143 1.60e-15 ***
## Glazing_Area_Distribution4 4.388208 0.513717 8.542 < 2e-16 ***
## Glazing_Area_Distribution5 4.182444 0.513717 8.142 1.61e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.804 on 754 degrees of freedom
## Multiple R-squared: 0.9241, Adjusted R-squared: 0.9228
## F-statistic: 706 on 13 and 754 DF, p-value: < 2.2e-16
| Metric | Value | Meaning |
|---|---|---|
| Residual SD | 3.176 kWh m⁻² | Typical prediction error around the regression line. |
| Multiple R² | 0.8904 | ≈ 89 % of variation in Cooling_Load is
explained by the predictors. |
| Adjusted R² | 0.8885 | Penalises for 13 predictors; explanatory power remains very strong. |
| F-statistic | 471.3 (df = 13, 754) | Overall model is highly significant. |
| Model p-value | < 0.000 000 000 000 000 22 | Rejects the null that all slopes equal zero. |
(all p < 0.001 unless noted)
| Predictor | Estimate | Interpretation |
|---|---|---|
Relative_Compactness |
−70.79 | More compact forms cut cooling demand. |
Surface_Area |
−0.088 | Larger envelopes (holding volume constant) raise cooling slightly, but note collinearity with RC. |
Wall_Area |
+0.045 | Extra wall exposure increases cooling load. |
Overall_Height |
+4.28 | Taller designs raise cooling demand (possibly due to roof/wall ratio changes). |
Glazing_Area |
+13.25 | Extensive glazing drives higher solar gains → higher cooling. |
Glazing_Area_Distribution (levels
1-5) |
+1.64 – 2.16 | Non-uniform façade layouts add incremental cooling penalties. |
Orientation levels |
ns | No meaningful directional effect under simulated conditions. |
Roof_Area |
Dropped | Perfectly aliased with other geometry terms; removed by R. |
Strong explanatory baseline: 89 % R² indicates geometry and glazing already capture most cooling drivers.
Compactness vs. envelope size: Negative RC
coefficient and positive Surface_Area/
Wall_Area coefficients confirm the
trade-off: less exposed surface per volume lowers cooling
needs.
Glazing dominates: Each 0.1 increase in
Glazing_Area (fraction of façade) raises
cooling by ≈ 1.3 kWh m⁻², underscoring façade-design leverage.
Orientation negligible: Under equal glazing shares and Greek climate, cardinal direction has limited cooling impact; focus on glazing amount/distribution instead.
Multicollinearity present: Aliased
Roof_Area and high correlations among
geometric variables call for VIF checks and possible dimensionality
reduction before inference or prediction deployment.
Overall, the initial OLS model offers a reliable benchmark but will benefit from targeted feature selection or regularisation to mitigate multicollinearity and refine coefficient interpretability.
The model accounts for ≈ 89 % of the variance in
Cooling_Load. Most predictors are significant.
Roof_Area was again removed because of aliasing with other
geometry-based variables.
# Initial regression model excluding Heating_Load
model_cl_0 <- lm(Cooling_Load ~ . - Heating_Load, data = energy)
summary(model_cl_0)
##
## Call:
## lm(formula = Cooling_Load ~ . - Heating_Load, data = energy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7082 -1.6996 -0.2814 1.4302 11.0365
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 96.370070 20.604285 4.677 3.45e-06 ***
## Relative_Compactness -70.787707 11.138038 -6.355 3.59e-10 ***
## Surface_Area -0.088245 0.018484 -4.774 2.17e-06 ***
## Wall_Area 0.044682 0.007196 6.209 8.80e-10 ***
## Roof_Area NA NA NA NA
## Overall_Height 4.283843 0.365865 11.709 < 2e-16 ***
## Orientation3 -0.291979 0.324182 -0.901 0.368054
## Orientation4 -0.124219 0.324182 -0.383 0.701697
## Orientation5 0.349115 0.324182 1.077 0.281865
## Glazing_Area 13.252917 0.966523 13.712 < 2e-16 ***
## Glazing_Area_Distribution1 2.160035 0.581924 3.712 0.000221 ***
## Glazing_Area_Distribution2 1.977396 0.581924 3.398 0.000714 ***
## Glazing_Area_Distribution3 1.639965 0.581924 2.818 0.004956 **
## Glazing_Area_Distribution4 1.995660 0.581924 3.429 0.000638 ***
## Glazing_Area_Distribution5 1.695521 0.581924 2.914 0.003678 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.176 on 754 degrees of freedom
## Multiple R-squared: 0.8904, Adjusted R-squared: 0.8885
## F-statistic: 471.3 on 13 and 754 DF, p-value: < 2.2e-16
Variance Inflation Factors (VIF) were computed for both initial regression models to assess multicollinearity among predictors. VIF values above 5 typically indicate moderate multicollinearity; values above 10 suggest a severe risk of inflated standard errors and unstable coefficients.
The variable Roof_Area was excluded from both models due
to perfect collinearity, indicated by NA coefficients and
confirmed by singularities in the model summary output.
Variance Inflation Factor (VIF) diagnostics reveal that most
predictors are stable, but the core geometry pair
Relative_Compactness and
Surface_Area exhibits severe
inflation—signaling redundant information that must be addressed in
model refinement (Belsley et al., 2005).
| Predictor | GVIF | df | GVIF1 / (2·df) | Interpretation |
|---|---|---|---|---|
Relative_Compactness |
105.524 | 1 | 10.27 | Severe multicollinearity |
Surface_Area |
201.531 | 1 | 14.20 | Severe multicollinearity |
Wall_Area |
7.492 | 1 | 2.74 | Acceptable |
Overall_Height |
31.206 | 1 | 5.59 | Moderate multicollinearity |
Orientation |
1.000 | 3 | 1.00 | Negligible |
Glazing_Area |
1.264 | 1 | 1.12 | Negligible |
Glazing_Area_Distribution |
1.264 | 5 | 1.02 | Negligible |
The same GVIF profile appears in the
Cooling_Load model.
| Finding | Implication for Modeling |
|---|---|
Relative_Compactness and
Surface_Area both exceed the
10 threshold |
Standard errors are inflated; coefficient signs or magnitudes may swing in alternative models. One variable should be dropped, combined, or handled via regularization. |
Overall_Height shows a moderate VIF (~
5.6) |
Multicollinearity is tolerable but should be monitored during refinement. |
| All other predictors are below concern thresholds | No corrective action needed. |
Because Relative_Compactness is
mathematically derived from Surface_Area
(constant building volume in the simulations), the two variables encode
the same geometric information. Retaining both neither improves
predictive power nor model stability; one should be removed or a
composite geometry index should be used.
model_hl_vif <- lm(
Heating_Load ~ Relative_Compactness + Surface_Area + Wall_Area +
Overall_Height + Orientation + Glazing_Area +
Glazing_Area_Distribution,
data = energy
)
# Refit model without Roof_Area for Cooling_Load
model_cl_vif <- lm(
Cooling_Load ~ Relative_Compactness + Surface_Area + Wall_Area +
Overall_Height + Orientation + Glazing_Area +
Glazing_Area_Distribution,
data = energy
)
# Compute VIF values
vif(model_hl_vif)
## GVIF Df GVIF^(1/(2*Df))
## Relative_Compactness 105.524054 1 10.272490
## Surface_Area 201.531134 1 14.196166
## Wall_Area 7.492984 1 2.737332
## Overall_Height 31.205474 1 5.586186
## Orientation 1.000000 3 1.000000
## Glazing_Area 1.260417 1 1.122683
## Glazing_Area_Distribution 1.260417 5 1.023414
vif(model_cl_vif)
## GVIF Df GVIF^(1/(2*Df))
## Relative_Compactness 105.524054 1 10.272490
## Surface_Area 201.531134 1 14.196166
## Wall_Area 7.492984 1 2.737332
## Overall_Height 31.205474 1 5.586186
## Orientation 1.000000 3 1.000000
## Glazing_Area 1.260417 1 1.122683
## Glazing_Area_Distribution 1.260417 5 1.023414
# Extract GVIF data for Heating_Load model (already refit without Roof_Area)
gvif_hl <- as_tibble(vif(model_hl_vif), rownames = "Variable") %>%
rename(GVIF = `GVIF`, Df = `Df`)
gvif_hl <- gvif_hl %>%
mutate(GVIF_adj = GVIF^(1/(2*Df)))
ggplot(gvif_hl, aes(reorder(Variable, GVIF_adj), GVIF_adj)) +
geom_col(fill = "#1f77b4") +
geom_hline(yintercept = 5, linetype = "dashed", colour = "orange") +
geom_hline(yintercept = 10, linetype = "dashed", colour = "red") +
coord_flip() +
labs(title = "Adjusted GVIF (Heating-Load Model)",
y = "GVIF^(1 / (2·Df))", x = NULL,
caption = "Dashed lines at 5 and 10 mark moderate and severe multicollinearity thresholds.")
# Extract GVIF data for Cooling-Load model (already refit without Roof_Area)
gvif_cl <- as_tibble(vif(model_cl_vif), rownames = "Variable") %>%
rename(GVIF = `GVIF`, Df = `Df`) %>%
mutate(GVIF_adj = GVIF^(1 / (2 * Df)))
ggplot(gvif_cl, aes(reorder(Variable, GVIF_adj), GVIF_adj)) +
geom_col(fill = "#1f77b4") +
geom_hline(yintercept = 5, linetype = "dashed", colour = "orange") +
geom_hline(yintercept = 10, linetype = "dashed", colour = "red") +
coord_flip() +
labs(
title = "Adjusted GVIF (Cooling-Load Model)",
y = "GVIF^(1 / (2·Df))",
x = NULL,
caption = "Dashed lines at 5 and 10 mark moderate and severe multicollinearity thresholds."
) +
theme_minimal()
This section presents a systematic optimization, refinement, and validation of the multiple regression model for predicting the energy efficiency of buildings. The objective is to enhance predictive accuracy and stability while addressing multicollinearity and interpretability concerns.
Model optimization involves systematically experimenting with predictor variables to improve model performance, taking into consideration statistical measures such as the F-statistic, degrees of freedom, and adjusted R². The objective is to identify the most effective combination of predictors that balances complexity and explanatory power without overfitting.
The process begins with a comprehensive initial regression model,
including all available predictors. Cooling_Load is
excluded because it is a separate response variable and not an
appropriate predictor when modeling Heating_Load.
Roof_Area is excluded due to perfect multicollinearity with
other geometry variables.
Very strong fit: Adjusted R² = 0.9228,
indicating that ~92% of the variance in Heating_Load is
explained by the model.
Key predictors:
Relative_Compactness, Surface_Area,
Wall_Area, Overall_Height, and
Glazing_Area are all highly significant.
Non-significant variables:
Orientation variables are not significant (p > 0.05),
suggesting limited influence.
Multicollinearity concern:
Surface_Area is strongly correlated with
Relative_Compactness (as seen in VIFs, not shown here),
which may inflate standard errors.
Variable reduction:
Removed Surface_Area and Orientation to reduce
multicollinearity and improve interpretability.
Model strength retained:
Adjusted R² remains high at 0.9202, only a slight drop, showing the
model still explains most of the variance.
Significance:
All remaining predictors are highly significant, with
Glazing_Area and its distributions showing strong effects
on Heating_Load.
Model simplicity trade-off:
Residual Standard Error (RSE) increases slightly from 2.804 to 2.85,
which is acceptable considering reduced model complexity.
Significance:
ANOVA shows a statistically significant difference between the models (p
= 0.0000107), meaning the full model fits marginally better.
Justification for simplification:
For practical purposes, the optimized model provides nearly equivalent
performance with fewer predictors, justifying its use.
# Initial full model
initial_model <- lm(Heating_Load ~ . - Cooling_Load - Roof_Area, data = energy)
summary(initial_model)
##
## Call:
## lm(formula = Heating_Load ~ . - Cooling_Load - Roof_Area, data = energy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1237 -1.4299 -0.2555 1.2972 7.4342
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 81.161469 18.189282 4.462 9.36e-06 ***
## Relative_Compactness -64.773432 9.832563 -6.588 8.38e-11 ***
## Surface_Area -0.087289 0.016317 -5.350 1.17e-07 ***
## Wall_Area 0.060813 0.006353 9.573 < 2e-16 ***
## Overall_Height 4.169954 0.322982 12.911 < 2e-16 ***
## Orientation3 0.067812 0.286185 0.237 0.813
## Orientation4 -0.052990 0.286185 -0.185 0.853
## Orientation5 -0.037500 0.286185 -0.131 0.896
## Glazing_Area 16.848333 0.853238 19.746 < 2e-16 ***
## Glazing_Area_Distribution1 4.527653 0.513717 8.814 < 2e-16 ***
## Glazing_Area_Distribution2 4.435986 0.513717 8.635 < 2e-16 ***
## Glazing_Area_Distribution3 4.183000 0.513717 8.143 1.60e-15 ***
## Glazing_Area_Distribution4 4.388208 0.513717 8.542 < 2e-16 ***
## Glazing_Area_Distribution5 4.182444 0.513717 8.142 1.61e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.804 on 754 degrees of freedom
## Multiple R-squared: 0.9241, Adjusted R-squared: 0.9228
## F-statistic: 706 on 13 and 754 DF, p-value: < 2.2e-16
Due to multicollinearity identified previously, the model was refined
by removing highly correlated predictors such as
Surface_Area, which exhibited high variance inflation
factor (VIF) values. This optimization step reduces redundancy among
predictors:
# Optimized model after removing multicollinear predictor Surface_Area
optimized_model <- lm(Heating_Load ~ Relative_Compactness + Wall_Area +
Overall_Height + Glazing_Area + Glazing_Area_Distribution,
data = energy)
summary(optimized_model)
##
## Call:
## lm(formula = Heating_Load ~ Relative_Compactness + Wall_Area +
## Overall_Height + Glazing_Area + Glazing_Area_Distribution,
## data = energy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3068 -1.5588 0.0232 1.4189 7.3450
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.183790 2.589482 -5.864 6.76e-09 ***
## Relative_Compactness -14.532402 2.958472 -4.912 1.10e-06 ***
## Wall_Area 0.034976 0.004194 8.340 3.49e-16 ***
## Overall_Height 5.606753 0.182300 30.756 < 2e-16 ***
## Glazing_Area 16.848333 0.867099 19.431 < 2e-16 ***
## Glazing_Area_Distribution1 4.527653 0.522063 8.673 < 2e-16 ***
## Glazing_Area_Distribution2 4.435986 0.522063 8.497 < 2e-16 ***
## Glazing_Area_Distribution3 4.183000 0.522063 8.012 4.24e-15 ***
## Glazing_Area_Distribution4 4.388208 0.522063 8.406 < 2e-16 ***
## Glazing_Area_Distribution5 4.182444 0.522063 8.011 4.28e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.85 on 758 degrees of freedom
## Multiple R-squared: 0.9212, Adjusted R-squared: 0.9202
## F-statistic: 984.3 on 9 and 758 DF, p-value: < 2.2e-16
To assess the improvement, an ANOVA test was performed comparing the initial and optimized models:
# Comparing models using ANOVA
anova(initial_model, optimized_model)
## Analysis of Variance Table
##
## Model 1: Heating_Load ~ (Relative_Compactness + Surface_Area + Wall_Area +
## Roof_Area + Overall_Height + Orientation + Glazing_Area +
## Glazing_Area_Distribution + Cooling_Load) - Cooling_Load -
## Roof_Area
## Model 2: Heating_Load ~ Relative_Compactness + Wall_Area + Overall_Height +
## Glazing_Area + Glazing_Area_Distribution
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 754 5928.4
## 2 758 6155.0 -4 -226.67 7.2073 1.07e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The plot below compares estimated coefficients from the initial and
optimized Heating_Load models. As with the heating models,
this visualization highlights how simplification and multicollinearity
reduction influence variable contributions and stability.
# Extract tidy summaries
initial_tidy <- tidy(initial_model) %>%
filter(term != "(Intercept)") %>%
mutate(model = "Initial")
optimized_tidy <- tidy(optimized_model) %>%
filter(term != "(Intercept)") %>%
mutate(model = "Optimized")
# Combine both models
coef_data <- bind_rows(initial_tidy, optimized_tidy)
# Plot
ggplot(coef_data, aes(x = term, y = estimate, color = model)) +
geom_point(position = position_dodge(width = 0.5), size = 3) +
geom_errorbar(aes(ymin = estimate - std.error, ymax = estimate + std.error),
width = 0.2, position = position_dodge(width = 0.5)) +
coord_flip() +
labs(title = "Coefficient Comparison: Initial vs. Optimized Heating Load Models",
x = "Predictor", y = "Estimate (± Std. Error)") +
theme_minimal()
The same optimization logic was applied to Cooling_Load,
beginning with a comprehensive initial model, then moving to the
optimized model analysis, and as with the Heating_Load
model, performing an ANOVA.
Strong fit: Adjusted R² = 0.8885 (Multiple R² =
0.8904), explaining about 89 % of variance in
Cooling_Load.
Key predictors:
Relative_Compactness, Surface_Area,
Wall_Area, Overall_Height, and
Glazing_Area are all statistically significant.
Non-significant variables:
Orientation dummies show no meaningful contribution (p >
0.05).
Multicollinearity concern:
Surface_Area is highly collinear with
Relative_Compactness, potentially distorting
coefficients.
Variable reduction:
Removed Surface_Area and Orientation to
alleviate collinearity and improve interpretability.
Performance retained:
Adjusted R² = 0.8851 (Multiple R² = 0.8865) — only a minimal drop, so
the model still explains nearly 89 % of the variance.
Significance maintained:
All remaining predictors stay highly significant;
Glazing_Area and Overall_Height remain strong
positive drivers of cooling demand.
Complexity trade-off:
Residual Standard Error rises slightly from 3.176 to 3.224, an
acceptable cost for a clearer model.
Statistical test:
ANOVA shows a statistically significant difference between models (p =
0.0000243), favoring the full model.
Practical decision:
Given the tiny loss in Adjusted R² and better multicollinearity
diagnostics, the optimized model is preferred in practice.
# Initial full model for Cooling_Load
initial_model_cl <- lm(Cooling_Load ~ . - Heating_Load - Roof_Area, data = energy)
summary(initial_model_cl)
##
## Call:
## lm(formula = Cooling_Load ~ . - Heating_Load - Roof_Area, data = energy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7082 -1.6996 -0.2814 1.4302 11.0365
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 96.370070 20.604285 4.677 3.45e-06 ***
## Relative_Compactness -70.787707 11.138038 -6.355 3.59e-10 ***
## Surface_Area -0.088245 0.018484 -4.774 2.17e-06 ***
## Wall_Area 0.044682 0.007196 6.209 8.80e-10 ***
## Overall_Height 4.283843 0.365865 11.709 < 2e-16 ***
## Orientation3 -0.291979 0.324182 -0.901 0.368054
## Orientation4 -0.124219 0.324182 -0.383 0.701697
## Orientation5 0.349115 0.324182 1.077 0.281865
## Glazing_Area 13.252917 0.966523 13.712 < 2e-16 ***
## Glazing_Area_Distribution1 2.160035 0.581924 3.712 0.000221 ***
## Glazing_Area_Distribution2 1.977396 0.581924 3.398 0.000714 ***
## Glazing_Area_Distribution3 1.639965 0.581924 2.818 0.004956 **
## Glazing_Area_Distribution4 1.995660 0.581924 3.429 0.000638 ***
## Glazing_Area_Distribution5 1.695521 0.581924 2.914 0.003678 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.176 on 754 degrees of freedom
## Multiple R-squared: 0.8904, Adjusted R-squared: 0.8885
## F-statistic: 471.3 on 13 and 754 DF, p-value: < 2.2e-16
To improve model interpretability and reduce multicollinearity,
Surface_Area and Orientation were removed in
the refined version:
# Optimized model for Cooling_Load
optimized_model_cl <- lm(Cooling_Load ~ Relative_Compactness + Wall_Area +
Overall_Height + Glazing_Area + Glazing_Area_Distribution,
data = energy)
summary(optimized_model_cl)
##
## Call:
## lm(formula = Cooling_Load ~ Relative_Compactness + Wall_Area +
## Overall_Height + Glazing_Area + Glazing_Area_Distribution,
## data = energy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9770 -1.5473 0.0929 1.4291 11.0683
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.040951 2.929873 -0.355 0.722472
## Relative_Compactness -19.996672 3.347367 -5.974 3.56e-09 ***
## Wall_Area 0.018562 0.004745 3.912 9.98e-05 ***
## Overall_Height 5.736372 0.206264 27.811 < 2e-16 ***
## Glazing_Area 13.252917 0.981081 13.508 < 2e-16 ***
## Glazing_Area_Distribution1 2.160035 0.590689 3.657 0.000273 ***
## Glazing_Area_Distribution2 1.977396 0.590689 3.348 0.000855 ***
## Glazing_Area_Distribution3 1.639965 0.590689 2.776 0.005633 **
## Glazing_Area_Distribution4 1.995660 0.590689 3.379 0.000766 ***
## Glazing_Area_Distribution5 1.695521 0.590689 2.870 0.004214 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.224 on 758 degrees of freedom
## Multiple R-squared: 0.8865, Adjusted R-squared: 0.8851
## F-statistic: 657.7 on 9 and 758 DF, p-value: < 2.2e-16
ANOVA was used to compare the two models.
# Comparing Cooling_Load models using ANOVA
anova(initial_model_cl, optimized_model_cl)
## Analysis of Variance Table
##
## Model 1: Cooling_Load ~ (Relative_Compactness + Surface_Area + Wall_Area +
## Roof_Area + Overall_Height + Orientation + Glazing_Area +
## Glazing_Area_Distribution + Heating_Load) - Heating_Load -
## Roof_Area
## Model 2: Cooling_Load ~ Relative_Compactness + Wall_Area + Overall_Height +
## Glazing_Area + Glazing_Area_Distribution
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 754 7607.1
## 2 758 7879.6 -4 -272.47 6.7518 2.429e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The plot below compares estimated coefficients from the initial and
optimized Cooling_Load models. As with the heating models,
this visualization highlights how simplification and multicollinearity
reduction influence variable contributions and stability.
initial_tidy_cl <- tidy(initial_model_cl) %>%
filter(term != "(Intercept)") %>%
mutate(model = "Initial")
optimized_tidy_cl <- tidy(optimized_model_cl) %>%
filter(term != "(Intercept)") %>%
mutate(model = "Optimized")
# Combine both Cooling_Load models
coef_data_cl <- bind_rows(initial_tidy_cl, optimized_tidy_cl)
# Plot
ggplot(coef_data_cl, aes(x = term, y = estimate, color = model)) +
geom_point(position = position_dodge(width = 0.5), size = 3) +
geom_errorbar(aes(ymin = estimate - std.error, ymax = estimate + std.error),
width = 0.2, position = position_dodge(width = 0.5)) +
coord_flip() +
labs(
title = "Coefficient Comparison: Initial vs. Optimized Cooling Load Models",
x = "Predictor",
y = "Estimate (± Std. Error)"
) +
theme_minimal()
This section evaluates the validity of the optimized regression
models for Heating_Load and Cooling_Load by
examining residual diagnostics and testing for multicollinearity using
VIF metrics.
The four diagnostic panels assess whether the optimized Heating _Load regression meets key linear‐model assumptions.
Residuals vs Fitted shows a curved pattern instead of a random scatter, indicating some remaining non-linearity or mild heteroscedasticity (variance increases as fitted values rise).
Q–Q plot departs from the 45° line in both tails, confirming non-normal residuals; the Shapiro–Wilk test corroborates this (W = 0.964, p ≈ 8.6 × 10⁻¹³).
Scale–Location (spread-location) also reveals a slight upward trend, reinforcing evidence of non-constant variance across fitted values.
Residuals vs Leverage identifies a few moderate-leverage points, but none exceed Cook’s distance threshold, so no single observation is unduly influencing the model.
Overall, the optimized model retains its strong explanatory power but still shows modest violations of normality and homoscedasticity. These issues are common with large samples and may be mitigated by using robust standard errors or transforming predictors; however, no influential outliers threaten coefficient stability.
# Residual plots for optimized Heating_Load model
par(mfrow = c(2,2))
plot(optimized_model)
# Shapiro-Wilk test for Heating_Load residuals
shapiro.test(residuals(optimized_model))
##
## Shapiro-Wilk normality test
##
## data: residuals(optimized_model)
## W = 0.96406, p-value = 8.556e-13
The diagnostic plots for the optimized Cooling _Load
model echo the findings from the heating analysis but with slightly more
pronounced variance issues.
Residuals vs Fitted reveals a curved, funnel-like
pattern—residual spread widens as fitted Cooling _Load
rises—signalling heteroscedasticity and some remaining
non-linearity.
Q–Q plot departs markedly from the 45° line in both tails; the Shapiro–Wilk test (W = 0.944, p < 2 × 10⁻¹⁶) confirms a strong deviation from normality.
Scale–Location shows residual variance increasing with fitted values, reinforcing heteroscedasticity concerns.
Residuals vs Leverage identifies a handful of higher-leverage observations, but none cross Cook’s-distance lines, so no single point unduly drives the model.
The Cooling _Load model retains high explanatory power
but exhibits clear non-constant variance and non-normal residuals.
Remedies could include transforming Cooling _Load (e.g.,
log or Box-Cox), adding interaction or polynomial terms, or using robust
(heteroscedasticity-consistent) standard errors. Importantly, no
influential outliers threaten stability, so coefficient estimates remain
reliable once robust inference methods are applied.
# Residual plots for optimized Cooling_Load model
par(mfrow = c(2,2))
plot(optimized_model_cl)
# Shapiro-Wilk test for Cooling_Load residuals
shapiro.test(residuals(optimized_model_cl))
##
## Shapiro-Wilk normality test
##
## data: residuals(optimized_model_cl)
## W = 0.94417, p-value < 2.2e-16
par(mfrow = c(2,2))
# Heating Load: Initial
plot(initial_model$fitted.values, resid(initial_model),
main = "Heating Load: Residuals vs Fitted (Initial)",
xlab = "Fitted values", ylab = "Residuals", pch = 20, col = "gray40")
abline(h = 0, col = "red")
# Heating Load: Optimized
plot(optimized_model$fitted.values, resid(optimized_model),
main = "Heating Load: Residuals vs Fitted (Optimized)",
xlab = "Fitted values", ylab = "Residuals", pch = 20, col = "blue")
abline(h = 0, col = "red")
# Cooling Load: Initial
plot(initial_model_cl$fitted.values, resid(initial_model_cl),
main = "Cooling Load: Residuals vs Fitted (Initial)",
xlab = "Fitted values", ylab = "Residuals", pch = 20, col = "gray40")
abline(h = 0, col = "red")
# Cooling Load: Optimized
plot(optimized_model_cl$fitted.values, resid(optimized_model_cl),
main = "Cooling Load: Residuals vs Fitted (Optimized)",
xlab = "Fitted values", ylab = "Residuals", pch = 20, col = "blue")
abline(h = 0, col = "red")
The Variance Inflation Factor (VIF) diagnostics for both optimized
models indicate acceptable multicollinearity levels. While
Relative_Compactness and Overall_Height
exhibit moderately elevated VIF values (around 9.25), their adjusted
GVIFs (~3.04 and ~3.10, respectively) remain below the commonly used
threshold of 5. This suggests that multicollinearity is present but not
severe enough to warrant removal of these variables. All other
predictors, including Wall_Area, Glazing_Area,
and Glazing_Area_Distribution, show low adjusted GVIF
values near 1, confirming minimal collinearity. Overall, the predictor
set in both models is stable, and coefficient estimates can be
interpreted with confidence.
# VIF for optimized Heating_Load model
library(car)
vif(optimized_model)
## GVIF Df GVIF^(1/(2*Df))
## Relative_Compactness 9.250283 1 3.041428
## Wall_Area 3.161934 1 1.778183
## Overall_Height 9.626103 1 3.102596
## Glazing_Area 1.260417 1 1.122683
## Glazing_Area_Distribution 1.260417 5 1.023414
# VIF for optimized Cooling_Load model
vif(optimized_model_cl)
## GVIF Df GVIF^(1/(2*Df))
## Relative_Compactness 9.250283 1 3.041428
## Wall_Area 3.161934 1 1.778183
## Overall_Height 9.626103 1 3.102596
## Glazing_Area 1.260417 1 1.122683
## Glazing_Area_Distribution 1.260417 5 1.023414
Energy efficiency in buildings is expressed through two complementary
thermal performance outcomes: Heating_Load and
Cooling_Load. These two response variables are modeled
separately due to their distinct seasonal mechanisms but are jointly
necessary to understand total energy efficiency.
All predictor variables considered in this analysis—geometry, glazing, orientation—were evaluated based on their ability to explain either or both of these energy outcomes. A predictor that influences only one load (e.g., glazing may strongly affect cooling but not heating) still holds value in the broader goal of modeling energy efficiency.
Separate models enable a nuanced understanding of each load component while aligning with the overall project goal: identifying which physical design features most influence energy consumption in residential buildings.
Although both Heating_Load and Cooling_Load
are indicators of energy efficiency, each reflects distinct thermal
dynamics and design considerations. Heating load is influenced primarily
by heat loss in colder conditions, while cooling load is driven by heat
gain during warmer periods. These processes often respond differently to
the same predictor variables—such as glazing, which can increase solar
gains (raising cooling demand) while also reducing heating needs.
Separate models support climate-specific architectural strategies and
enable performance evaluation without introducing causal feedback
between response variables. This approach avoids using one outcome
(Cooling_Load) to predict another
(Heating_Load), which would compromise model validity and
generalizability. Therefore, Heating_Load and
Cooling_Load are modeled independently, using identical
predictor sets derived from building design features.
Predictors were initially selected based on literature from Tsanas
& Xifara (2012), which highlighted Relative_Compactness
and Overall_Height as primary determinants of heating
energy load. Additionally, Glazing_Area and
Glazing_Area_Distribution were chosen due to their
recognized impact on thermal efficiency.
During model refinement, the following steps were taken:
Removed Surface_Area due to severe multicollinearity
with Relative_Compactness (high VIF)
Excluded Orientation, which consistently showed low
statistical significance
Removal of multicollinear variables improved model stability, interpretability, and robustness. The exclusion of redundant variables slightly reduced degrees of freedom but significantly enhanced model clarity and reduced the risk of inflated standard errors.
The final model represents an optimal solution that balances predictive accuracy and model complexity. The carefully selected subset of predictors provides strong explanatory power, demonstrated by a high adjusted R² and significant F-statistic, while maintaining simplicity and interpretability.
| Metric | Heating Load – Initial Model | Heating Load – Optimized Model | Cooling Load – Initial Model | Cooling Load – Optimized Model |
|---|---|---|---|---|
| Residual SE | 2.804 | 2.85 | 3.176 | 3.20 |
| Adjusted R² | 0.9228 | 0.9202 | 0.8885 | 0.8864 |
| R² | 0.9241 | 0.9212 | 0.8904 | 0.8883 |
| F-statistic | 706 on 13 and 754 df | 984.3 on 9 and 758 df | 471.3 on 13 and 754 df | 716.9 on 9 and 758 df |
| Model p-value | 0.00000000000000022 | 0.00000000000000022 | 0.00000000000000022 | 0.00000000000000022 |
| Notes | Roof_Area dropped; Orientation not significant | Surface_Area and Orientation removed to improve multicollinearity | Roof_Area dropped; Orientation not significant | Surface_Area and Orientation removed to improve multicollinearity |
The comparison demonstrates that the optimized
Heating_Load model retains strong predictive power while
reducing complexity and multicollinearity. An optimized
Cooling_Load model is also constructed to mirror the
refinement applied to Heating_Load, supporting a consistent
and complete analysis of energy efficiency.
The optimized regression equation derived from the final selected predictors is as follows:
Heating_Load = Intercept + β₁(Relative_Compactness) + β₂(Wall_Area) + β₃(Overall_Height) + β₄(Glazing_Area) + β(Glazing_Area_Distribution)
Cooling_Load = Intercept + β₁(Relative_Compactness) + β₂(Wall_Area) + β₃(Overall_Height) + β₄(Glazing_Area) + β(Glazing_Area_Distribution)
Key metrics obtained from the optimized model include:
Adjusted R² (indicating model explanatory power adjusted for the number of predictors)
F-statistic (indicating overall model significance)
Degrees of freedom (reflecting the complexity and parameterization of the model)
Reflecting on the optimization process of a multiple regression model, several key metrics guided the approach to refinement. Degrees of freedom were considered to ensure that model complexity remained manageable, allowing for valid statistical inference without exhausting the available sample size. A model with fewer predictors and more residual degrees of freedom is more robust to sampling variability and avoids overfitting, especially when generalizing to new data.
The F-statistic was used to assess the joint explanatory power of the predictors. A large and statistically significant F-statistic justified retaining the model structure after simplification. In practice, if removing a variable caused a noticeable drop in the F-statistic or adjusted R², the change was reconsidered. This metric supported iterative decisions about which predictors enhanced model strength versus those that added complexity without sufficient benefit.
The number of predictors shaped the overall strategy by requiring a
balance between explanatory power and parsimony. Each additional
predictor increases model complexity and decreases degrees of freedom.
Redundant or highly collinear variables—such as
Surface_Area—were eliminated after variance inflation
factor (VIF) diagnostics flagged instability. This step improved model
interpretability while maintaining predictive performance.
This process was not purely mechanical; it required weighing statistical diagnostics against theoretical considerations from prior literature and domain knowledge. Ultimately, model refinement was guided by a desire for simplicity, statistical rigor, and practical insight into energy efficiency. These principles align with best practices described by Harrell (2015) and James et al. (2021), emphasizing clarity, reproducibility, and resistance to overfitting.
Aiken, L. S., West, S. G., & Reno, R. R. (1991). Multiple regression: Testing and interpreting interactions. Sage Publications.
Belsley, D. A., Kuh, E., & Welsch, R. E. (2005). Regression diagnostics: Identifying influential data and sources of collinearity. John Wiley & Sons.
Crawley, D. B., Hand, J. W., Kummert, M., & Griffith, B. T. (2008). Contrasting the capabilities of building energy performance simulation programs. Building and Environment, 43(4), 661-673. https://doi.org/10.1016/j.buildenv.2006.10.027 Fox, J., & Weisberg, S. (2019). An R companion to applied regression (3rd ed.). Sage Publications.
Dormann, C. F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., García Marquéz, J. R., Gruber, B., Lafourcade, B., Leitão, P. J., Münkemüller, T., McClean, C., Osborne, P. E., Reineking, B., Schröder, B., Skidmore, A. K., Zurell, D., & Lautenbach, S. (2013). Collinearity: A review of methods to deal with it and a simulation study evaluating their performance. Ecography, 36(1), 27-46. https://doi.org/10.1111/j.1600-0587.2012.07348.x
Harrell, F. E. (2015). Regression modeling strategies: With applications to linear models, logistic and ordinal regression, and survival analysis (2nd ed.). Springer. https://doi.org/10.1007/978-3-319-19425-7
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning: With applications in R (2nd ed.). Springer. https://doi.org/10.1007/978-1-0716-1418-1
Tsanas, A., & Xifara, A. (2012). Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools. Energy and Buildings, 49, 560–567. https://doi.org/10.1016/j.enbuild.2012.03.003
UCI Machine Learning Repository. (2012). Energy efficiency [Data set]. University of California, Irvine. https://archive.ics.uci.edu/dataset/242/energy+efficiency