1. Introduction

This project applies multiple regression analysis to predict the energy efficiency of residential buildings. A dataset from the UCI Machine Learning Repository (2012) is used to evaluate how various physical characteristics, such as geometric drivers, glazing, and orientation, influence heating and cooling load requirements. The objective is to construct, assess, and refine multiple regression models that estimate energy demand while balancing model complexity with predictive performance.

Key diagnostic and statistical indicators are emphasized, including degrees of freedom, the F-statistic, adjusted R², and predictor significance levels. Through iterative model optimization, the analysis seeks to enhance predictive accuracy without introducing overfitting, preserving both model simplicity and explanatory value.

The dataset contains 768 simulated building prototypes, each characterized by eight input variables and two continuous energy performance metrics, Heating_Load and Cooling_Load (measured in kWh/m²).

Simulations were generated under standardized conditions and curated by Tsanas and Xifara (2012), offering a structured basis for quantifying how design-related factors affect operational energy use. The dataset provides an introductory platform for exploring energy-efficient architectural strategies through statistical modeling.

2. Data Exploration and Pre‑processing

Data Import and Observations

As previously stated, this dataset is originally derived from Ecotect simulations and first described by Tsanas and Xifara (2012), the Energy Efficiency dataset is hosted by the UCI Machine-Learning Repository (2012).

The authors generated 768 residential-building prototypes by systematically varying envelope geometry, glazing amount/distribution, and orientation, then recorded each case’s annual Heating Load and Cooling Load (kWh m⁻²).

Dataset Snapshot

Variable	Role	Brief meaning
`Relative_Compactness` – `Roof_Area`	Predictors (continuous)	Geometric drivers of heat exchange (e.g., wall and roof areas).
`Overall_Height`	Predictor (continuous)	Two discrete levels (3.5 m and 7 m).
`Orientation`	Predictor (ordinal / categorical)	Cardinal direction (2 = N, 3 = E, 4 = S, 5 = W).
`Glazing_Area`	Predictor (continuous)	Fraction of the exterior surface that is glazed (0 – 0.40).
`Glazing_Area_Distribution`	Predictor (categorical)	Six patterns describing how glazing is spread across façades.
`Heating_Load`, `Cooling_Load`	Responses (continuous)	Annual energy needed for heating and cooling, respectively (kWh m⁻²).

Except for Orientation (a four-level factor) and Glazing_Area_Distribution (a six-level factor), all remaining predictors are continuous numeric variables, giving us a mix of six continuous geometry-related features and two orderly categorical design choices.

Continuous predictors: Relative Compactness, Surface Area, Wall Area, Roof Area, Overall Height, and Glazing Area
Ordinal / categorical predictors: Orientation (4 levels) and Glazing_Area_Distribution (6 levels)

Output from `glimpse()` Observations

Rows = 768 — every distinct shape × parameter combination.
Columns = 10, all imported as <dbl>; we will recast Orientation and Glazing_Area_Distribution as factors.
No missing values — each variable vector is complete.
Sanity checks:
- Relative_Compactness spans roughly 0.62 – 1.00 (first few rows ≈ 0.98).
- Surface, wall, and roof areas vary consistently with footprint changes.
- Initial Heating_Load ≈ 15 kWh m⁻² and Cooling_Load ≈ 21 kWh m⁻², aligning with the published ranges (heating 6 – 43, cooling 10 – 49).

This output confirms that the dataset is clean, complete, and ready for analysis. With 768 distinct building configurations and no missing values, we can proceed confidently with regression modeling without requiring imputation or corrective preprocessing. The variable types and observed value range, such as Relative Compactness between 0.62 and 1.00, and Heating_Load and Cooling_Load within expected operating ranges, are consistent with those reported by Tsanas and Xifara (2012), who originally curated the dataset through energy simulation.

Among these variables, Relative Compactness (RC) stands out as a key predictor of energy demand, as it quantifies a building’s surface-area-to-volume ratio, a primary driver of thermal efficiency. More compact buildings (higher RC values) typically exhibit lower heating and cooling loads due to reduced envelope exposure.

Tsanas and Xifara (2012) found RC to be one of the most influential predictors of both Heating Load and Cooling Load using machine learning methods, and our dataset reflects similar patterns. This alignment further supports the reliability of the data import and reinforces RC’s importance in any model aimed at estimating energy performance.

Renaming Columns

The original file labels variables as X1–X8, y1, y2. Renaming them to descriptive terms (Relative_Compactness, Surface_Area, Heating_Load, etc.) makes model output, plots, and narrative discussion instantly understandable, reduces coding mistakes, and aligns our analysis with the terminology used in peer-reviewed studies.

setwd("E:/DataSets/energy+efficiency")
energy_raw <- read_excel("ENB2012_data.xlsx")

# Rename columns for clarity
colnames(energy_raw) <- c("Relative_Compactness", "Surface_Area", "Wall_Area", "Roof_Area",
                          "Overall_Height", "Orientation", "Glazing_Area", "Glazing_Area_Distribution",
                          "Heating_Load", "Cooling_Load")
energy_raw %>% glimpse()

## Rows: 768
## Columns: 10
## $ Relative_Compactness      <dbl> 0.98, 0.98, 0.98, 0.98, 0.90, 0.90, 0.90, 0.…
## $ Surface_Area              <dbl> 514.5, 514.5, 514.5, 514.5, 563.5, 563.5, 56…
## $ Wall_Area                 <dbl> 294.0, 294.0, 294.0, 294.0, 318.5, 318.5, 31…
## $ Roof_Area                 <dbl> 110.25, 110.25, 110.25, 110.25, 122.50, 122.…
## $ Overall_Height            <dbl> 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0,…
## $ Orientation               <dbl> 2, 3, 4, 5, 2, 3, 4, 5, 2, 3, 4, 5, 2, 3, 4,…
## $ Glazing_Area              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Glazing_Area_Distribution <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Heating_Load              <dbl> 15.55, 15.55, 15.55, 15.55, 20.84, 21.46, 20…
## $ Cooling_Load              <dbl> 21.33, 21.33, 21.33, 21.33, 28.28, 25.38, 25…

Handling Missing Values, Outliers & Data Issues

This section establishes data integrity prior to model development. The objective is to validate data completeness, ensure correct variable classifications, and flag any structural irregularities that may compromise model interpretability or operational relevance.

Missing Values

No missing values were identified across any of the dataset’s ten variables. A column-wise check (sapply()) confirmed complete population of all entries (sum(is.na(x)) == 0). This result eliminates the need for imputation or data repair procedures and enables clean, uninterrupted progression to model training. The absence of missing data ensures that model coefficients will not be distorted by interpolation artifacts or surrogate splits—an important consideration for regulated environments or energy impact evaluations.

Variable Typing

Orientation and Glazing_Area_Distribution represent categorical design features, though originally encoded as numeric integers. These fields were explicitly recast as factor types to ensure that regression models interpret them as discrete categories, not continuous metrics (Fox & Weisberg, 2019). This step enhances interpretability of coefficients and avoids invalid linear assumptions (e.g., “East” being twice as strong as “North”). Proper typing ensures alignment with actual architectural inputs used in early-stage design or retrofit evaluations.

Outlier Inspection

Boxplots were generated for all continuous variables to assess the presence of outliers or data anomalies. Results confirm the dataset’s structural integrity:

Wall Area, Surface Area, and Roof Area display moderate statistical outliers, which are defined as points beyond 1.5×IQR. But all values remain physically plausible and consistent with variations across the 12 simulated building configurations.
Relative Compactness is bounded between 0.62 and 1.00 by design and shows no signs of data distortion.
Glazing Area includes values near zero, accurately representing certain low-transparency façade scenarios. These are intentional design variations—not data issues.
Heating Load and Cooling Load exhibit variation across a realistic operating range. Heating Load values span ~6–43 kWh/m², consistent with published simulation results from Tsanas and Xifara (2012).

No entries were removed or transformed. All data points fall within ranges expected from controlled simulation inputs and real-world building physics. If any model diagnostics later reveal undue influence from high-leverage points, follow-up procedures (e.g., Cook’s Distance) will be applied.

Summary

The dataset passed all quality checks. No missing values or structural errors were detected. Categorical design features have been correctly reclassified, and all numeric fields remain within valid operating bounds. While minor statistical outliers were identified, none required removal. This clean data foundation supports the development of trustworthy and interpretable regression models for energy efficiency forecasting.

# Check for missing values
sapply(energy_raw, function(x) sum(is.na(x)))

##      Relative_Compactness              Surface_Area                 Wall_Area 
##                         0                         0                         0 
##                 Roof_Area            Overall_Height               Orientation 
##                         0                         0                         0 
##              Glazing_Area Glazing_Area_Distribution              Heating_Load 
##                         0                         0                         0 
##              Cooling_Load 
##                         0

# Convert ordinal variables to factors
energy <- energy_raw %>%
  mutate(across(c(Orientation, Glazing_Area_Distribution), factor))

# Basic outlier inspection (boxplots)
energy %>%
  pivot_longer(-c(Orientation, Glazing_Area_Distribution)) %>%
  ggplot(aes(x = name, y = value)) +
  geom_boxplot() +
  coord_flip() +
  labs(title = "Boxplots for Continuous Variables")

3. Correlation Analysis

Understanding the interrelationships among predictor variables and response variables is essential for developing stable and interpretable regression models. This section evaluates the strength and direction of linear associations between continuous variables to inform feature selection, detect multicollinearity risks, and identify key drivers of energy load outcomes.

Correlation Matrix & Heatmap

Correlation analysis was performed to evaluate the strength and direction of linear relationships between continuous variables in the dataset. This step helps identify redundant predictors (multicollinearity) and uncovers which building characteristics are most associated with energy performance outcomes (Dormann et al., 2013). Results from this analysis inform variable selection and guide regression model optimization.

Methodology

A subset of numeric variables was extracted, excluding ordinal/categorical fields such as Orientation and Glazing_Area_Distribution. A correlation matrix was computed using Pearson’s method, and the lower triangle was visualized as a heatmap using ggcorrplot(). Correlation coefficients (r) range from -1.0 (perfect inverse relationship) to +1.0 (perfect direct relationship).

Observations

Heating_Load and Cooling_Load are highly correlated (r = 0.98), suggesting shared drivers such as Relative_Compactness, Surface_Area, and Glazing_Area, indicating potential for model reuse or joint-modeling strategies.
Relative_Compactness is strongly negatively correlated with Surface_Area (r = -0.99) and Roof_Area (r = -0.87), consistent with the definition of compactness as a ratio of volume to exposed surface.
Overall_Height exhibits a strong positive correlation with both Heating_Load (r = 0.89) and Cooling_Load (r = 0.90), indicating that taller building forms may demand more energy due to increased façade exposure or stratification effects.
Surface_Area shows strong positive correlation with Heating_Load (r = 0.88) and Cooling_Load (r = 0.86)—supporting the well-established relationship between envelope size and thermal load.
Wall_Area and Glazing_Area show relatively weak or near-zero correlation with energy loads, suggesting they may play less influential roles in linear models or be context-dependent based on distribution and orientation.
Roof_Area displays a mild negative correlation with Heating_Load (r = -0.66), likely influenced by how certain roof configurations offset or amplify total exposure.

Implications

This correlation analysis highlights key predictors that influence energy consumption. While Relative_Compactness is mathematically linked to Surface_Area and Roof_Area, all three variables will be evaluated in model development to assess multicollinearity impact (e.g., via VIF). The high correlation between energy loads also raises the possibility of shared model structures or transfer learning approaches for future predictive modeling tasks.

num_vars <- energy %>% select_if(is.numeric)
cor_mat <- cor(num_vars)

ggcorrplot(cor_mat, hc.order = TRUE, type = "lower",
           lab = TRUE, lab_size = 2.5, digits = 2,
           title = "Correlation Matrix (Continuous Variables)")

Scatterplot Matrix

A scatterplot matrix was generated to visually explore bivariate relationships between all continuous variables. Unlike the heatmap in Section 3.1, which summarizes correlation strength numerically, the scatterplot matrix reveals nonlinear patterns, distribution shapes, and outlier structures that may influence model behavior. This analysis also supports early detection of multicollinearity, non-constant variance, and potential interactions among predictors.

Observations

Relative_Compactness vs. Surface_Area: A near-perfect negative linear relationship is observed (Corr: -0.99), confirming that as compactness increases, total surface area decreases. This reinforces prior findings and highlights a high multicollinearity risk if both are included in the same regression model.
Overall_Height vs. Roof_Area: A strong nonlinear inverse relationship is evident (Corr: -0.97). The scatterplot indicates two height levels (3.5 m and 7 m) associated with discrete roof configurations. This stepwise pattern suggests structural collinearity due to the fixed simulation design.
Heating_Load vs. Cooling_Load: A tightly clustered, positively sloped linear pattern confirms their very strong correlation (Corr: 0.98). These targets share multiple physical drivers and may result in redundant variance explanations in independent models.
Surface_Area, Roof_Area, and Overall_Height all show strong correlations with energy loads (Heating_Load, Cooling_Load) and also among each other, indicating a triangular multicollinearity zone that warrants careful feature selection or regularization.
Glazing_Area and Wall_Area exhibit minimal correlation with energy outcomes. Scatterplots reveal mostly horizontal spread, indicating a weaker linear influence though their impact may still be context-dependent based on Orientation and Glazing_Area_Distribution.
Nonlinear Patterns: Several relationships, such those involving Overall_Height and Roof_Area, deviate from strict linearity. U-shaped or stepwise effects suggest the need for interaction terms or nonlinear modeling approaches (e.g., polynomial, spline, or tree-based models) in advanced regression.

Implications

The scatterplot matrix confirms key predictor relationships identified in the correlation matrix, while adding visual clarity on functional form and variable distribution. The presence of high multicollinearity—particularly among compactness, surface area, and roof geometry—emphasizes the importance of diagnostic testing (e.g., VIF) and dimensionality control during model specification. Certain nonlinearities may also motivate transformations or inclusion of interaction terms to improve model fit (Aiken et al., 1991).

GGally::ggpairs(num_vars,
                columns = 1:ncol(num_vars),
                aes(alpha = 0.4))

4. Initial Multiple Regression Models

The objective of this section is to fit baseline ordinary-least-squares (OLS) regressions that quantify how the eight building-design predictors collectively explain variability in Heating_Load and Cooling_Load. These initial models establish benchmark explanatory power, identify dominant predictors, and flag any obvious multicollinearity issues before iterative refinement (Crawley et al., 2008).

Heating Load – Initial Model

Metric	Value	Interpretation & Implications
Model call	`lm(Heating_Load ~ . - Cooling_Load)`	`Heating_Load` is regressed on every available predictor except `Cooling_Load`.
Residual spread	Min = −7.12, 1Q = −1.43, Median = −0.26, 3Q = 1.30, Max = 7.43	Residuals center near 0 with moderate dispersion (± 7 kWh m⁻²). No extreme skew.
Coefficients	Most predictors highly significant (p < 0.001) except the three `Orientation` dummy levels; `Roof_Area` dropped (aliased).	• `Relative_Compactness` carries a large negative coefficient (−64.77), confirming that more-compact forms require less heating. • `Glazing_Area` has a strong positive effect (+16.85 per 0.1 increment), reflecting heat loss through glass. • `Orientation` dummies are non-significant → orientation adds little to heating demand once geometry and glazing are in. • Perfect collinearity between `Roof_Area` and other geometry variables forced its removal (singularity).
R²	0.9241	About 92 % of the variability in `Heating_Load` is explained—excellent explanatory power.
Adjusted R²	0.9228	Penalises for 13 active predictors; still above 0.92, signalling minimal over-fitting at this stage.
F-statistic	706 on 13 and 754 df	Joint test that all coefficients (except intercept) equal 0.
Model p-value	< 0.00000000000000022	Probability of observing F = 706 under a true-null model is virtually zero → the overall regression is highly significant.
Residual SE	2.804 kWh m⁻²	Typical prediction error; small relative to response range (≈ 6–43 kWh m⁻²).

Practical Meaning

High explanatory power: Geometry, glazing, and height together account for over 92 % of heating-load variation. The model is already strong before any refinement.
Key predictors
- Relative_Compactness, Surface_Area, Wall_Area, Overall_Height, and Glazing_Area dominate.
- Orientation contributes little once other variables are present.
Multicollinearity signal: The aliased Roof_Area plus large but offsetting geometry coefficients hint at inter-dependence among size-related variables. Variance Inflation Factor (VIF) checks (next subsection) are required.
Implications for model tuning
- Consider removing redundant geometry terms or applying regularisation to stabilise coefficients.
- Orientation dummies could be dropped without hurting fit, reducing complexity.
- Residual SE ≈ 2.8 suggests any practical improvements must reduce RMSE by < 10 % to matter.

Overall, the initial model is statistically robust but exhibits geometry-based redundancy that must be managed in subsequent optimization steps.

The model explains ≈ 92 % of the variation in Heating_Load, indicating strong explanatory power. Most predictors are highly significant (p < 0.001) except Orientation levels. The variable Roof_Area was automatically dropped because of perfect multicollinearity (aliased coefficients).

# Initial regression model excluding Cooling_Load
model_hl_0 <- lm(Heating_Load ~ . - Cooling_Load, data = energy)
summary(model_hl_0)

## 
## Call:
## lm(formula = Heating_Load ~ . - Cooling_Load, data = energy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1237 -1.4299 -0.2555  1.2972  7.4342 
## 
## Coefficients: (1 not defined because of singularities)
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 81.161469  18.189282   4.462 9.36e-06 ***
## Relative_Compactness       -64.773432   9.832563  -6.588 8.38e-11 ***
## Surface_Area                -0.087289   0.016317  -5.350 1.17e-07 ***
## Wall_Area                    0.060813   0.006353   9.573  < 2e-16 ***
## Roof_Area                          NA         NA      NA       NA    
## Overall_Height               4.169954   0.322982  12.911  < 2e-16 ***
## Orientation3                 0.067812   0.286185   0.237    0.813    
## Orientation4                -0.052990   0.286185  -0.185    0.853    
## Orientation5                -0.037500   0.286185  -0.131    0.896    
## Glazing_Area                16.848333   0.853238  19.746  < 2e-16 ***
## Glazing_Area_Distribution1   4.527653   0.513717   8.814  < 2e-16 ***
## Glazing_Area_Distribution2   4.435986   0.513717   8.635  < 2e-16 ***
## Glazing_Area_Distribution3   4.183000   0.513717   8.143 1.60e-15 ***
## Glazing_Area_Distribution4   4.388208   0.513717   8.542  < 2e-16 ***
## Glazing_Area_Distribution5   4.182444   0.513717   8.142 1.61e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.804 on 754 degrees of freedom
## Multiple R-squared:  0.9241, Adjusted R-squared:  0.9228 
## F-statistic:   706 on 13 and 754 DF,  p-value: < 2.2e-16

Cooling Load – Initial Model

Metric	Value	Meaning
Residual SD	3.176 kWh m⁻²	Typical prediction error around the regression line.
Multiple R²	0.8904	≈ 89 % of variation in `Cooling_Load` is explained by the predictors.
Adjusted R²	0.8885	Penalises for 13 predictors; explanatory power remains very strong.
F-statistic	471.3 (df = 13, 754)	Overall model is highly significant.
Model p-value	< 0.000 000 000 000 000 22	Rejects the null that all slopes equal zero.

Key Coefficients

(all p < 0.001 unless noted)

Predictor	Estimate	Interpretation
`Relative_Compactness`	−70.79	More compact forms cut cooling demand.
`Surface_Area`	−0.088	Larger envelopes (holding volume constant) raise cooling slightly, but note collinearity with RC.
`Wall_Area`	+0.045	Extra wall exposure increases cooling load.
`Overall_Height`	+4.28	Taller designs raise cooling demand (possibly due to roof/wall ratio changes).
`Glazing_Area`	+13.25	Extensive glazing drives higher solar gains → higher cooling.
`Glazing_Area_Distribution` (levels 1-5)	+1.64 – 2.16	Non-uniform façade layouts add incremental cooling penalties.
`Orientation` levels	ns	No meaningful directional effect under simulated conditions.
`Roof_Area`	Dropped	Perfectly aliased with other geometry terms; removed by R.

Practical Implications

Strong explanatory baseline: 89 % R² indicates geometry and glazing already capture most cooling drivers.
Compactness vs. envelope size: Negative RC coefficient and positive Surface_Area/ Wall_Area coefficients confirm the trade-off: less exposed surface per volume lowers cooling needs.
Glazing dominates: Each 0.1 increase in Glazing_Area (fraction of façade) raises cooling by ≈ 1.3 kWh m⁻², underscoring façade-design leverage.
Orientation negligible: Under equal glazing shares and Greek climate, cardinal direction has limited cooling impact; focus on glazing amount/distribution instead.
Multicollinearity present: Aliased Roof_Area and high correlations among geometric variables call for VIF checks and possible dimensionality reduction before inference or prediction deployment.

Overall, the initial OLS model offers a reliable benchmark but will benefit from targeted feature selection or regularisation to mitigate multicollinearity and refine coefficient interpretability.

The model accounts for ≈ 89 % of the variance in Cooling_Load. Most predictors are significant. Roof_Area was again removed because of aliasing with other geometry-based variables.

# Initial regression model excluding Heating_Load
model_cl_0 <- lm(Cooling_Load ~ . - Heating_Load, data = energy)
summary(model_cl_0)

## 
## Call:
## lm(formula = Cooling_Load ~ . - Heating_Load, data = energy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7082 -1.6996 -0.2814  1.4302 11.0365 
## 
## Coefficients: (1 not defined because of singularities)
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 96.370070  20.604285   4.677 3.45e-06 ***
## Relative_Compactness       -70.787707  11.138038  -6.355 3.59e-10 ***
## Surface_Area                -0.088245   0.018484  -4.774 2.17e-06 ***
## Wall_Area                    0.044682   0.007196   6.209 8.80e-10 ***
## Roof_Area                          NA         NA      NA       NA    
## Overall_Height               4.283843   0.365865  11.709  < 2e-16 ***
## Orientation3                -0.291979   0.324182  -0.901 0.368054    
## Orientation4                -0.124219   0.324182  -0.383 0.701697    
## Orientation5                 0.349115   0.324182   1.077 0.281865    
## Glazing_Area                13.252917   0.966523  13.712  < 2e-16 ***
## Glazing_Area_Distribution1   2.160035   0.581924   3.712 0.000221 ***
## Glazing_Area_Distribution2   1.977396   0.581924   3.398 0.000714 ***
## Glazing_Area_Distribution3   1.639965   0.581924   2.818 0.004956 ** 
## Glazing_Area_Distribution4   1.995660   0.581924   3.429 0.000638 ***
## Glazing_Area_Distribution5   1.695521   0.581924   2.914 0.003678 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.176 on 754 degrees of freedom
## Multiple R-squared:  0.8904, Adjusted R-squared:  0.8885 
## F-statistic: 471.3 on 13 and 754 DF,  p-value: < 2.2e-16

Multicollinearity Assessment (VIF)

Variance Inflation Factors (VIF) were computed for both initial regression models to assess multicollinearity among predictors. VIF values above 5 typically indicate moderate multicollinearity; values above 10 suggest a severe risk of inflated standard errors and unstable coefficients.

The variable Roof_Area was excluded from both models due to perfect collinearity, indicated by NA coefficients and confirmed by singularities in the model summary output.

Multicollinearity Snapshot

Variance Inflation Factor (VIF) diagnostics reveal that most predictors are stable, but the core geometry pair Relative_Compactness and Surface_Area exhibits severe inflation—signaling redundant information that must be addressed in model refinement (Belsley et al., 2005).

Predictor	GVIF	df	GVIF^{1 / (2·df)}	Interpretation
`Relative_Compactness`	105.524	1	10.27	Severe multicollinearity
`Surface_Area`	201.531	1	14.20	Severe multicollinearity
`Wall_Area`	7.492	1	2.74	Acceptable
`Overall_Height`	31.206	1	5.59	Moderate multicollinearity
`Orientation`	1.000	3	1.00	Negligible
`Glazing_Area`	1.264	1	1.12	Negligible
`Glazing_Area_Distribution`	1.264	5	1.02	Negligible

The same GVIF profile appears in the Cooling_Load model.

What the Numbers Mean

GVIF — Generalized variance-inflation factor.
GVIF^{1 / (2·df)} — Scaled metric retaining familiar thresholds (≈ 5 = moderate, ≈ 10 = severe) for multi-level factors.

Key Take-aways

Finding	Implication for Modeling
`Relative_Compactness` and `Surface_Area` both exceed the 10 threshold	Standard errors are inflated; coefficient signs or magnitudes may swing in alternative models. One variable should be dropped, combined, or handled via regularization.
`Overall_Height` shows a moderate VIF (~ 5.6)	Multicollinearity is tolerable but should be monitored during refinement.
All other predictors are below concern thresholds	No corrective action needed.

Because Relative_Compactness is mathematically derived from Surface_Area (constant building volume in the simulations), the two variables encode the same geometric information. Retaining both neither improves predictive power nor model stability; one should be removed or a composite geometry index should be used.

model_hl_vif <- lm(
  Heating_Load ~ Relative_Compactness + Surface_Area + Wall_Area +
                 Overall_Height + Orientation + Glazing_Area +
                 Glazing_Area_Distribution,
  data = energy
)

# Refit model without Roof_Area for Cooling_Load
model_cl_vif <- lm(
  Cooling_Load ~ Relative_Compactness + Surface_Area + Wall_Area +
                 Overall_Height + Orientation + Glazing_Area +
                 Glazing_Area_Distribution,
  data = energy
)

# Compute VIF values
vif(model_hl_vif)

##                                 GVIF Df GVIF^(1/(2*Df))
## Relative_Compactness      105.524054  1       10.272490
## Surface_Area              201.531134  1       14.196166
## Wall_Area                   7.492984  1        2.737332
## Overall_Height             31.205474  1        5.586186
## Orientation                 1.000000  3        1.000000
## Glazing_Area                1.260417  1        1.122683
## Glazing_Area_Distribution   1.260417  5        1.023414

vif(model_cl_vif)

##                                 GVIF Df GVIF^(1/(2*Df))
## Relative_Compactness      105.524054  1       10.272490
## Surface_Area              201.531134  1       14.196166
## Wall_Area                   7.492984  1        2.737332
## Overall_Height             31.205474  1        5.586186
## Orientation                 1.000000  3        1.000000
## Glazing_Area                1.260417  1        1.122683
## Glazing_Area_Distribution   1.260417  5        1.023414

# Extract GVIF data for Heating_Load model (already refit without Roof_Area)
gvif_hl <- as_tibble(vif(model_hl_vif), rownames = "Variable") %>% 
  rename(GVIF = `GVIF`, Df = `Df`)
gvif_hl <- gvif_hl %>% 
  mutate(GVIF_adj = GVIF^(1/(2*Df)))

ggplot(gvif_hl, aes(reorder(Variable, GVIF_adj), GVIF_adj)) +
  geom_col(fill = "#1f77b4") +
  geom_hline(yintercept = 5,   linetype = "dashed", colour = "orange") +
  geom_hline(yintercept = 10,  linetype = "dashed", colour = "red") +
  coord_flip() +
  labs(title = "Adjusted GVIF (Heating-Load Model)",
       y = "GVIF^(1 / (2·Df))", x = NULL,
       caption = "Dashed lines at 5 and 10 mark moderate and severe multicollinearity thresholds.")

# Extract GVIF data for Cooling-Load model (already refit without Roof_Area)
gvif_cl <- as_tibble(vif(model_cl_vif), rownames = "Variable") %>% 
  rename(GVIF = `GVIF`, Df = `Df`) %>% 
  mutate(GVIF_adj = GVIF^(1 / (2 * Df)))

ggplot(gvif_cl, aes(reorder(Variable, GVIF_adj), GVIF_adj)) +
  geom_col(fill = "#1f77b4") +
  geom_hline(yintercept = 5,  linetype = "dashed", colour = "orange") +
  geom_hline(yintercept = 10, linetype = "dashed", colour = "red") +
  coord_flip() +
  labs(
    title    = "Adjusted GVIF (Cooling-Load Model)",
    y        = "GVIF^(1 / (2·Df))",
    x        = NULL,
    caption  = "Dashed lines at 5 and 10 mark moderate and severe multicollinearity thresholds."
  ) +
  theme_minimal()

5. Model Optimization, Refinement, and Validation

This section presents a systematic optimization, refinement, and validation of the multiple regression model for predicting the energy efficiency of buildings. The objective is to enhance predictive accuracy and stability while addressing multicollinearity and interpretability concerns.

Model Optimization

Model optimization involves systematically experimenting with predictor variables to improve model performance, taking into consideration statistical measures such as the F-statistic, degrees of freedom, and adjusted R². The objective is to identify the most effective combination of predictors that balances complexity and explanatory power without overfitting.

Experimentation & Model Adjustment – Heating Load

The process begins with a comprehensive initial regression model, including all available predictors. Cooling_Load is excluded because it is a separate response variable and not an appropriate predictor when modeling Heating_Load. Roof_Area is excluded due to perfect multicollinearity with other geometry variables.

Summary of Model Outputs

Initial Model

Very strong fit: Adjusted R² = 0.9228, indicating that ~92% of the variance in Heating_Load is explained by the model.
Key predictors:
Relative_Compactness, Surface_Area, Wall_Area, Overall_Height, and Glazing_Area are all highly significant.
Non-significant variables:
Orientation variables are not significant (p > 0.05), suggesting limited influence.
Multicollinearity concern:
Surface_Area is strongly correlated with Relative_Compactness (as seen in VIFs, not shown here), which may inflate standard errors.

Optimized Model

Variable reduction:
Removed Surface_Area and Orientation to reduce multicollinearity and improve interpretability.
Model strength retained:
Adjusted R² remains high at 0.9202, only a slight drop, showing the model still explains most of the variance.
Significance:
All remaining predictors are highly significant, with Glazing_Area and its distributions showing strong effects on Heating_Load.
Model simplicity trade-off:
Residual Standard Error (RSE) increases slightly from 2.804 to 2.85, which is acceptable considering reduced model complexity.

ANOVA Comparison

Significance:
ANOVA shows a statistically significant difference between the models (p = 0.0000107), meaning the full model fits marginally better.
Justification for simplification:
For practical purposes, the optimized model provides nearly equivalent performance with fewer predictors, justifying its use.

# Initial full model
initial_model <- lm(Heating_Load ~ . - Cooling_Load - Roof_Area, data = energy)
summary(initial_model)

## 
## Call:
## lm(formula = Heating_Load ~ . - Cooling_Load - Roof_Area, data = energy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1237 -1.4299 -0.2555  1.2972  7.4342 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 81.161469  18.189282   4.462 9.36e-06 ***
## Relative_Compactness       -64.773432   9.832563  -6.588 8.38e-11 ***
## Surface_Area                -0.087289   0.016317  -5.350 1.17e-07 ***
## Wall_Area                    0.060813   0.006353   9.573  < 2e-16 ***
## Overall_Height               4.169954   0.322982  12.911  < 2e-16 ***
## Orientation3                 0.067812   0.286185   0.237    0.813    
## Orientation4                -0.052990   0.286185  -0.185    0.853    
## Orientation5                -0.037500   0.286185  -0.131    0.896    
## Glazing_Area                16.848333   0.853238  19.746  < 2e-16 ***
## Glazing_Area_Distribution1   4.527653   0.513717   8.814  < 2e-16 ***
## Glazing_Area_Distribution2   4.435986   0.513717   8.635  < 2e-16 ***
## Glazing_Area_Distribution3   4.183000   0.513717   8.143 1.60e-15 ***
## Glazing_Area_Distribution4   4.388208   0.513717   8.542  < 2e-16 ***
## Glazing_Area_Distribution5   4.182444   0.513717   8.142 1.61e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.804 on 754 degrees of freedom
## Multiple R-squared:  0.9241, Adjusted R-squared:  0.9228 
## F-statistic:   706 on 13 and 754 DF,  p-value: < 2.2e-16

Due to multicollinearity identified previously, the model was refined by removing highly correlated predictors such as Surface_Area, which exhibited high variance inflation factor (VIF) values. This optimization step reduces redundancy among predictors:

# Optimized model after removing multicollinear predictor Surface_Area
optimized_model <- lm(Heating_Load ~ Relative_Compactness + Wall_Area + 
                        Overall_Height + Glazing_Area + Glazing_Area_Distribution,
                      data = energy)
summary(optimized_model)

## 
## Call:
## lm(formula = Heating_Load ~ Relative_Compactness + Wall_Area + 
##     Overall_Height + Glazing_Area + Glazing_Area_Distribution, 
##     data = energy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3068 -1.5588  0.0232  1.4189  7.3450 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -15.183790   2.589482  -5.864 6.76e-09 ***
## Relative_Compactness       -14.532402   2.958472  -4.912 1.10e-06 ***
## Wall_Area                    0.034976   0.004194   8.340 3.49e-16 ***
## Overall_Height               5.606753   0.182300  30.756  < 2e-16 ***
## Glazing_Area                16.848333   0.867099  19.431  < 2e-16 ***
## Glazing_Area_Distribution1   4.527653   0.522063   8.673  < 2e-16 ***
## Glazing_Area_Distribution2   4.435986   0.522063   8.497  < 2e-16 ***
## Glazing_Area_Distribution3   4.183000   0.522063   8.012 4.24e-15 ***
## Glazing_Area_Distribution4   4.388208   0.522063   8.406  < 2e-16 ***
## Glazing_Area_Distribution5   4.182444   0.522063   8.011 4.28e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.85 on 758 degrees of freedom
## Multiple R-squared:  0.9212, Adjusted R-squared:  0.9202 
## F-statistic: 984.3 on 9 and 758 DF,  p-value: < 2.2e-16

To assess the improvement, an ANOVA test was performed comparing the initial and optimized models:

# Comparing models using ANOVA
anova(initial_model, optimized_model)

## Analysis of Variance Table
## 
## Model 1: Heating_Load ~ (Relative_Compactness + Surface_Area + Wall_Area + 
##     Roof_Area + Overall_Height + Orientation + Glazing_Area + 
##     Glazing_Area_Distribution + Cooling_Load) - Cooling_Load - 
##     Roof_Area
## Model 2: Heating_Load ~ Relative_Compactness + Wall_Area + Overall_Height + 
##     Glazing_Area + Glazing_Area_Distribution
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1    754 5928.4                                 
## 2    758 6155.0 -4   -226.67 7.2073 1.07e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Data Visualization for Heating_Load

The plot below compares estimated coefficients from the initial and optimized Heating_Load models. As with the heating models, this visualization highlights how simplification and multicollinearity reduction influence variable contributions and stability.

# Extract tidy summaries
initial_tidy <- tidy(initial_model) %>%
  filter(term != "(Intercept)") %>%
  mutate(model = "Initial")

optimized_tidy <- tidy(optimized_model) %>%
  filter(term != "(Intercept)") %>%
  mutate(model = "Optimized")

# Combine both models
coef_data <- bind_rows(initial_tidy, optimized_tidy)

# Plot
ggplot(coef_data, aes(x = term, y = estimate, color = model)) +
  geom_point(position = position_dodge(width = 0.5), size = 3) +
  geom_errorbar(aes(ymin = estimate - std.error, ymax = estimate + std.error),
                width = 0.2, position = position_dodge(width = 0.5)) +
  coord_flip() +
  labs(title = "Coefficient Comparison: Initial vs. Optimized Heating Load Models",
       x = "Predictor", y = "Estimate (± Std. Error)") +
  theme_minimal()

Experimentation & Model Adjustment – Cooling_Load

The same optimization logic was applied to Cooling_Load, beginning with a comprehensive initial model, then moving to the optimized model analysis, and as with the Heating_Load model, performing an ANOVA.

Summary of Model Outputs – Cooling_Load

Initial Model

Strong fit: Adjusted R² = 0.8885 (Multiple R² = 0.8904), explaining about 89 % of variance in Cooling_Load.
Key predictors:
Relative_Compactness, Surface_Area, Wall_Area, Overall_Height, and Glazing_Area are all statistically significant.
Non-significant variables:
Orientation dummies show no meaningful contribution (p > 0.05).
Multicollinearity concern:
Surface_Area is highly collinear with Relative_Compactness, potentially distorting coefficients.

Optimized Model

Variable reduction:
Removed Surface_Area and Orientation to alleviate collinearity and improve interpretability.
Performance retained:
Adjusted R² = 0.8851 (Multiple R² = 0.8865) — only a minimal drop, so the model still explains nearly 89 % of the variance.
Significance maintained:
All remaining predictors stay highly significant; Glazing_Area and Overall_Height remain strong positive drivers of cooling demand.
Complexity trade-off:
Residual Standard Error rises slightly from 3.176 to 3.224, an acceptable cost for a clearer model.

ANOVA Comparison

Statistical test:
ANOVA shows a statistically significant difference between models (p = 0.0000243), favoring the full model.
Practical decision:
Given the tiny loss in Adjusted R² and better multicollinearity diagnostics, the optimized model is preferred in practice.

# Initial full model for Cooling_Load
initial_model_cl <- lm(Cooling_Load ~ . - Heating_Load - Roof_Area, data = energy)
summary(initial_model_cl)

## 
## Call:
## lm(formula = Cooling_Load ~ . - Heating_Load - Roof_Area, data = energy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7082 -1.6996 -0.2814  1.4302 11.0365 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 96.370070  20.604285   4.677 3.45e-06 ***
## Relative_Compactness       -70.787707  11.138038  -6.355 3.59e-10 ***
## Surface_Area                -0.088245   0.018484  -4.774 2.17e-06 ***
## Wall_Area                    0.044682   0.007196   6.209 8.80e-10 ***
## Overall_Height               4.283843   0.365865  11.709  < 2e-16 ***
## Orientation3                -0.291979   0.324182  -0.901 0.368054    
## Orientation4                -0.124219   0.324182  -0.383 0.701697    
## Orientation5                 0.349115   0.324182   1.077 0.281865    
## Glazing_Area                13.252917   0.966523  13.712  < 2e-16 ***
## Glazing_Area_Distribution1   2.160035   0.581924   3.712 0.000221 ***
## Glazing_Area_Distribution2   1.977396   0.581924   3.398 0.000714 ***
## Glazing_Area_Distribution3   1.639965   0.581924   2.818 0.004956 ** 
## Glazing_Area_Distribution4   1.995660   0.581924   3.429 0.000638 ***
## Glazing_Area_Distribution5   1.695521   0.581924   2.914 0.003678 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.176 on 754 degrees of freedom
## Multiple R-squared:  0.8904, Adjusted R-squared:  0.8885 
## F-statistic: 471.3 on 13 and 754 DF,  p-value: < 2.2e-16

To improve model interpretability and reduce multicollinearity, Surface_Area and Orientation were removed in the refined version:

# Optimized model for Cooling_Load
optimized_model_cl <- lm(Cooling_Load ~ Relative_Compactness + Wall_Area + 
                           Overall_Height + Glazing_Area + Glazing_Area_Distribution,
                         data = energy)
summary(optimized_model_cl)

## 
## Call:
## lm(formula = Cooling_Load ~ Relative_Compactness + Wall_Area + 
##     Overall_Height + Glazing_Area + Glazing_Area_Distribution, 
##     data = energy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9770 -1.5473  0.0929  1.4291 11.0683 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -1.040951   2.929873  -0.355 0.722472    
## Relative_Compactness       -19.996672   3.347367  -5.974 3.56e-09 ***
## Wall_Area                    0.018562   0.004745   3.912 9.98e-05 ***
## Overall_Height               5.736372   0.206264  27.811  < 2e-16 ***
## Glazing_Area                13.252917   0.981081  13.508  < 2e-16 ***
## Glazing_Area_Distribution1   2.160035   0.590689   3.657 0.000273 ***
## Glazing_Area_Distribution2   1.977396   0.590689   3.348 0.000855 ***
## Glazing_Area_Distribution3   1.639965   0.590689   2.776 0.005633 ** 
## Glazing_Area_Distribution4   1.995660   0.590689   3.379 0.000766 ***
## Glazing_Area_Distribution5   1.695521   0.590689   2.870 0.004214 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.224 on 758 degrees of freedom
## Multiple R-squared:  0.8865, Adjusted R-squared:  0.8851 
## F-statistic: 657.7 on 9 and 758 DF,  p-value: < 2.2e-16

ANOVA was used to compare the two models.

# Comparing Cooling_Load models using ANOVA
anova(initial_model_cl, optimized_model_cl)

## Analysis of Variance Table
## 
## Model 1: Cooling_Load ~ (Relative_Compactness + Surface_Area + Wall_Area + 
##     Roof_Area + Overall_Height + Orientation + Glazing_Area + 
##     Glazing_Area_Distribution + Heating_Load) - Heating_Load - 
##     Roof_Area
## Model 2: Cooling_Load ~ Relative_Compactness + Wall_Area + Overall_Height + 
##     Glazing_Area + Glazing_Area_Distribution
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    754 7607.1                                  
## 2    758 7879.6 -4   -272.47 6.7518 2.429e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Data Visualization for Cooling_Load

The plot below compares estimated coefficients from the initial and optimized Cooling_Load models. As with the heating models, this visualization highlights how simplification and multicollinearity reduction influence variable contributions and stability.

initial_tidy_cl <- tidy(initial_model_cl) %>%
  filter(term != "(Intercept)") %>%
  mutate(model = "Initial")

optimized_tidy_cl <- tidy(optimized_model_cl) %>%
  filter(term != "(Intercept)") %>%
  mutate(model = "Optimized")

# Combine both Cooling_Load models
coef_data_cl <- bind_rows(initial_tidy_cl, optimized_tidy_cl)

# Plot
ggplot(coef_data_cl, aes(x = term, y = estimate, color = model)) +
  geom_point(position = position_dodge(width = 0.5), size = 3) +
  geom_errorbar(aes(ymin = estimate - std.error, ymax = estimate + std.error),
                width = 0.2, position = position_dodge(width = 0.5)) +
  coord_flip() +
  labs(
    title = "Coefficient Comparison: Initial vs. Optimized Cooling Load Models",
    x = "Predictor",
    y = "Estimate (± Std. Error)"
  ) +
  theme_minimal()

Model Refinement and Validation

This section evaluates the validity of the optimized regression models for Heating_Load and Cooling_Load by examining residual diagnostics and testing for multicollinearity using VIF metrics.

Residual Analysis Heating_Load

The four diagnostic panels assess whether the optimized Heating _Load regression meets key linear‐model assumptions.

Residuals vs Fitted shows a curved pattern instead of a random scatter, indicating some remaining non-linearity or mild heteroscedasticity (variance increases as fitted values rise).
Q–Q plot departs from the 45° line in both tails, confirming non-normal residuals; the Shapiro–Wilk test corroborates this (W = 0.964, p ≈ 8.6 × 10⁻¹³).
Scale–Location (spread-location) also reveals a slight upward trend, reinforcing evidence of non-constant variance across fitted values.
Residuals vs Leverage identifies a few moderate-leverage points, but none exceed Cook’s distance threshold, so no single observation is unduly influencing the model.

Overall, the optimized model retains its strong explanatory power but still shows modest violations of normality and homoscedasticity. These issues are common with large samples and may be mitigated by using robust standard errors or transforming predictors; however, no influential outliers threaten coefficient stability.

# Residual plots for optimized Heating_Load model
par(mfrow = c(2,2))
plot(optimized_model)

# Shapiro-Wilk test for Heating_Load residuals
shapiro.test(residuals(optimized_model))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(optimized_model)
## W = 0.96406, p-value = 8.556e-13

Residual Analysis Cooling_Load

The diagnostic plots for the optimized Cooling _Load model echo the findings from the heating analysis but with slightly more pronounced variance issues.

Residuals vs Fitted reveals a curved, funnel-like pattern—residual spread widens as fitted Cooling _Load rises—signalling heteroscedasticity and some remaining non-linearity.
Q–Q plot departs markedly from the 45° line in both tails; the Shapiro–Wilk test (W = 0.944, p < 2 × 10⁻¹⁶) confirms a strong deviation from normality.
Scale–Location shows residual variance increasing with fitted values, reinforcing heteroscedasticity concerns.
Residuals vs Leverage identifies a handful of higher-leverage observations, but none cross Cook’s-distance lines, so no single point unduly drives the model.

The Cooling _Load model retains high explanatory power but exhibits clear non-constant variance and non-normal residuals. Remedies could include transforming Cooling _Load (e.g., log or Box-Cox), adding interaction or polynomial terms, or using robust (heteroscedasticity-consistent) standard errors. Importantly, no influential outliers threaten stability, so coefficient estimates remain reliable once robust inference methods are applied.

# Residual plots for optimized Cooling_Load model
par(mfrow = c(2,2))
plot(optimized_model_cl)

# Shapiro-Wilk test for Cooling_Load residuals
shapiro.test(residuals(optimized_model_cl))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(optimized_model_cl)
## W = 0.94417, p-value < 2.2e-16

par(mfrow = c(2,2))

# Heating Load: Initial
plot(initial_model$fitted.values, resid(initial_model),
     main = "Heating Load: Residuals vs Fitted (Initial)",
     xlab = "Fitted values", ylab = "Residuals", pch = 20, col = "gray40")
abline(h = 0, col = "red")

# Heating Load: Optimized
plot(optimized_model$fitted.values, resid(optimized_model),
     main = "Heating Load: Residuals vs Fitted (Optimized)",
     xlab = "Fitted values", ylab = "Residuals", pch = 20, col = "blue")
abline(h = 0, col = "red")

# Cooling Load: Initial
plot(initial_model_cl$fitted.values, resid(initial_model_cl),
     main = "Cooling Load: Residuals vs Fitted (Initial)",
     xlab = "Fitted values", ylab = "Residuals", pch = 20, col = "gray40")
abline(h = 0, col = "red")

# Cooling Load: Optimized
plot(optimized_model_cl$fitted.values, resid(optimized_model_cl),
     main = "Cooling Load: Residuals vs Fitted (Optimized)",
     xlab = "Fitted values", ylab = "Residuals", pch = 20, col = "blue")
abline(h = 0, col = "red")

Diagnostic Tests

The Variance Inflation Factor (VIF) diagnostics for both optimized models indicate acceptable multicollinearity levels. While Relative_Compactness and Overall_Height exhibit moderately elevated VIF values (around 9.25), their adjusted GVIFs (~3.04 and ~3.10, respectively) remain below the commonly used threshold of 5. This suggests that multicollinearity is present but not severe enough to warrant removal of these variables. All other predictors, including Wall_Area, Glazing_Area, and Glazing_Area_Distribution, show low adjusted GVIF values near 1, confirming minimal collinearity. Overall, the predictor set in both models is stable, and coefficient estimates can be interpreted with confidence.

# VIF for optimized Heating_Load model
library(car)
vif(optimized_model)

##                               GVIF Df GVIF^(1/(2*Df))
## Relative_Compactness      9.250283  1        3.041428
## Wall_Area                 3.161934  1        1.778183
## Overall_Height            9.626103  1        3.102596
## Glazing_Area              1.260417  1        1.122683
## Glazing_Area_Distribution 1.260417  5        1.023414

# VIF for optimized Cooling_Load model
vif(optimized_model_cl)

##                               GVIF Df GVIF^(1/(2*Df))
## Relative_Compactness      9.250283  1        3.041428
## Wall_Area                 3.161934  1        1.778183
## Overall_Height            9.626103  1        3.102596
## Glazing_Area              1.260417  1        1.122683
## Glazing_Area_Distribution 1.260417  5        1.023414

6. Summary of Modeling Energy Efficiency via Heating and Cooling Loads

Energy efficiency in buildings is expressed through two complementary thermal performance outcomes: Heating_Load and Cooling_Load. These two response variables are modeled separately due to their distinct seasonal mechanisms but are jointly necessary to understand total energy efficiency.

All predictor variables considered in this analysis—geometry, glazing, orientation—were evaluated based on their ability to explain either or both of these energy outcomes. A predictor that influences only one load (e.g., glazing may strongly affect cooling but not heating) still holds value in the broader goal of modeling energy efficiency.

Separate models enable a nuanced understanding of each load component while aligning with the overall project goal: identifying which physical design features most influence energy consumption in residential buildings.

Modeling Heating and Cooling Load Separately

Although both Heating_Load and Cooling_Load are indicators of energy efficiency, each reflects distinct thermal dynamics and design considerations. Heating load is influenced primarily by heat loss in colder conditions, while cooling load is driven by heat gain during warmer periods. These processes often respond differently to the same predictor variables—such as glazing, which can increase solar gains (raising cooling demand) while also reducing heating needs.

Separate models support climate-specific architectural strategies and enable performance evaluation without introducing causal feedback between response variables. This approach avoids using one outcome (Cooling_Load) to predict another (Heating_Load), which would compromise model validity and generalizability. Therefore, Heating_Load and Cooling_Load are modeled independently, using identical predictor sets derived from building design features.

Initial Predictor Selection Reasoning

Predictors were initially selected based on literature from Tsanas & Xifara (2012), which highlighted Relative_Compactness and Overall_Height as primary determinants of heating energy load. Additionally, Glazing_Area and Glazing_Area_Distribution were chosen due to their recognized impact on thermal efficiency.

Steps Undertaken to Refine the Model

During model refinement, the following steps were taken:

Removed Surface_Area due to severe multicollinearity with Relative_Compactness (high VIF)
Excluded Orientation, which consistently showed low statistical significance

Rationale for Predictor Changes

Removal of multicollinear variables improved model stability, interpretability, and robustness. The exclusion of redundant variables slightly reduced degrees of freedom but significantly enhanced model clarity and reduced the risk of inflated standard errors.

Optimal Model Explanation

The final model represents an optimal solution that balances predictive accuracy and model complexity. The carefully selected subset of predictors provides strong explanatory power, demonstrated by a high adjusted R² and significant F-statistic, while maintaining simplicity and interpretability.

Regression Equation and Metrics

Metric	Heating Load – Initial Model	Heating Load – Optimized Model	Cooling Load – Initial Model	Cooling Load – Optimized Model
Residual SE	2.804	2.85	3.176	3.20
Adjusted R²	0.9228	0.9202	0.8885	0.8864
R²	0.9241	0.9212	0.8904	0.8883
F-statistic	706 on 13 and 754 df	984.3 on 9 and 758 df	471.3 on 13 and 754 df	716.9 on 9 and 758 df
Model p-value	0.00000000000000022	0.00000000000000022	0.00000000000000022	0.00000000000000022
Notes	Roof_Area dropped; Orientation not significant	Surface_Area and Orientation removed to improve multicollinearity	Roof_Area dropped; Orientation not significant	Surface_Area and Orientation removed to improve multicollinearity

The comparison demonstrates that the optimized Heating_Load model retains strong predictive power while reducing complexity and multicollinearity. An optimized Cooling_Load model is also constructed to mirror the refinement applied to Heating_Load, supporting a consistent and complete analysis of energy efficiency.

The optimized regression equation derived from the final selected predictors is as follows:

Heating_Load = Intercept + β₁(Relative_Compactness) + β₂(Wall_Area) + β₃(Overall_Height) + β₄(Glazing_Area) + β(Glazing_Area_Distribution)
Cooling_Load = Intercept + β₁(Relative_Compactness) + β₂(Wall_Area) + β₃(Overall_Height) + β₄(Glazing_Area) + β(Glazing_Area_Distribution)

Key metrics obtained from the optimized model include:

Adjusted R² (indicating model explanatory power adjusted for the number of predictors)
F-statistic (indicating overall model significance)
Degrees of freedom (reflecting the complexity and parameterization of the model)

7. Reflection on Optimization Process

Reflecting on the optimization process of a multiple regression model, several key metrics guided the approach to refinement. Degrees of freedom were considered to ensure that model complexity remained manageable, allowing for valid statistical inference without exhausting the available sample size. A model with fewer predictors and more residual degrees of freedom is more robust to sampling variability and avoids overfitting, especially when generalizing to new data.

The F-statistic was used to assess the joint explanatory power of the predictors. A large and statistically significant F-statistic justified retaining the model structure after simplification. In practice, if removing a variable caused a noticeable drop in the F-statistic or adjusted R², the change was reconsidered. This metric supported iterative decisions about which predictors enhanced model strength versus those that added complexity without sufficient benefit.

The number of predictors shaped the overall strategy by requiring a balance between explanatory power and parsimony. Each additional predictor increases model complexity and decreases degrees of freedom. Redundant or highly collinear variables—such as Surface_Area—were eliminated after variance inflation factor (VIF) diagnostics flagged instability. This step improved model interpretability while maintaining predictive performance.

This process was not purely mechanical; it required weighing statistical diagnostics against theoretical considerations from prior literature and domain knowledge. Ultimately, model refinement was guided by a desire for simplicity, statistical rigor, and practical insight into energy efficiency. These principles align with best practices described by Harrell (2015) and James et al. (2021), emphasizing clarity, reproducibility, and resistance to overfitting.

8. References

Aiken, L. S., West, S. G., & Reno, R. R. (1991). Multiple regression: Testing and interpreting interactions. Sage Publications.

Belsley, D. A., Kuh, E., & Welsch, R. E. (2005). Regression diagnostics: Identifying influential data and sources of collinearity. John Wiley & Sons.

Crawley, D. B., Hand, J. W., Kummert, M., & Griffith, B. T. (2008). Contrasting the capabilities of building energy performance simulation programs. Building and Environment, 43(4), 661-673. https://doi.org/10.1016/j.buildenv.2006.10.027 Fox, J., & Weisberg, S. (2019). An R companion to applied regression (3rd ed.). Sage Publications.

Dormann, C. F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., García Marquéz, J. R., Gruber, B., Lafourcade, B., Leitão, P. J., Münkemüller, T., McClean, C., Osborne, P. E., Reineking, B., Schröder, B., Skidmore, A. K., Zurell, D., & Lautenbach, S. (2013). Collinearity: A review of methods to deal with it and a simulation study evaluating their performance. Ecography, 36(1), 27-46. https://doi.org/10.1111/j.1600-0587.2012.07348.x

Harrell, F. E. (2015). Regression modeling strategies: With applications to linear models, logistic and ordinal regression, and survival analysis (2nd ed.). Springer. https://doi.org/10.1007/978-3-319-19425-7

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning: With applications in R (2nd ed.). Springer. https://doi.org/10.1007/978-1-0716-1418-1

Tsanas, A., & Xifara, A. (2012). Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools. Energy and Buildings, 49, 560–567. https://doi.org/10.1016/j.enbuild.2012.03.003

UCI Machine Learning Repository. (2012). Energy efficiency [Data set]. University of California, Irvine. https://archive.ics.uci.edu/dataset/242/energy+efficiency

Optimizing Multiple Regression Models for Predicting Heating and Cooling Loads in Residential Buildings: A Statistical Analysis of Energy Efficiency

Kat Campise

2025-08-07

1. Introduction

2. Data Exploration and Pre‑processing

Data Import and Observations

Dataset Snapshot

Output from glimpse() Observations

Renaming Columns

Handling Missing Values, Outliers & Data Issues

Missing Values

Variable Typing

Outlier Inspection

Summary

3. Correlation Analysis

Correlation Matrix & Heatmap

Methodology

Observations

Implications

Scatterplot Matrix

Observations

Implications

4. Initial Multiple Regression Models

Heating Load – Initial Model

Practical Meaning

Cooling Load – Initial Model

Key Coefficients

Practical Implications

Multicollinearity Assessment (VIF)

Multicollinearity Snapshot

What the Numbers Mean

Key Take-aways

5. Model Optimization, Refinement, and Validation

Model Optimization

Experimentation & Model Adjustment – Heating Load

Summary of Model Outputs

Initial Model

Optimized Model

ANOVA Comparison

Data Visualization for Heating_Load

Experimentation & Model Adjustment – Cooling_Load

Summary of Model Outputs – Cooling_Load

Initial Model

Optimized Model

ANOVA Comparison

Data Visualization for Cooling_Load

Model Refinement and Validation

Residual Analysis Heating_Load

Residual Analysis Cooling_Load

Diagnostic Tests

6. Summary of Modeling Energy Efficiency via Heating and Cooling Loads

Modeling Heating and Cooling Load Separately

Initial Predictor Selection Reasoning

Steps Undertaken to Refine the Model

Rationale for Predictor Changes

Optimal Model Explanation

Regression Equation and Metrics

7. Reflection on Optimization Process

8. References

Output from `glimpse()` Observations