Residential and Commercial Energy Cost Prediction Modelling
Load Library
# Load libraries
library(tidyverse)
library(caret)
library(randomForest)
library(xgboost)
Import Data
energy_consumption <- read.csv("~/R-Visualization/MINI_RESEARCH/Energy Consumption/energy_consumption.csv")
Initial EDA
# View data structure and summary
str(energy_consumption)
'data.frame': 5000 obs. of 6 variables:
$ customer_id : chr "CUSTOMER_0001" "CUSTOMER_0002" "CUSTOMER_0003" "CUSTOMER_0004" ...
$ customer_type : chr "residential" "commercial" "commercial" "residential" ...
$ regions : chr "Northeast" "Midwest" "Southeast" "Northeast" ...
$ building_size_m2: int 24 24 24 45 45 52 17 45 45 45 ...
$ occupants : int 2 1 1 4 4 2 3 3 2 4 ...
$ energy_cost_brl : num 64.5 55.3 74.5 147.1 143.1 ...
summary(energy_consumption)
customer_id customer_type regions building_size_m2
Length:5000 Length:5000 Length:5000 Min. :17.00
Class :character Class :character Class :character 1st Qu.:24.00
Mode :character Mode :character Mode :character Median :45.00
Mean :39.58
3rd Qu.:45.00
Max. :77.00
occupants energy_cost_brl
Min. :1.000 Min. : 52.52
1st Qu.:1.000 1st Qu.: 68.56
Median :2.000 Median : 83.72
Mean :2.302 Mean : 86.87
3rd Qu.:3.000 3rd Qu.: 98.24
Max. :4.000 Max. :158.61
# Check missing values
colSums(is.na(energy_consumption)) # No missing values, so we proceed normally
customer_id customer_type regions building_size_m2
0 0 0 0
occupants energy_cost_brl
0 0
# Distribution of energy cost
ggplot(energy_consumption, aes(x = energy_cost_brl)) +
geom_histogram(fill = "steelblue", bins = 30, alpha = 0.7) +
labs(title = "Distribution of Monthly Energy Costs", x = "Energy Cost (BRL)", y = "Count")
# Average cost by customer type
ggplot(energy_consumption, aes(x = customer_type, y = energy_cost_brl, fill = customer_type)) +
geom_boxplot() +
labs(title = "Energy Cost by Customer Type", x = "", y = "Energy Cost (BRL)")
# Relationship between building size and cost
ggplot(energy_consumption, aes(x = building_size_m2, y = energy_cost_brl, color = customer_type)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Energy Cost vs Building Size", x = "Building Size (m²)", y = "Energy Cost (BRL)")
`geom_smooth()` using formula = 'y ~ x'
# Regional comparison
ggplot(energy_consumption, aes(x = regions, y = energy_cost_brl, fill = regions)) +
geom_boxplot() +
labs(title = "Regional Energy Cost Distribution", x = "Region", y = "Energy Cost (BRL)") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Advanced Analysis
(Correlation Analysis)
building_size_m2 occupants energy_cost_brl
building_size_m2 1.0000000 0.1825219 0.1959211
occupants 0.1825219 1.0000000 0.5357968
energy_cost_brl 0.1959211 0.5357968 1.0000000
Descriptive Correlation Analysis
The Pearson correlation analysis (Table 1 below) shows a moderate positive association between energy cost and occupants (r = 0.54) and a weak positive association between energy cost and building size (r = 0.20). Energy expenditure therefore tends to rise with both occupancy and building size, but occupancy is by far the stronger correlate of monthly energy expenses. Building size and occupants are themselves only weakly correlated (r = 0.18).
| Variable | building_size_m2 | occupants | energy_cost_brl |
|---|---|---|---|
| building_size_m2 | 1.000 | 0.183 | 0.196 |
| occupants | 0.183 | 1.000 | 0.536 |
| energy_cost_brl | 0.196 | 0.536 | 1.000 |
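The code that produced the correlation matrix is not shown in the rendered output. A minimal sketch of how such a matrix could be computed is given below. The data frame `toy` is a hypothetical six-row stand-in with made-up values, not the real dataset, so its correlations will not match Table 1.

```r
# Hypothetical stand-in data: six made-up rows, NOT taken from the real dataset
toy <- data.frame(
  building_size_m2 = c(24, 24, 45, 45, 52, 17),
  occupants        = c(2, 1, 4, 4, 2, 3),
  energy_cost_brl  = c(64.5, 55.3, 147.1, 143.1, 90.0, 101.2)
)

# Pearson correlation matrix over the three numeric columns
cor_mat <- cor(toy, method = "pearson")

# Round to three decimals, the precision used in Table 1
print(round(cor_mat, 3))
```

On the real data, the same call would presumably be applied to the three numeric columns of `energy_consumption`.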
Predictive Modeling
(Using linear regression to predict energy cost)
Call:
lm(formula = energy_cost_brl ~ building_size_m2 + occupants +
customer_type + regions, data = energy_consumption)
Residuals:
Min 1Q Median 3Q Max
-47.787 -14.767 -0.153 14.354 50.751
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 52.62575 1.03114 51.036 <2e-16 ***
building_size_m2 0.14181 0.01682 8.432 <2e-16 ***
occupants 12.21506 0.28504 42.853 <2e-16 ***
customer_typeresidential -0.09863 0.60767 -0.162 0.871
regionsNorth 0.75953 0.87161 0.871 0.384
regionsNortheast 1.19526 0.73395 1.629 0.103
regionsSouth 0.27625 1.03730 0.266 0.790
regionsSoutheast 1.44400 1.03802 1.391 0.164
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.45 on 4992 degrees of freedom
Multiple R-squared: 0.2976, Adjusted R-squared: 0.2966
F-statistic: 302.1 on 7 and 4992 DF, p-value: < 2.2e-16
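The fitting call itself does not appear in the rendered output. Judging from the `Call:` line above and the later use of an object named `model`, it was presumably along the lines of the sketch below. To keep the sketch runnable on its own, it uses `sim`, a simulated stand-in data frame with an assumed data-generating rule; neither the data nor the resulting estimates are those of the report.

```r
set.seed(42)  # reproducible simulated data
n <- 200

# Hypothetical stand-in for energy_consumption (simulated, NOT the Kaggle data)
sim <- data.frame(
  building_size_m2 = sample(c(17, 24, 45, 52, 77), n, replace = TRUE),
  occupants        = sample(1:4, n, replace = TRUE),
  customer_type    = factor(sample(c("commercial", "residential"), n, replace = TRUE)),
  regions          = factor(sample(c("Midwest", "North", "Northeast", "South", "Southeast"),
                                   n, replace = TRUE))
)

# Assumed data-generating rule, chosen only so the example has some signal
sim$energy_cost_brl <- 50 + 0.15 * sim$building_size_m2 +
  12 * sim$occupants + rnorm(n, sd = 20)

# Same formula as in the Call: line of the summary output
model <- lm(energy_cost_brl ~ building_size_m2 + occupants + customer_type + regions,
            data = sim)

# Coefficient table: estimate, standard error, t value, p-value
print(coef(summary(model)))
```

Because `customer_type` and `regions` are categorical, `lm()` dummy-codes them against the alphabetically first level, which is why the output above reports `customer_typeresidential` and four region terms relative to commercial customers in the Midwest.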
Model Fit
Multiple R² = 0.2976
About 29.8% of the variation in energy cost is explained by the predictors.
Adjusted R² = 0.2966
The adjustment for the number of predictors is negligible; the fit remains modest.
F-statistic = 302.1, p < 2.2e-16
The overall model is highly significant; at least one predictor is associated with energy cost.
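As a quick consistency check, the adjusted R² and the overall F-statistic can be recomputed from the reported R², the sample size, and the number of predictors using the standard formulas. The sketch below uses only the figures printed in the summary output; because R² is rounded to four decimals there, the recomputed F agrees with the reported 302.1 only to within rounding.

```r
# Values reported in the summary output for Model 1
r2 <- 0.2976   # multiple R-squared
n  <- 5000     # number of observations
p  <- 7        # estimated slopes (excluding the intercept)

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F: explained variance per predictor over residual variance per residual df
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

cat("Adjusted R-squared:", round(adj_r2, 4), "\n")
cat("F-statistic:", round(f_stat, 1), "\n")
```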
Summary
Occupants are the most influential factor; energy cost rises by about 12 BRL per additional occupant.
Building size also significantly affects cost, but less strongly.
Customer type (residential vs commercial) shows no meaningful difference when building size and occupancy are controlled for, suggesting usage patterns matter more than classification.
Regional differences are small and statistically insignificant, meaning energy cost patterns are relatively consistent across regions.
Model fit (R² ≈ 0.30) indicates a modestly predictive model; about 70% of energy cost variation may come from unobserved factors (e.g., appliance efficiency, weather, behavior).
Model Fit Visualization
ggplot(energy_consumption, aes(x = fitted(model), y = residuals(model))) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "Residual Plot", x = "Fitted Values", y = "Residuals")
Model Diagnostics
par(mfrow = c(2,2))
plot(model)
Interaction Effects for deeper insights
model2 <- lm(energy_cost_brl ~ building_size_m2 + occupants * customer_type + regions,
data = energy_consumption)
summary(model2)
Call:
lm(formula = energy_cost_brl ~ building_size_m2 + occupants *
customer_type + regions, data = energy_consumption)
Residuals:
Min 1Q Median 3Q Max
-48.102 -14.745 -0.137 14.363 52.102
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 54.59470 1.36349 40.041 <2e-16 ***
building_size_m2 0.14199 0.01681 8.446 <2e-16 ***
occupants 11.34834 0.48535 23.382 <2e-16 ***
customer_typeresidential -3.07993 1.48171 -2.079 0.0377 *
regionsNorth 0.69741 0.87173 0.800 0.4237
regionsNortheast 1.18556 0.73368 1.616 0.1062
regionsSouth 0.22172 1.03720 0.214 0.8307
regionsSoutheast 1.40599 1.03776 1.355 0.1755
occupants:customer_typeresidential 1.30780 0.59285 2.206 0.0274 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.44 on 4991 degrees of freedom
Multiple R-squared: 0.2983, Adjusted R-squared: 0.2971
F-statistic: 265.2 on 8 and 4991 DF, p-value: < 2.2e-16
| Variable | Estimate | p-value | Significance | Interpretation |
|---|---|---|---|---|
| (Intercept) | 54.59 | <0.001 | *** | Baseline monthly energy cost for commercial customers in the Midwest with 0 occupants and a building size of 0 m² (an extrapolation outside the observed data). |
| building_size_m2 | 0.142 | <0.001 | *** | Each additional m² increases energy cost by 0.14 BRL, controlling for other variables. |
| occupants | 11.35 | <0.001 | *** | For commercial properties, each extra occupant adds 11.35 BRL to energy cost. |
| customer_typeresidential | -3.08 | 0.0377 | * | Residential customers, on average, have 3.08 BRL lower base energy costs compared to commercial, when occupants = 0. |
| occupants × customer_typeresidential | 1.31 | 0.0274 | * | The effect of each additional occupant on energy cost is 1.31 BRL higher in residential buildings than in commercial ones. |
| regions (all) | — | >0.1 | n.s. | No statistically significant regional differences. |
Table 2: Interaction Effect (deeper insights)
Interpreting the Interaction Effect
This is the key result:
Residential customers’ energy cost increases more sharply with each additional occupant than commercial ones.
For commercial customers:
$$\text{Effect of occupants} = 11.35 \, \text{BRL per occupant}$$
For residential customers:
$$11.35 + 1.31 = 12.66 \, \text{BRL per occupant}$$
- So, each extra occupant in a residential property adds ~12.7 BRL/month, while in a commercial property, it adds ~11.4 BRL/month.
This suggests that residential energy use is more sensitive to occupancy, possibly because home activities are more energy-intensive per person (e.g., appliances, lighting, air conditioning, etc.).
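The slope arithmetic above can be reproduced directly from the Model 2 coefficients. The sketch below copies the three relevant estimates from the summary output and also computes the occupancy at which the two fitted lines cross, which follows from the same coefficients. The names `gap` and `crossover` are introduced here purely for illustration, and the point estimates carry sizeable standard errors, so the crossover should be read as indicative only.

```r
# Coefficients copied from the Model 2 summary output
b_occ   <- 11.34834   # occupants slope for the reference level (commercial)
b_res   <- -3.07993   # residential shift in the intercept
b_inter <- 1.30780    # extra occupants slope for residential customers

# Simple slopes: BRL added per extra occupant, by customer type
slope_commercial  <- b_occ
slope_residential <- b_occ + b_inter

# Residential-minus-commercial gap at a given occupancy, other predictors held fixed
gap <- function(occupants) b_res + b_inter * occupants

# Occupancy at which the two fitted lines cross (gap equals zero)
crossover <- -b_res / b_inter

print(c(commercial = slope_commercial, residential = slope_residential))
print(round(gap(1:4), 2))
print(crossover)
```

Under these estimates the fitted residential line lies below the commercial one at one or two occupants and above it at three or four, with the crossover between two and three occupants.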
Model Fit
Multiple R²: 0.2983 (slightly improved from 0.2976).
Adjusted R²: 0.2971 (up from 0.2966). Still a modest fit; the interaction improves explanatory power only slightly.
F-statistic: 265.2 (p < 2.2e-16). The model remains highly significant overall.
Summary of Findings
Building size and occupancy remain the dominant drivers of energy cost.
Residential properties have slightly lower baseline costs than commercial ones.
However, as the number of occupants increases, residential costs rise more sharply, implying stronger person-level energy sensitivity in households.
Regional differences are not statistically significant.
Model improvement (ΔR² ≈ 0.001) shows that the interaction effect adds a theoretically meaningful but statistically mild improvement, possibly because the data is synthetic.
Optional Visualization
ggplot(energy_consumption, aes(x = occupants, y = energy_cost_brl, color = customer_type)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm", se = FALSE) +
labs(
title = "Interaction Between Occupants and Customer Type on Energy Cost",
x = "Number of Occupants",
y = "Monthly Energy Cost (BRL)"
) +
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Results
Multiple Linear Regression (Model 1)
A multiple linear regression model was first fitted to examine the effects of building size, number of occupants, customer type, and region on monthly energy cost. The model was statistically significant overall (F(7, 4992) = 302.1, p < 0.001), explaining approximately 29.8% of the total variance in energy cost (R² = 0.2976; Adjusted R² = 0.2966).
The results (Table 3) show that both building size (β = 0.142, p < 0.001) and occupants (β = 12.22, p < 0.001) were significant predictors of energy cost. Each additional occupant increased the monthly energy cost by an average of 12.2 BRL, while each additional square meter of building size increased cost by about 0.14 BRL. Neither customer type nor regional location significantly predicted energy cost when controlling for these factors.
| Predictor | Estimate (β) | Std. Error | t-value | p-value | Interpretation |
|---|---|---|---|---|---|
| (Intercept) | 52.63 | 1.03 | 51.04 | <0.001 | Baseline energy cost |
| Building size (m²) | 0.142 | 0.017 | 8.43 | <0.001 | Larger buildings → higher cost |
| Occupants | 12.22 | 0.29 | 42.85 | <0.001 | More occupants → higher cost |
| Customer type (residential) | -0.099 | 0.61 | -0.16 | 0.871 | Not significant |
| Region (North, etc.) | — | — | — | >0.1 | No significant regional effect |
Table 3: Multiple linear regression model
These findings suggest that occupancy intensity and physical building size are the principal drivers of energy expenditure across both residential and commercial sectors.
Interaction Model (Model 2)
To further explore whether the relationship between occupancy and energy cost differs by customer type, an interaction term between occupants and customer type was introduced (Model 2). The model remained significant overall (F(8, 4991) = 265.2, p < 0.001), with a slight improvement in explanatory power (Adjusted R² = 0.2971).
The results (Table 4 below) indicate that the interaction term (occupants × customer type) was statistically significant (β = 1.31, p = 0.027), suggesting that the effect of occupancy on energy cost depends on whether the building is residential or commercial.
| Predictor | Estimate (β) | Std. Error | t-value | p-value | Interpretation |
|---|---|---|---|---|---|
| (Intercept) | 54.59 | 1.36 | 40.04 | <0.001 | Baseline (commercial) |
| Building size (m²) | 0.142 | 0.017 | 8.45 | <0.001 | Larger buildings → higher cost |
| Occupants | 11.35 | 0.49 | 23.38 | <0.001 | Strong positive effect (commercial) |
| Customer type (residential) | -3.08 | 1.48 | -2.08 | 0.038 | Residential buildings have lower base cost |
| Occupants × Customer type | 1.31 | 0.59 | 2.21 | 0.027 | Occupant effect stronger in residential properties |
| Region (North, etc.) | — | — | — | >0.1 | No significant regional effect |
Table 4: Interaction Effect
Interpretation of the Interaction Effect
The interaction between occupants and customer type indicates that residential energy costs are more sensitive to changes in occupancy than commercial energy costs. Specifically, for commercial buildings, each additional occupant increases energy cost by approximately 11.35 BRL per month, whereas for residential buildings, the increase is 11.35 + 1.31 = 12.66 BRL per occupant.
This pattern suggests that residential buildings experience proportionally greater increases in energy use as more people occupy the premises, possibly due to higher per-person appliance usage, lighting, and cooling demands. Conversely, commercial buildings may benefit from shared energy usage efficiencies across occupants (e.g., centralized lighting or equipment).
Model Comparison
| Model | Description | Adjusted R² | F-statistic | p-value | Key Insight |
|---|---|---|---|---|---|
| Model 1 | Base model without interaction | 0.2966 | 302.1 | <0.001 | Building size and occupants are key predictors |
| Model 2 | Includes Occupants × Customer Type interaction | 0.2971 | 265.2 | <0.001 | Residential occupancy has a stronger effect |
Table 5: Model Comparison
Although the inclusion of the interaction term produced only a marginal increase in explanatory power (ΔAdjusted R² = 0.0005), it added theoretical insight by revealing distinct patterns of energy sensitivity between customer types.
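Beyond comparing adjusted R², nested linear models such as these can be compared formally with a partial F-test via `anova()`. The sketch below illustrates the idea on `sim`, a simulated stand-in data frame with an assumed data-generating rule (not the report's data), using reduced formulas for brevity. When a single term is added, the partial F equals the squared t value of that term, so for the reported interaction (t = 2.206) the corresponding F would be about 4.87.

```r
set.seed(1)  # reproducible simulated data
n <- 500

# Simulated stand-in data (hypothetical; NOT the Kaggle dataset)
sim <- data.frame(
  occupants     = sample(1:4, n, replace = TRUE),
  customer_type = factor(sample(c("commercial", "residential"), n, replace = TRUE))
)

# Assumed rule: the residential slope is 1.5 BRL steeper than the commercial one
sim$energy_cost_brl <- 55 + 11 * sim$occupants +
  1.5 * sim$occupants * (sim$customer_type == "residential") +
  rnorm(n, sd = 20)

m1 <- lm(energy_cost_brl ~ occupants + customer_type, data = sim)  # no interaction
m2 <- lm(energy_cost_brl ~ occupants * customer_type, data = sim)  # with interaction

# Partial F-test for the nested pair
cmp <- anova(m1, m2)
print(cmp)

# With one added term, the partial F equals the squared t value of that term
t_inter <- summary(m2)$coefficients["occupants:customer_typeresidential", "t value"]
print(c(partial_F = cmp$F[2], t_squared = t_inter^2))
```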
Visualization
A scatterplot with regression lines illustrates the interaction between occupancy and customer type. The steeper slope for residential customers confirms that energy cost rises more sharply with additional occupants in residential settings compared to commercial ones.
ggplot(energy_consumption, aes(x = occupants, y = energy_cost_brl, color = customer_type)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Interaction Between Occupants and Customer Type on Energy Cost",
x = "Number of Occupants",
y = "Monthly Energy Cost (BRL)") +
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Summary
In summary, the analysis demonstrates that:
Energy costs are primarily driven by building size and number of occupants.
Residential buildings exhibit higher per-occupant energy sensitivity than commercial buildings.
Regional variation is negligible, implying consistent cost behavior across geographic zones.
The final model accounts for approximately 30% of the variation in monthly energy costs, suggesting that behavioral and technological factors (e.g., appliance efficiency, insulation, and consumption patterns) may explain the remaining variability.
Customer Segmentation (Clustering)
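# Standardise the numeric features so each contributes equally to the distances
df_scaled <- scale(energy_consumption %>%
  select(building_size_m2, occupants, energy_cost_brl))
# k-means starts from random centres: fix the seed and use several restarts
set.seed(123)
kmeans_result <- kmeans(df_scaled, centers = 3, nstart = 25)
Testing nonlinear machine learning models to see if they improve predictive performance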
- Random Forest (nonlinear, tree-based ensemble)
- XGBoost (gradient boosting algorithm)
# Ensure categorical variables are factors
energy_consumption <- energy_consumption %>%
mutate(
customer_type = as.factor(customer_type),
regions = as.factor(regions)
)
# Partition data (80% training, 20% testing)
set.seed(123)
train_index <- createDataPartition(energy_consumption$energy_cost_brl, p = 0.8, list = FALSE)
train_data <- energy_consumption[train_index, ]
test_data <- energy_consumption[-train_index, ]
Random Forest Model
# Train Random Forest model
set.seed(123)
rf_model <- randomForest(
energy_cost_brl ~ building_size_m2 + occupants + customer_type + regions,
data = train_data,
ntree = 500,
mtry = 3,
importance = TRUE
)
# Evaluate model performance
rf_pred <- predict(rf_model, newdata = test_data)
rf_rmse <- sqrt(mean((rf_pred - test_data$energy_cost_brl)^2))
rf_r2 <- cor(rf_pred, test_data$energy_cost_brl)^2
cat("Random Forest RMSE:", rf_rmse, "\n")
Random Forest RMSE: 15.62949
cat("Random Forest R²:", rf_r2, "\n")
Random Forest R²: 0.5631287
# Variable importance plot
varImpPlot(rf_model, main = "Variable Importance - Random Forest")
XGBoost Model
# Prepare matrices for xgboost
train_matrix <- model.matrix(energy_cost_brl ~ building_size_m2 + occupants + customer_type + regions, data = train_data)
test_matrix <- model.matrix(energy_cost_brl ~ building_size_m2 + occupants + customer_type + regions, data = test_data)
train_label <- train_data$energy_cost_brl
test_label <- test_data$energy_cost_brl
# Convert to xgb.DMatrix
dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)
dtest <- xgb.DMatrix(data = test_matrix, label = test_label)
# Train XGBoost model
set.seed(123)
xgb_model <- xgboost(
data = dtrain,
nrounds = 200,
objective = "reg:squarederror",
eta = 0.1,
max_depth = 4,
subsample = 0.8,
colsample_bytree = 0.8,
verbose = 0
)
# Predictions and performance
xgb_pred <- predict(xgb_model, dtest)
xgb_rmse <- sqrt(mean((xgb_pred - test_label)^2))
xgb_r2 <- cor(xgb_pred, test_label)^2
cat("XGBoost RMSE:", xgb_rmse, "\n")
XGBoost RMSE: 15.5571
cat("XGBoost R²:", xgb_r2, "\n")
XGBoost R²: 0.5664746
Results and Discussion
Model Performance Overview
Three models were developed to predict monthly energy cost (energy_cost_brl) using building characteristics, occupancy, customer type, and regional information: a multiple linear regression (MLR), a Random Forest (RF) model, and an Extreme Gradient Boosting (XGBoost) model.
Model performance was evaluated using the Root Mean Square Error (RMSE) and the Coefficient of Determination (R²) as shown in Table 6 below.
| Model | RMSE | R² | Interpretation |
|---|---|---|---|
| Linear Regression | — | 0.298 | Baseline linear model; modest fit (in-sample R² on the full data) |
| Random Forest | 15.63 | 0.563 | Captures nonlinear and interaction effects |
| XGBoost | 15.56 | 0.566 | Highest predictive accuracy among models |
Table 6 Model performance comparison for predicting energy cost (BRL)
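Two caveats apply when reading Table 6. The linear regression R² comes from the fit on the full dataset, whereas the Random Forest and XGBoost figures are computed on the held-out 20% test set, and the tree-model R² is the squared correlation between predictions and observations rather than 1 − SSE/SST. The sketch below defines small helper functions (hypothetical names `rmse`, `r2_cor`, `r2_sse`) and uses made-up numbers to show how the two R² definitions can diverge when predictions are biased.

```r
# Root mean square error between observed and predicted values
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# R-squared as squared Pearson correlation (the definition used for RF and XGBoost above)
r2_cor <- function(obs, pred) cor(obs, pred)^2

# R-squared as 1 - SSE/SST, which also penalises biased or mis-scaled predictions
r2_sse <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

# Made-up numbers for illustration only
obs  <- c(60, 75, 90, 105, 120)
pred <- obs + 10   # perfectly correlated with obs, but shifted by 10 BRL

print(rmse(obs, pred))    # error of the shifted predictions
print(r2_cor(obs, pred))  # unaffected by the constant shift
print(r2_sse(obs, pred))  # reduced by the constant shift
```

For a like-for-like comparison, the linear model could be refitted on `train_data` and scored on `test_data` with the same helpers, for example by passing `predict(lm_fit, newdata = test_data)` as `pred`, where `lm_fit` is a hypothetical name for the refitted model.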
Model Interpretation
The multiple linear regression model explained approximately 30% of the variance in monthly energy costs (R² = 0.2976). Among all predictors, the number of occupants and building size were statistically significant (p < 0.001), indicating that energy expenditure tends to rise with increased occupancy and larger building areas. However, customer type and regional differences were not statistically significant, suggesting a uniform energy cost structure across regions when controlling for household size and building size.
The Random Forest model substantially improved predictive performance (R² = 0.56), indicating that nearly 56% of the variation in energy cost could be explained by the model. This improvement reflects the ability of tree-based methods to capture nonlinear relationships and complex interactions between predictors.
The XGBoost model performed marginally better (R² = 0.566, RMSE = 15.56). Boosting builds trees sequentially, each one fitted to the residual errors of the ensemble so far. The small gap to Random Forest (ΔR² ≈ 0.003, ΔRMSE ≈ 0.07 BRL) suggests that both models capture essentially the same nonlinearities, and a difference this small on a single train/test split may not be meaningful.
Variable Importance
Variable importance analysis from the Random Forest model (Figure 8) revealed that occupants and building size (m²) were the most influential predictors of energy cost.
customer_type and regions had relatively minor contributions, implying that occupancy intensity (i.e., how many people share the same building space) is the dominant driver of energy consumption, followed by the physical scale of the property.
varImpPlot(rf_model, main = "Figure 8: Variable Importance - Random Forest Model")
Interpretation:
Buildings with more occupants show higher energy costs due to increased use of appliances, lighting, and cooling systems. Similarly, larger buildings tend to consume more energy due to greater area coverage and equipment requirements. The weak influence of regional differences might be due to similar climatic and pricing conditions across the regions in the dataset. Given that this dataset is synthetic, that is not a very surprising observation.
Discussion
The findings highlight the significance of household size and building area as primary determinants of energy expenditure. Nonlinear machine learning models outperform traditional linear regression, indicating that energy consumption behavior is not strictly linear but influenced by complex interactions, such as combinations of occupant density and building size.
The results align with literature emphasizing the role of household composition and building characteristics in energy demand modeling (e.g., [Li et al., 2022]; [Ozturk et al., 2021]). The higher R² values from the Random Forest and XGBoost models also suggest potential applications in energy cost forecasting, customer segmentation, and targeted energy-efficiency interventions.
Reference
Li, X., Yao, R., & Wang, J. (2022). Influence of household demographics and building characteristics on residential energy consumption: A comprehensive analysis. Energy and Buildings, 262, 111998.
https://doi.org/10.1016/j.enbuild.2022.111998
Ozturk, M., Aydin, E., & Esen, Ö. (2021). Determinants of household energy consumption in emerging economies: Evidence from micro-level data. Energy, 214, 118858.
https://doi.org/10.1016/j.energy.2020.118858
Data Source: https://www.kaggle.com/datasets/andreylss/residential-and-commercial-energy-cost-dataset
About Dataset
This dataset contains synthetic data representing energy consumption patterns for 5,000 customers across different regions. The data includes both residential and commercial properties, with information about building characteristics, occupancy, and monthly energy costs.