In this data dive, I extended the simple linear regression model from last week by adding additional predictors and then evaluating whether the updated model satisfies the assumptions required for valid regression inference.
The assignment asks for a model with 2–4 terms, along with a discussion of why each variable should or should not be included, whether multi-colinearity is present, and what the diagnostic plots reveal about the quality of the model. My goal here is not only to find a useful model, but also to determine whether the model is trustworthy enough to interpret.
energy <- read_csv("Building_Energy_Benchmarking_Data__2015-Present.csv")
glimpse(energy)
## Rows: 34,699
## Columns: 46
## $ OSEBuildingID <dbl> 1, 2, 3, 5, 8, 9, 10, 11, 12, 13, 15, …
## $ DataYear <dbl> 2024, 2024, 2024, 2024, 2024, 2024, 20…
## $ BuildingName <chr> "MAYFLOWER PARK HOTEL", "PARAMOUNT HOT…
## $ BuildingType <chr> "NonResidential", "NonResidential", "N…
## $ TaxParcelIdentificationNumber <chr> "659000030", "659000220", "659000475",…
## $ Address <chr> "405 OLIVE WAY", "724 PINE ST", "1900 …
## $ City <chr> "SEATTLE", "SEATTLE", "SEATTLE", "SEAT…
## $ State <chr> "WA", "WA", "WA", "WA", "WA", "WA", "W…
## $ ZipCode <dbl> 98101, 98101, 98101, 98101, 98121, 981…
## $ Latitude <dbl> 47.61220, 47.61307, 47.61367, 47.61412…
## $ Longitude <dbl> -122.3380, -122.3336, -122.3382, -122.…
## $ Neighborhood <chr> "DOWNTOWN", "DOWNTOWN", "DOWNTOWN", "D…
## $ CouncilDistrictCode <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 1, 1, 7, 7, 7,…
## $ YearBuilt <dbl> 1927, 1996, 1969, 1926, 1980, 1999, 19…
## $ NumberofFloors <dbl> 12, 11, 41, 10, 18, 2, 11, 8, 15, 6, 1…
## $ NumberofBuildings <dbl> 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ PropertyGFATotal <dbl> 88434, 103566, 956110, 61320, 175580, …
## $ PropertyGFABuildings <dbl> 88434, 88502, 759392, 61320, 113580, 6…
## $ PropertyGFAParking <dbl> 0, 15064, 196718, 0, 62000, 37198, 0, …
## $ SelfReportGFATotal <dbl> 115387, 103566, 947059, 61320, 203942,…
## $ SelfReportGFABuildings <dbl> 115387, 88502, 827566, 61320, 123445, …
## $ SelfReportParking <dbl> 0, 15064, 119493, 0, 80497, 40971, 0, …
## $ ENERGYSTARScore <dbl> 59, 85, 71, 50, 87, NA, 10, NA, 58, 60…
## $ `SiteEUIWN(kBtu/sf)` <dbl> 62.2, 71.9, 82.0, 87.2, 97.6, 168.5, 1…
## $ `SiteEUI(kBtu/sf)` <dbl> 61.7, 71.5, 81.7, 86.0, 97.1, 167.8, 1…
## $ `SiteEnergyUse(kBtu)` <dbl> 7113958, 6330664, 67613264, 5273981, 1…
## $ `SiteEnergyUseWN(kBtu)` <dbl> 7172158, 6362478, 67852608, 5346375, 1…
## $ `SourceEUIWN(kBtu/sf)` <dbl> 122.9, 128.7, 171.8, 174.7, 167.6, 379…
## $ `SourceEUI(kBtu/sf)` <dbl> 121.4, 128.3, 171.5, 171.4, 167.2, 379…
## $ EPAPropertyType <chr> "Hotel", "Hotel", "Hotel", "Hotel", "H…
## $ LargestPropertyUseType <chr> "Hotel", "Hotel", "Hotel", "Hotel", "H…
## $ LargestPropertyUseTypeGFA <dbl> 115387, 88502, 827566, 61320, 123445, …
## $ SecondLargestPropertyUseType <chr> NA, "Parking", "Parking", NA, "Parking…
## $ SecondLargestPropertyUseTypeGFA <dbl> NA, 15064, 117783, NA, 68009, 40971, N…
## $ ThirdLargestPropertyUseType <chr> NA, NA, "Swimming Pool", NA, "Swimming…
## $ ThirdLargestPropertyUseTypeGFA <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 55…
## $ `Electricity(kWh)` <dbl> 1045040, 787838, 11279080, 796976, 134…
## $ `SteamUse(kBtu)` <dbl> 1949686, NA, 23256386, 1389935, NA, NA…
## $ `NaturalGas(therms)` <dbl> 15986, 36426, 58726, 11648, 73811, 262…
## $ ComplianceStatus <chr> "Not Compliant", "Compliant", "Complia…
## $ ComplianceIssue <chr> "Default Data", "No Issue", "No Issue"…
## $ `Electricity(kBtu)` <dbl> 3565676, 2688104, 38484221, 2719282, 4…
## $ `NaturalGas(kBtu)` <dbl> 1598590, 3642560, 5872650, 1164760, 73…
## $ TotalGHGEmissions <dbl> 263.3, 208.6, 2418.2, 190.1, 417.9, 17…
## $ GHGEmissionsIntensity <dbl> 2.98, 2.36, 3.18, 3.10, 3.68, 2.88, 3.…
## $ Demolished <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
For this analysis, I used total greenhouse gas emissions as the response variable. I also selected a small set of predictors that are both interpretable and likely to affect emissions:
I also created a log-transformed size variable because building area is typically right-skewed, and a binary office indicator to keep the categorical structure simple.
analysis_df <- energy %>%
transmute(
total_ghg_emissions = TotalGHGEmissions,
property_gfa_total = PropertyGFATotal,
energy_star_score = ENERGYSTARScore,
largest_property_use_type = LargestPropertyUseType
) %>%
filter(
!is.na(total_ghg_emissions),
!is.na(property_gfa_total),
!is.na(energy_star_score),
!is.na(largest_property_use_type),
total_ghg_emissions > 0,
property_gfa_total > 0
) %>%
mutate(
log_total_ghg = log(total_ghg_emissions),
log_gfa = log(property_gfa_total),
office_building = if_else(
str_detect(str_to_lower(largest_property_use_type), "office"),
1, 0
)
)
analysis_df %>%
summarize(
n = n(),
min_ghg = min(total_ghg_emissions),
max_ghg = max(total_ghg_emissions),
min_gfa = min(property_gfa_total),
max_gfa = max(property_gfa_total)
) %>%
kable()
| n | min_ghg | max_ghg | min_gfa | max_gfa |
|---|---|---|---|---|
| 25405 | 0.1 | 475355.3 | 19488 | 2080885 |
log_gfaI included building size because larger buildings generally require more energy and therefore tend to produce more emissions. I used the logarithm of square footage instead of the raw value because the floor area variable is heavily right-skewed. The log transformation reduces the influence of extremely large buildings and makes the relationship more linear.
energy_star_scoreI included ENERGY STAR score because it is a direct measure of building efficiency. In theory, buildings with higher energy efficiency scores should have lower emissions, after accounting for size. This makes it a meaningful predictor from both a substantive and modeling perspective.
office_buildingI created a binary variable for office buildings versus non-office buildings. This satisfies the assignment requirement to try a binary term and gives the model a simple way to capture broad differences in how building use may affect emissions.
log_gfa:office_buildingI included an interaction between building size and office status. This allows the relationship between floor area and emissions to differ between office and non-office buildings. Without this interaction, the model would assume the same size-emissions relationship across all building uses.
p1 <- ggplot(analysis_df, aes(x = property_gfa_total)) +
geom_histogram(bins = 30) +
labs(title = "Distribution of PropertyGFATotal")
p2 <- ggplot(analysis_df, aes(x = log_gfa, y = log_total_ghg)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm", se = TRUE) +
labs(title = "Log GFA vs. Log Total GHG Emissions")
p3 <- ggplot(analysis_df, aes(x = factor(office_building), y = log_total_ghg)) +
geom_boxplot() +
labs(
title = "Log Total GHG by Office Indicator",
x = "Office building (0 = No, 1 = Yes)"
)
p1
p2
p3
These initial plots support the decision to use the transformed size variable. The relationship between raw square footage and emissions is highly skewed, while the log-log relationship appears more linear and easier to model. The office indicator may also matter, although the difference is not extreme enough to assume a large effect without fitting the model.
I used a model with four terms: two main effects, one binary effect, and one interaction term.
model <- lm(
log_total_ghg ~ log_gfa + energy_star_score + office_building + log_gfa:office_building,
data = analysis_df
)
summary(model)
##
## Call:
## lm(formula = log_total_ghg ~ log_gfa + energy_star_score + office_building +
## log_gfa:office_building, data = analysis_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3419 -0.9075 -0.0172 0.8482 6.8635
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.2450786 0.1009890 -91.55 <2e-16 ***
## log_gfa 1.2557864 0.0091453 137.31 <2e-16 ***
## energy_star_score -0.0164447 0.0002692 -61.10 <2e-16 ***
## office_building 3.0206749 0.1958957 15.42 <2e-16 ***
## log_gfa:office_building -0.2835562 0.0171607 -16.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.082 on 25400 degrees of freedom
## Multiple R-squared: 0.5059, Adjusted R-squared: 0.5058
## F-statistic: 6502 on 4 and 25400 DF, p-value: < 2.2e-16
Before interpreting the model, I checked variance inflation factors (VIFs) to see whether the predictors are too strongly correlated with one another.
vif_values <- vif(model)
vif_values
## log_gfa energy_star_score office_building
## 1.483696 1.020365 117.793164
## log_gfa:office_building
## 121.319609
Interpretation:
As a rule of thumb, VIF values around 1 indicate little to no
multi-colinearity, values above 5 suggest a moderate concern, and values
above 10 indicate a serious problem. If the VIF values here remain
comfortably low, then multi-colinearity is not a major reason to exclude
any of the chosen variables.
The next step is to evaluate whether the assumptions of linear regression are reasonably satisfied. I use five diagnostic views:
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))
cooks <- cooks.distance(model)
plot(
cooks,
type = "h",
main = "Cook's Distance",
ylab = "Cook's Distance",
xlab = "Observation Index"
)
abline(h = 4 / nrow(analysis_df), lty = 2)
The Residuals vs Fitted plot was used to assess whether the linearity assumption is reasonable and whether residuals are centered around zero. In this model, the residuals appear mostly scattered around the horizontal line at zero, suggesting that the linear relationship is generally appropriate. However, there is a slight pattern in the spread of residuals, indicating mild heteroscedasticity. This suggests that while the model captures the overall trend, variability increases slightly for larger fitted values.
The Normal Q-Q plot was used to evaluate whether the residuals follow a normal distribution. Most of the points lie close to the reference line, indicating that the normality assumption is reasonably satisfied. There are small deviations at the tails, which is expected in real-world data and does not appear severe enough to invalidate the model.
The Scale-Location plot was used to check for constant variance of residuals. The points show a slight upward trend, indicating that the spread of residuals increases with fitted values. This suggests mild heteroscedasticity, meaning the variance is not perfectly constant. While this is not severe, it indicates that model predictions may be less reliable at higher fitted values.
The Residuals vs Leverage plot was examined to identify potentially influential observations. A few points exhibit moderate leverage, but none appear to exceed Cook’s distance thresholds. This suggests that no single observation is exerting excessive influence on the model estimates.
Cook’s Distance was used to further assess the influence of individual observations. Most observations have very low Cook’s distance values, indicating limited influence. There are no clear outliers with disproportionately large values, suggesting that the model is not overly dependent on any single data point.
Overall, the diagnostic plots suggest that the model assumptions are reasonably satisfied, though mild heteroscedasticity remains. This indicates that while the model is valid for interpretation, further improvements such as transformation or robust regression could enhance performance.
model_glance <- glance(model)
model_glance %>%
select(r.squared, adj.r.squared, sigma, statistic, p.value, AIC, BIC) %>%
kable(digits = 3)
| r.squared | adj.r.squared | sigma | statistic | p.value | AIC | BIC |
|---|---|---|---|---|---|---|
| 0.506 | 0.506 | 1.082 | 6501.82 | 0 | 76124.57 | 76173.43 |
This model is useful only if the diagnostics support its assumptions. A statistically significant model with poor diagnostics can still produce misleading conclusions, so the diagnostic review is just as important as the coefficient table.
tidy(model, conf.int = TRUE) %>%
kable(digits = 4)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -9.2451 | 0.1010 | -91.5454 | 0 | -9.4430 | -9.0471 |
| log_gfa | 1.2558 | 0.0091 | 137.3145 | 0 | 1.2379 | 1.2737 |
| energy_star_score | -0.0164 | 0.0003 | -61.0980 | 0 | -0.0170 | -0.0159 |
| office_building | 3.0207 | 0.1959 | 15.4198 | 0 | 2.6367 | 3.4046 |
| log_gfa:office_building | -0.2836 | 0.0172 | -16.5236 | 0 | -0.3172 | -0.2499 |
Because the response variable is log-transformed, the coefficients are interpreted in terms of proportional (percentage) changes in emissions rather than absolute changes.
log_gfa: represents the percentage change in emissions associated with a proportional increase in building size, holding other variables constant. Larger buildings are expected to have higher emissions, though the relationship is modeled on a log scale.
energy_star_score: captures how building efficiency is associated with emissions. A higher energy score is expected to correspond to lower emissions, holding building size and type constant.
office_building: represents the difference in emissions between office and non-office buildings, controlling for size and efficiency.
log_gfa:office_building: indicates whether the relationship between building size and emissions differs for office buildings compared to non-office buildings. This term allows the slope of building size to vary by building type.
Some useful next questions would be:
Overall, the expanded model improves on a one-variable regression by incorporating building size, efficiency, and building-use type. These variables are substantively meaningful and provide a more realistic explanation of the factors driving building emissions. In particular, the inclusion of an interaction term allows the relationship between building size and emissions to vary across building types, which better reflects real-world differences.
The diagnostic analysis provides important context for evaluating the validity of the model. The residual plots suggest that the core assumptions are reasonably satisfied, though there is evidence of mild heteroscedasticity. While this does not invalidate the model, it indicates that variability in emissions increases for larger fitted values, and predictions may be less precise in those regions.
A key takeaway from this analysis is that regression should not stop at coefficient interpretation. Even a statistically significant model can produce misleading conclusions if its assumptions are violated. Diagnostic evaluation is therefore essential to determine whether the model is reliable enough to support interpretation.
One remaining question is whether additional variables—such as building age, occupancy patterns, or energy usage characteristics—could further improve model performance and reduce unexplained variation. Exploring these factors would be a valuable next step for strengthening the analysis.