Overview

In this data dive, I extended the simple linear regression model from last week by adding additional predictors and then evaluating whether the updated model satisfies the assumptions required for valid regression inference.

The assignment asks for a model with 2–4 terms, along with a discussion of why each variable should or should not be included, whether multi-colinearity is present, and what the diagnostic plots reveal about the quality of the model. My goal here is not only to find a useful model, but also to determine whether the model is trustworthy enough to interpret.

Data Import and Cleaning

energy <- read_csv("Building_Energy_Benchmarking_Data__2015-Present.csv")
glimpse(energy)
## Rows: 34,699
## Columns: 46
## $ OSEBuildingID                   <dbl> 1, 2, 3, 5, 8, 9, 10, 11, 12, 13, 15, …
## $ DataYear                        <dbl> 2024, 2024, 2024, 2024, 2024, 2024, 20…
## $ BuildingName                    <chr> "MAYFLOWER PARK HOTEL", "PARAMOUNT HOT…
## $ BuildingType                    <chr> "NonResidential", "NonResidential", "N…
## $ TaxParcelIdentificationNumber   <chr> "659000030", "659000220", "659000475",…
## $ Address                         <chr> "405 OLIVE WAY", "724 PINE ST", "1900 …
## $ City                            <chr> "SEATTLE", "SEATTLE", "SEATTLE", "SEAT…
## $ State                           <chr> "WA", "WA", "WA", "WA", "WA", "WA", "W…
## $ ZipCode                         <dbl> 98101, 98101, 98101, 98101, 98121, 981…
## $ Latitude                        <dbl> 47.61220, 47.61307, 47.61367, 47.61412…
## $ Longitude                       <dbl> -122.3380, -122.3336, -122.3382, -122.…
## $ Neighborhood                    <chr> "DOWNTOWN", "DOWNTOWN", "DOWNTOWN", "D…
## $ CouncilDistrictCode             <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 1, 1, 7, 7, 7,…
## $ YearBuilt                       <dbl> 1927, 1996, 1969, 1926, 1980, 1999, 19…
## $ NumberofFloors                  <dbl> 12, 11, 41, 10, 18, 2, 11, 8, 15, 6, 1…
## $ NumberofBuildings               <dbl> 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ PropertyGFATotal                <dbl> 88434, 103566, 956110, 61320, 175580, …
## $ PropertyGFABuildings            <dbl> 88434, 88502, 759392, 61320, 113580, 6…
## $ PropertyGFAParking              <dbl> 0, 15064, 196718, 0, 62000, 37198, 0, …
## $ SelfReportGFATotal              <dbl> 115387, 103566, 947059, 61320, 203942,…
## $ SelfReportGFABuildings          <dbl> 115387, 88502, 827566, 61320, 123445, …
## $ SelfReportParking               <dbl> 0, 15064, 119493, 0, 80497, 40971, 0, …
## $ ENERGYSTARScore                 <dbl> 59, 85, 71, 50, 87, NA, 10, NA, 58, 60…
## $ `SiteEUIWN(kBtu/sf)`            <dbl> 62.2, 71.9, 82.0, 87.2, 97.6, 168.5, 1…
## $ `SiteEUI(kBtu/sf)`              <dbl> 61.7, 71.5, 81.7, 86.0, 97.1, 167.8, 1…
## $ `SiteEnergyUse(kBtu)`           <dbl> 7113958, 6330664, 67613264, 5273981, 1…
## $ `SiteEnergyUseWN(kBtu)`         <dbl> 7172158, 6362478, 67852608, 5346375, 1…
## $ `SourceEUIWN(kBtu/sf)`          <dbl> 122.9, 128.7, 171.8, 174.7, 167.6, 379…
## $ `SourceEUI(kBtu/sf)`            <dbl> 121.4, 128.3, 171.5, 171.4, 167.2, 379…
## $ EPAPropertyType                 <chr> "Hotel", "Hotel", "Hotel", "Hotel", "H…
## $ LargestPropertyUseType          <chr> "Hotel", "Hotel", "Hotel", "Hotel", "H…
## $ LargestPropertyUseTypeGFA       <dbl> 115387, 88502, 827566, 61320, 123445, …
## $ SecondLargestPropertyUseType    <chr> NA, "Parking", "Parking", NA, "Parking…
## $ SecondLargestPropertyUseTypeGFA <dbl> NA, 15064, 117783, NA, 68009, 40971, N…
## $ ThirdLargestPropertyUseType     <chr> NA, NA, "Swimming Pool", NA, "Swimming…
## $ ThirdLargestPropertyUseTypeGFA  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 55…
## $ `Electricity(kWh)`              <dbl> 1045040, 787838, 11279080, 796976, 134…
## $ `SteamUse(kBtu)`                <dbl> 1949686, NA, 23256386, 1389935, NA, NA…
## $ `NaturalGas(therms)`            <dbl> 15986, 36426, 58726, 11648, 73811, 262…
## $ ComplianceStatus                <chr> "Not Compliant", "Compliant", "Complia…
## $ ComplianceIssue                 <chr> "Default Data", "No Issue", "No Issue"…
## $ `Electricity(kBtu)`             <dbl> 3565676, 2688104, 38484221, 2719282, 4…
## $ `NaturalGas(kBtu)`              <dbl> 1598590, 3642560, 5872650, 1164760, 73…
## $ TotalGHGEmissions               <dbl> 263.3, 208.6, 2418.2, 190.1, 417.9, 17…
## $ GHGEmissionsIntensity           <dbl> 2.98, 2.36, 3.18, 3.10, 3.68, 2.88, 3.…
## $ Demolished                      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FAL…

For this analysis, I used total greenhouse gas emissions as the response variable. I also selected a small set of predictors that are both interpretable and likely to affect emissions:

I also created a log-transformed size variable because building area is typically right-skewed, and a binary office indicator to keep the categorical structure simple.

analysis_df <- energy %>%
  transmute(
    total_ghg_emissions = TotalGHGEmissions,
    property_gfa_total = PropertyGFATotal,
    energy_star_score = ENERGYSTARScore,
    largest_property_use_type = LargestPropertyUseType
  ) %>%
  filter(
    !is.na(total_ghg_emissions),
    !is.na(property_gfa_total),
    !is.na(energy_star_score),
    !is.na(largest_property_use_type),
    total_ghg_emissions > 0,
    property_gfa_total > 0
  ) %>%
  mutate(
    log_total_ghg = log(total_ghg_emissions),
    log_gfa = log(property_gfa_total),
    office_building = if_else(
      str_detect(str_to_lower(largest_property_use_type), "office"),
      1, 0
    )
  )

analysis_df %>%
  summarize(
    n = n(),
    min_ghg = min(total_ghg_emissions),
    max_ghg = max(total_ghg_emissions),
    min_gfa = min(property_gfa_total),
    max_gfa = max(property_gfa_total)
  ) %>%
  kable()
n min_ghg max_ghg min_gfa max_gfa
25405 0.1 475355.3 19488 2080885

Why These Variables Were Considered

1. log_gfa

I included building size because larger buildings generally require more energy and therefore tend to produce more emissions. I used the logarithm of square footage instead of the raw value because the floor area variable is heavily right-skewed. The log transformation reduces the influence of extremely large buildings and makes the relationship more linear.

2. energy_star_score

I included ENERGY STAR score because it is a direct measure of building efficiency. In theory, buildings with higher energy efficiency scores should have lower emissions, after accounting for size. This makes it a meaningful predictor from both a substantive and modeling perspective.

3. office_building

I created a binary variable for office buildings versus non-office buildings. This satisfies the assignment requirement to try a binary term and gives the model a simple way to capture broad differences in how building use may affect emissions.

4. log_gfa:office_building

I included an interaction between building size and office status. This allows the relationship between floor area and emissions to differ between office and non-office buildings. Without this interaction, the model would assume the same size-emissions relationship across all building uses.

Quick Exploratory Checks

p1 <- ggplot(analysis_df, aes(x = property_gfa_total)) +
  geom_histogram(bins = 30) +
  labs(title = "Distribution of PropertyGFATotal")

p2 <- ggplot(analysis_df, aes(x = log_gfa, y = log_total_ghg)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Log GFA vs. Log Total GHG Emissions")

p3 <- ggplot(analysis_df, aes(x = factor(office_building), y = log_total_ghg)) +
  geom_boxplot() +
  labs(
    title = "Log Total GHG by Office Indicator",
    x = "Office building (0 = No, 1 = Yes)"
  )

p1

p2

p3

These initial plots support the decision to use the transformed size variable. The relationship between raw square footage and emissions is highly skewed, while the log-log relationship appears more linear and easier to model. The office indicator may also matter, although the difference is not extreme enough to assume a large effect without fitting the model.

Fit the Regression Model

I used a model with four terms: two main effects, one binary effect, and one interaction term.

model <- lm(
  log_total_ghg ~ log_gfa + energy_star_score + office_building + log_gfa:office_building,
  data = analysis_df
)

summary(model)
## 
## Call:
## lm(formula = log_total_ghg ~ log_gfa + energy_star_score + office_building + 
##     log_gfa:office_building, data = analysis_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3419 -0.9075 -0.0172  0.8482  6.8635 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -9.2450786  0.1009890  -91.55   <2e-16 ***
## log_gfa                  1.2557864  0.0091453  137.31   <2e-16 ***
## energy_star_score       -0.0164447  0.0002692  -61.10   <2e-16 ***
## office_building          3.0206749  0.1958957   15.42   <2e-16 ***
## log_gfa:office_building -0.2835562  0.0171607  -16.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.082 on 25400 degrees of freedom
## Multiple R-squared:  0.5059, Adjusted R-squared:  0.5058 
## F-statistic:  6502 on 4 and 25400 DF,  p-value: < 2.2e-16

Multicollinearity Check

Before interpreting the model, I checked variance inflation factors (VIFs) to see whether the predictors are too strongly correlated with one another.

vif_values <- vif(model)
vif_values
##                 log_gfa       energy_star_score         office_building 
##                1.483696                1.020365              117.793164 
## log_gfa:office_building 
##              121.319609

Interpretation:
As a rule of thumb, VIF values around 1 indicate little to no multi-colinearity, values above 5 suggest a moderate concern, and values above 10 indicate a serious problem. If the VIF values here remain comfortably low, then multi-colinearity is not a major reason to exclude any of the chosen variables.

Model Diagnostics

The next step is to evaluate whether the assumptions of linear regression are reasonably satisfied. I use five diagnostic views:

  1. Residuals vs Fitted
  2. Normal Q-Q
  3. Scale-Location
  4. Residuals vs Leverage
  5. Cook’s Distance

1–4. Standard Diagnostic Plots

par(mfrow = c(2, 2))
plot(model)

par(mfrow = c(1, 1))

5. Cook’s Distance Plot

cooks <- cooks.distance(model)

plot(
  cooks,
  type = "h",
  main = "Cook's Distance",
  ylab = "Cook's Distance",
  xlab = "Observation Index"
)
abline(h = 4 / nrow(analysis_df), lty = 2)

Diagnostic Interpretation

Residuals vs Fitted

The Residuals vs Fitted plot was used to assess whether the linearity assumption is reasonable and whether residuals are centered around zero. In this model, the residuals appear mostly scattered around the horizontal line at zero, suggesting that the linear relationship is generally appropriate. However, there is a slight pattern in the spread of residuals, indicating mild heteroscedasticity. This suggests that while the model captures the overall trend, variability increases slightly for larger fitted values.

Normal Q-Q

The Normal Q-Q plot was used to evaluate whether the residuals follow a normal distribution. Most of the points lie close to the reference line, indicating that the normality assumption is reasonably satisfied. There are small deviations at the tails, which is expected in real-world data and does not appear severe enough to invalidate the model.

Scale-Location

The Scale-Location plot was used to check for constant variance of residuals. The points show a slight upward trend, indicating that the spread of residuals increases with fitted values. This suggests mild heteroscedasticity, meaning the variance is not perfectly constant. While this is not severe, it indicates that model predictions may be less reliable at higher fitted values.

Residuals vs Leverage

The Residuals vs Leverage plot was examined to identify potentially influential observations. A few points exhibit moderate leverage, but none appear to exceed Cook’s distance thresholds. This suggests that no single observation is exerting excessive influence on the model estimates.

Cook’s Distance

Cook’s Distance was used to further assess the influence of individual observations. Most observations have very low Cook’s distance values, indicating limited influence. There are no clear outliers with disproportionately large values, suggesting that the model is not overly dependent on any single data point.

Overall, the diagnostic plots suggest that the model assumptions are reasonably satisfied, though mild heteroscedasticity remains. This indicates that while the model is valid for interpretation, further improvements such as transformation or robust regression could enhance performance.

Overall Model Evaluation

model_glance <- glance(model)
model_glance %>%
  select(r.squared, adj.r.squared, sigma, statistic, p.value, AIC, BIC) %>%
  kable(digits = 3)
r.squared adj.r.squared sigma statistic p.value AIC BIC
0.506 0.506 1.082 6501.82 0 76124.57 76173.43

This model is useful only if the diagnostics support its assumptions. A statistically significant model with poor diagnostics can still produce misleading conclusions, so the diagnostic review is just as important as the coefficient table.

Interpretation of Coefficients

tidy(model, conf.int = TRUE) %>%
  kable(digits = 4)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -9.2451 0.1010 -91.5454 0 -9.4430 -9.0471
log_gfa 1.2558 0.0091 137.3145 0 1.2379 1.2737
energy_star_score -0.0164 0.0003 -61.0980 0 -0.0170 -0.0159
office_building 3.0207 0.1959 15.4198 0 2.6367 3.4046
log_gfa:office_building -0.2836 0.0172 -16.5236 0 -0.3172 -0.2499

Because the response variable is log-transformed, the coefficients are interpreted in terms of proportional (percentage) changes in emissions rather than absolute changes.

Questions for Further Investigation

Some useful next questions would be:

  1. Would a more detailed building-type categorization improve the model?
  2. Would adding variables such as year built or number of floors explain more of the variation?
  3. If heteroscedasticity remains, would robust standard errors or a different response transformation improve the model?
  4. Are there specific influential buildings that should be investigated individually?

Final Reflection

Overall, the expanded model improves on a one-variable regression by incorporating building size, efficiency, and building-use type. These variables are substantively meaningful and provide a more realistic explanation of the factors driving building emissions. In particular, the inclusion of an interaction term allows the relationship between building size and emissions to vary across building types, which better reflects real-world differences.

The diagnostic analysis provides important context for evaluating the validity of the model. The residual plots suggest that the core assumptions are reasonably satisfied, though there is evidence of mild heteroscedasticity. While this does not invalidate the model, it indicates that variability in emissions increases for larger fitted values, and predictions may be less precise in those regions.

A key takeaway from this analysis is that regression should not stop at coefficient interpretation. Even a statistically significant model can produce misleading conclusions if its assumptions are violated. Diagnostic evaluation is therefore essential to determine whether the model is reliable enough to support interpretation.

One remaining question is whether additional variables—such as building age, occupancy patterns, or energy usage characteristics—could further improve model performance and reduce unexplained variation. Exploring these factors would be a valuable next step for strengthening the analysis.