1 Introduction

In this data dive, I build a generalized linear model using logistic regression to study which buildings are more likely to have high greenhouse gas emissions. Since logistic regression requires a binary response variable, I create a new variable called high_emissions, where buildings with total greenhouse gas emissions above the median are labeled as 1 and buildings at or below the median are labeled as 0.

This is a meaningful variable to model because identifying high-emission buildings can help city planners, sustainability teams, and property managers prioritize buildings for energy-efficiency improvements and emissions reduction efforts.

2 Load Packages and Data

library(tidyverse)
library(janitor)

energy <- read_csv("Building_Energy_Benchmarking_Data__2015-Present.csv", show_col_types = FALSE) %>%
  clean_names()

glimpse(energy)
## Rows: 34,699
## Columns: 46
## $ ose_building_id                      <dbl> 1, 2, 3, 5, 8, 9, 10, 11, 12, 13,…
## $ data_year                            <dbl> 2024, 2024, 2024, 2024, 2024, 202…
## $ building_name                        <chr> "MAYFLOWER PARK HOTEL", "PARAMOUN…
## $ building_type                        <chr> "NonResidential", "NonResidential…
## $ tax_parcel_identification_number     <chr> "659000030", "659000220", "659000…
## $ address                              <chr> "405 OLIVE WAY", "724 PINE ST", "…
## $ city                                 <chr> "SEATTLE", "SEATTLE", "SEATTLE", …
## $ state                                <chr> "WA", "WA", "WA", "WA", "WA", "WA…
## $ zip_code                             <dbl> 98101, 98101, 98101, 98101, 98121…
## $ latitude                             <dbl> 47.61220, 47.61307, 47.61367, 47.…
## $ longitude                            <dbl> -122.3380, -122.3336, -122.3382, …
## $ neighborhood                         <chr> "DOWNTOWN", "DOWNTOWN", "DOWNTOWN…
## $ council_district_code                <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 1, 1, 7, …
## $ year_built                           <dbl> 1927, 1996, 1969, 1926, 1980, 199…
## $ numberof_floors                      <dbl> 12, 11, 41, 10, 18, 2, 11, 8, 15,…
## $ numberof_buildings                   <dbl> 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ property_gfa_total                   <dbl> 88434, 103566, 956110, 61320, 175…
## $ property_gfa_buildings               <dbl> 88434, 88502, 759392, 61320, 1135…
## $ property_gfa_parking                 <dbl> 0, 15064, 196718, 0, 62000, 37198…
## $ self_report_gfa_total                <dbl> 115387, 103566, 947059, 61320, 20…
## $ self_report_gfa_buildings            <dbl> 115387, 88502, 827566, 61320, 123…
## $ self_report_parking                  <dbl> 0, 15064, 119493, 0, 80497, 40971…
## $ energystar_score                     <dbl> 59, 85, 71, 50, 87, NA, 10, NA, 5…
## $ site_euiwn_k_btu_sf                  <dbl> 62.2, 71.9, 82.0, 87.2, 97.6, 168…
## $ site_eui_k_btu_sf                    <dbl> 61.7, 71.5, 81.7, 86.0, 97.1, 167…
## $ site_energy_use_k_btu                <dbl> 7113958, 6330664, 67613264, 52739…
## $ site_energy_use_wn_k_btu             <dbl> 7172158, 6362478, 67852608, 53463…
## $ source_euiwn_k_btu_sf                <dbl> 122.9, 128.7, 171.8, 174.7, 167.6…
## $ source_eui_k_btu_sf                  <dbl> 121.4, 128.3, 171.5, 171.4, 167.2…
## $ epa_property_type                    <chr> "Hotel", "Hotel", "Hotel", "Hotel…
## $ largest_property_use_type            <chr> "Hotel", "Hotel", "Hotel", "Hotel…
## $ largest_property_use_type_gfa        <dbl> 115387, 88502, 827566, 61320, 123…
## $ second_largest_property_use_type     <chr> NA, "Parking", "Parking", NA, "Pa…
## $ second_largest_property_use_type_gfa <dbl> NA, 15064, 117783, NA, 68009, 409…
## $ third_largest_property_use_type      <chr> NA, NA, "Swimming Pool", NA, "Swi…
## $ third_largest_property_use_type_gfa  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ electricity_k_wh                     <dbl> 1045040, 787838, 11279080, 796976…
## $ steam_use_k_btu                      <dbl> 1949686, NA, 23256386, 1389935, N…
## $ natural_gas_therms                   <dbl> 15986, 36426, 58726, 11648, 73811…
## $ compliance_status                    <chr> "Not Compliant", "Compliant", "Co…
## $ compliance_issue                     <chr> "Default Data", "No Issue", "No I…
## $ electricity_k_btu                    <dbl> 3565676, 2688104, 38484221, 27192…
## $ natural_gas_k_btu                    <dbl> 1598590, 3642560, 5872650, 116476…
## $ total_ghg_emissions                  <dbl> 263.3, 208.6, 2418.2, 190.1, 417.…
## $ ghg_emissions_intensity              <dbl> 2.98, 2.36, 3.18, 3.10, 3.68, 2.8…
## $ demolished                           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…

3 Create a Binary Response Variable

I created a binary variable called high_emissions. A building is labeled as 1 if its total_ghg_emissions is greater than the median and 0 otherwise.

energy <- energy %>%
  mutate(
    high_emissions = ifelse(
      total_ghg_emissions > median(total_ghg_emissions, na.rm = TRUE),
      1, 0
    )
  )

table(energy$high_emissions, useNA = "ifany")
## 
##     0     1  <NA> 
## 16941 16903   855

This variable is worth modeling because it separates buildings into lower-emission and higher-emission categories, which can be useful for identifying likely high-impact targets for sustainability efforts.

4 Select Explanatory Variables and Clean Data

For the logistic regression model, I selected the following explanatory variables:

I selected these because building size, age, and type are all plausible factors that may influence whether a building has high greenhouse gas emissions.

analysis_df <- energy %>%
  select(high_emissions, property_gfa_total, year_built, building_type) %>%
  filter(
    !is.na(high_emissions),
    !is.na(property_gfa_total),
    !is.na(year_built),
    !is.na(building_type),
    property_gfa_total > 0
  ) %>%
  mutate(
    high_emissions = factor(high_emissions),
    building_type = factor(building_type)
  )

summary(analysis_df)
##  high_emissions property_gfa_total   year_built                building_type  
##  0:16941        Min.   :   18481   Min.   :1900   NonResidential      :13334  
##  1:16903        1st Qu.:   29549   1st Qu.:1953   Multifamily LR (1-4):10250  
##                 Median :   46720   Median :1979   Multifamily MR (5-9): 6799  
##                 Mean   :  107763   Mean   :1972   Multifamily HR (10+): 1250  
##                 3rd Qu.:   98668   3rd Qu.:2001   SPS-District K-12   :  928  
##                 Max.   :15216474   Max.   :2023   Nonresidential COS  :  651  
##                                                   (Other)             :  632

5 Exploratory Visualizations

5.1 High Emissions by Building Type

ggplot(analysis_df, aes(x = building_type, fill = high_emissions)) +
  geom_bar(position = "fill") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 35, hjust = 1)) +
  labs(
    title = "Proportion of High-Emission Buildings by Building Type",
    x = "Building Type",
    y = "Proportion",
    fill = "High Emissions"
  )

This plot visualizes how the proportion of high-emission buildings varies across building types, providing early evidence that emissions are not uniformly distributed and may be influenced by building category.

5.2 Building Size vs High Emissions

ggplot(analysis_df, aes(x = property_gfa_total, fill = high_emissions)) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 30) +
  scale_x_log10() +
  theme_minimal() +
  labs(
    title = "Distribution of Gross Floor Area by Emissions Category",
    x = "Property Gross Floor Area (log scale)",
    y = "Count",
    fill = "High Emissions"
  )

This visualization reveals a pattern in which larger buildings are more frequently classified as high-emission, providing early evidence that building size may play a significant role in determining emissions outcomes.

6 Build the Logistic Regression Model

Building a logistic regression model predicting whether a building has high emissions.

glm_model <- glm(
  high_emissions ~ property_gfa_total + year_built + building_type,
  data = analysis_df,
  family = binomial
)

summary(glm_model)
## 
## Call:
## glm(formula = high_emissions ~ property_gfa_total + year_built + 
##     building_type, family = binomial, data = analysis_df)
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                        1.997e+01  8.297e-01  24.072  < 2e-16 ***
## property_gfa_total                 1.672e-05  3.150e-07  53.076  < 2e-16 ***
## year_built                        -9.765e-03  4.082e-04 -23.920  < 2e-16 ***
## building_typeMultifamily HR (10+) -1.420e+00  2.369e-01  -5.995 2.04e-09 ***
## building_typeMultifamily LR (1-4) -2.675e+00  2.208e-01 -12.117  < 2e-16 ***
## building_typeMultifamily MR (5-9) -1.845e+00  2.211e-01  -8.343  < 2e-16 ***
## building_typeNonResidential       -1.688e+00  2.204e-01  -7.659 1.88e-14 ***
## building_typeNonresidential COS   -4.025e-01  2.407e-01  -1.672  0.09455 .  
## building_typeNonresidential WA    -6.770e-01  3.106e-01  -2.180  0.02929 *  
## building_typeSPS-District K-12    -6.781e-01  2.346e-01  -2.890  0.00385 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 46918  on 33843  degrees of freedom
## Residual deviance: 36052  on 33834  degrees of freedom
## AIC: 36072
## 
## Number of Fisher Scoring iterations: 7

7 Interpreting the Coefficients

In logistic regression, the coefficients are in log-odds units.

The coefficient for property_gfa_total tells us how the log-odds of being a high-emission building change as building size increases.

The coefficient for year_built tells us whether newer or older buildings are more likely to be high-emission buildings.

The coefficients for building_type tell us how each building category compares to the reference building type in terms of the odds of being high-emission.

8 Convert Coefficients to Odds Ratios

Odds ratios are easier to interpret than raw log-odds coefficients.

exp(coef(glm_model))
##                       (Intercept)                property_gfa_total 
##                      4.720742e+08                      1.000017e+00 
##                        year_built building_typeMultifamily HR (10+) 
##                      9.902829e-01                      2.416024e-01 
## building_typeMultifamily LR (1-4) building_typeMultifamily MR (5-9) 
##                      6.887645e-02                      1.580396e-01 
##       building_typeNonResidential   building_typeNonresidential COS 
##                      1.849460e-01                      6.686565e-01 
##    building_typeNonresidential WA    building_typeSPS-District K-12 
##                      5.081610e-01                      5.075959e-01

An odds ratio greater than 1 indicates increased odds of being a high-emission building, while an odds ratio less than 1 indicates decreased odds.

For example, if the odds ratio for property_gfa_total is greater than 1, then larger buildings are associated with higher odds of falling into the high-emissions category.

9 Confidence Interval for a Coefficient

I will now construct a 95% confidence interval for the coefficient of property_gfa_total using the standard error from the model summary.

coef_summary <- summary(glm_model)$coefficients
coef_summary
##                                        Estimate   Std. Error    z value
## (Intercept)                        1.997265e+01 8.296909e-01  24.072394
## property_gfa_total                 1.671738e-05 3.149693e-07  53.076218
## year_built                        -9.764584e-03 4.082180e-04 -23.920021
## building_typeMultifamily HR (10+) -1.420462e+00 2.369463e-01  -5.994869
## building_typeMultifamily LR (1-4) -2.675441e+00 2.207929e-01 -12.117421
## building_typeMultifamily MR (5-9) -1.844910e+00 2.211248e-01  -8.343295
## building_typeNonResidential       -1.687692e+00 2.203677e-01  -7.658524
## building_typeNonresidential COS   -4.024848e-01 2.407375e-01  -1.671882
## building_typeNonresidential WA    -6.769569e-01 3.106004e-01  -2.179511
## building_typeSPS-District K-12    -6.780696e-01 2.346126e-01  -2.890168
##                                        Pr(>|z|)
## (Intercept)                       4.865817e-128
## property_gfa_total                 0.000000e+00
## year_built                        1.896029e-126
## building_typeMultifamily HR (10+)  2.036496e-09
## building_typeMultifamily LR (1-4)  8.540408e-34
## building_typeMultifamily MR (5-9)  7.224842e-17
## building_typeNonResidential        1.880827e-14
## building_typeNonresidential COS    9.454752e-02
## building_typeNonresidential WA     2.929375e-02
## building_typeSPS-District K-12     3.850365e-03
beta_hat <- coef_summary["property_gfa_total", "Estimate"]
se_beta <- coef_summary["property_gfa_total", "Std. Error"]

lower_ci <- beta_hat - 1.96 * se_beta
upper_ci <- beta_hat + 1.96 * se_beta

c(lower_ci, upper_ci)
## [1] 1.610004e-05 1.733472e-05

This confidence interval provides a plausible range for the true effect of gross floor area (property_gfa_total) on the log-odds of a building being classified as high-emission.

If the interval does not include 0, it indicates that the effect of building size is statistically significant at the 95% confidence level, meaning there is strong evidence that gross floor area is associated with the probability of being a high-emission building.

To make the interpretation more intuitive, I exponentiate the interval endpoints to obtain a confidence interval for the odds ratio, which describes how the odds of being a high-emission building change with increases in building size.

exp(c(lower_ci, upper_ci))
## [1] 1.000016 1.000017

This odds-ratio confidence interval tells us the range of multiplicative effects that a one-unit increase in gross floor area has on the odds of being a high-emission building.

10 Predicted Probabilities

I also generated predicted probabilities from the logistic regression model.

analysis_df$predicted_prob <- predict(glm_model, type = "response")

head(analysis_df)
## # A tibble: 6 Ă— 5
##   high_emissions property_gfa_total year_built building_type      predicted_prob
##   <fct>                       <dbl>      <dbl> <fct>                       <dbl>
## 1 1                           88434       1927 NonResidential              0.721
## 2 1                          103566       1996 NonResidential              0.629
## 3 1                          956110       1969 NonResidential              1.000
## 4 1                           61320       1926 NonResidential              0.623
## 5 1                          175580       1980 NonResidential              0.868
## 6 1                           97288       1999 Nonresidential COS          0.843

10.1 Visualize Predicted Probability vs Building Size

ggplot(analysis_df, aes(x = property_gfa_total, y = predicted_prob)) +
  geom_point(alpha = 0.3) +
  scale_x_log10() +
  theme_minimal() +
  labs(
    title = "Predicted Probability of High Emissions vs Building Size",
    x = "Property Gross Floor Area (log scale)",
    y = "Predicted Probability of High Emissions"
  )

This plot illustrates how the predicted probability of a building being classified as high-emission changes with building size, highlighting a clear upward trend as gross floor area increases.

11 Insight and Significance

The visualizations and logistic regression results together suggest that building characteristics such as size, age, and type play an important role in determining whether a building falls into the high-emissions category.

From the model and supporting plots, there is clear evidence that larger buildings are associated with a higher probability of being high-emission buildings. This is consistent with the expectation that larger buildings require more energy for heating, cooling, lighting, and operations. The visualization of predicted probabilities further reinforces this pattern, showing an upward trend as gross floor area increases.

Building type also appears to influence emissions behavior. The distribution plots indicate that some building categories have a higher proportion of high-emission buildings than others. This likely reflects differences in usage patterns, occupancy levels, and energy demands across building types such as offices, hospitals, or residential buildings.

These findings are significant because they suggest that emissions are not randomly distributed across buildings, but are instead influenced by structural and operational characteristics. As a result, targeted sustainability strategies—such as prioritizing larger buildings or focusing on specific building categories—may be more effective than applying uniform policies across all buildings.

12 Further Questions

This analysis raises several questions that could be explored in future work:

  1. Would adding more variables, such as ENERGY STAR score or neighborhood, improve the model?
  2. Are there interactions between building type and building size?
  3. Would a different binary cutoff for high emissions change the conclusions?
  4. How well does the model classify buildings overall?

13 Conclusion

In this data dive, I constructed a binary response variable to classify buildings as high or low greenhouse gas emitters and applied a logistic regression model to examine how building characteristics influence this outcome.

The results demonstrate that variables such as building size, construction year, and building type are meaningfully associated with the probability of a building being classified as high-emission. In particular, the model highlights how structural and operational characteristics can help explain differences in emissions across buildings.

This analysis is significant because it moves beyond simple description and provides a probabilistic framework for identifying high-emission buildings. By estimating the likelihood of high emissions, the model offers practical value for decision-making, such as prioritizing buildings for energy efficiency improvements, policy interventions, or further investigation.

Overall, the generalized linear modeling approach provides a more nuanced understanding of emissions patterns and serves as a foundation for developing more targeted and effective sustainability strategies.