In this data dive, I build a generalized linear model using logistic
regression to study which buildings are more likely to have high
greenhouse gas emissions. Since logistic regression requires a binary
response variable, I create a new variable called
high_emissions, where buildings with total greenhouse gas
emissions above the median are labeled as 1 and buildings at or below
the median are labeled as 0.
This is a meaningful variable to model because identifying high-emission buildings can help city planners, sustainability teams, and property managers prioritize buildings for energy-efficiency improvements and emissions reduction efforts.
library(tidyverse)
library(janitor)
energy <- read_csv("Building_Energy_Benchmarking_Data__2015-Present.csv", show_col_types = FALSE) %>%
clean_names()
glimpse(energy)
## Rows: 34,699
## Columns: 46
## $ ose_building_id <dbl> 1, 2, 3, 5, 8, 9, 10, 11, 12, 13,…
## $ data_year <dbl> 2024, 2024, 2024, 2024, 2024, 202…
## $ building_name <chr> "MAYFLOWER PARK HOTEL", "PARAMOUN…
## $ building_type <chr> "NonResidential", "NonResidential…
## $ tax_parcel_identification_number <chr> "659000030", "659000220", "659000…
## $ address <chr> "405 OLIVE WAY", "724 PINE ST", "…
## $ city <chr> "SEATTLE", "SEATTLE", "SEATTLE", …
## $ state <chr> "WA", "WA", "WA", "WA", "WA", "WA…
## $ zip_code <dbl> 98101, 98101, 98101, 98101, 98121…
## $ latitude <dbl> 47.61220, 47.61307, 47.61367, 47.…
## $ longitude <dbl> -122.3380, -122.3336, -122.3382, …
## $ neighborhood <chr> "DOWNTOWN", "DOWNTOWN", "DOWNTOWN…
## $ council_district_code <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 1, 1, 7, …
## $ year_built <dbl> 1927, 1996, 1969, 1926, 1980, 199…
## $ numberof_floors <dbl> 12, 11, 41, 10, 18, 2, 11, 8, 15,…
## $ numberof_buildings <dbl> 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ property_gfa_total <dbl> 88434, 103566, 956110, 61320, 175…
## $ property_gfa_buildings <dbl> 88434, 88502, 759392, 61320, 1135…
## $ property_gfa_parking <dbl> 0, 15064, 196718, 0, 62000, 37198…
## $ self_report_gfa_total <dbl> 115387, 103566, 947059, 61320, 20…
## $ self_report_gfa_buildings <dbl> 115387, 88502, 827566, 61320, 123…
## $ self_report_parking <dbl> 0, 15064, 119493, 0, 80497, 40971…
## $ energystar_score <dbl> 59, 85, 71, 50, 87, NA, 10, NA, 5…
## $ site_euiwn_k_btu_sf <dbl> 62.2, 71.9, 82.0, 87.2, 97.6, 168…
## $ site_eui_k_btu_sf <dbl> 61.7, 71.5, 81.7, 86.0, 97.1, 167…
## $ site_energy_use_k_btu <dbl> 7113958, 6330664, 67613264, 52739…
## $ site_energy_use_wn_k_btu <dbl> 7172158, 6362478, 67852608, 53463…
## $ source_euiwn_k_btu_sf <dbl> 122.9, 128.7, 171.8, 174.7, 167.6…
## $ source_eui_k_btu_sf <dbl> 121.4, 128.3, 171.5, 171.4, 167.2…
## $ epa_property_type <chr> "Hotel", "Hotel", "Hotel", "Hotel…
## $ largest_property_use_type <chr> "Hotel", "Hotel", "Hotel", "Hotel…
## $ largest_property_use_type_gfa <dbl> 115387, 88502, 827566, 61320, 123…
## $ second_largest_property_use_type <chr> NA, "Parking", "Parking", NA, "Pa…
## $ second_largest_property_use_type_gfa <dbl> NA, 15064, 117783, NA, 68009, 409…
## $ third_largest_property_use_type <chr> NA, NA, "Swimming Pool", NA, "Swi…
## $ third_largest_property_use_type_gfa <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ electricity_k_wh <dbl> 1045040, 787838, 11279080, 796976…
## $ steam_use_k_btu <dbl> 1949686, NA, 23256386, 1389935, N…
## $ natural_gas_therms <dbl> 15986, 36426, 58726, 11648, 73811…
## $ compliance_status <chr> "Not Compliant", "Compliant", "Co…
## $ compliance_issue <chr> "Default Data", "No Issue", "No I…
## $ electricity_k_btu <dbl> 3565676, 2688104, 38484221, 27192…
## $ natural_gas_k_btu <dbl> 1598590, 3642560, 5872650, 116476…
## $ total_ghg_emissions <dbl> 263.3, 208.6, 2418.2, 190.1, 417.…
## $ ghg_emissions_intensity <dbl> 2.98, 2.36, 3.18, 3.10, 3.68, 2.8…
## $ demolished <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
I created a binary variable called high_emissions. A
building is labeled as 1 if its total_ghg_emissions is
greater than the median and 0 otherwise.
energy <- energy %>%
mutate(
high_emissions = ifelse(
total_ghg_emissions > median(total_ghg_emissions, na.rm = TRUE),
1, 0
)
)
table(energy$high_emissions, useNA = "ifany")
##
## 0 1 <NA>
## 16941 16903 855
This variable is worth modeling because it separates buildings into lower-emission and higher-emission categories, which can be useful for identifying likely high-impact targets for sustainability efforts.
For the logistic regression model, I selected the following explanatory variables:
property_gfa_total: total gross floor areayear_built: the year the building was constructedbuilding_type: the category of buildingI selected these because building size, age, and type are all plausible factors that may influence whether a building has high greenhouse gas emissions.
analysis_df <- energy %>%
select(high_emissions, property_gfa_total, year_built, building_type) %>%
filter(
!is.na(high_emissions),
!is.na(property_gfa_total),
!is.na(year_built),
!is.na(building_type),
property_gfa_total > 0
) %>%
mutate(
high_emissions = factor(high_emissions),
building_type = factor(building_type)
)
summary(analysis_df)
## high_emissions property_gfa_total year_built building_type
## 0:16941 Min. : 18481 Min. :1900 NonResidential :13334
## 1:16903 1st Qu.: 29549 1st Qu.:1953 Multifamily LR (1-4):10250
## Median : 46720 Median :1979 Multifamily MR (5-9): 6799
## Mean : 107763 Mean :1972 Multifamily HR (10+): 1250
## 3rd Qu.: 98668 3rd Qu.:2001 SPS-District K-12 : 928
## Max. :15216474 Max. :2023 Nonresidential COS : 651
## (Other) : 632
ggplot(analysis_df, aes(x = building_type, fill = high_emissions)) +
geom_bar(position = "fill") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 35, hjust = 1)) +
labs(
title = "Proportion of High-Emission Buildings by Building Type",
x = "Building Type",
y = "Proportion",
fill = "High Emissions"
)
This plot visualizes how the proportion of high-emission buildings varies across building types, providing early evidence that emissions are not uniformly distributed and may be influenced by building category.
ggplot(analysis_df, aes(x = property_gfa_total, fill = high_emissions)) +
geom_histogram(position = "identity", alpha = 0.5, bins = 30) +
scale_x_log10() +
theme_minimal() +
labs(
title = "Distribution of Gross Floor Area by Emissions Category",
x = "Property Gross Floor Area (log scale)",
y = "Count",
fill = "High Emissions"
)
This visualization reveals a pattern in which larger buildings are more frequently classified as high-emission, providing early evidence that building size may play a significant role in determining emissions outcomes.
Building a logistic regression model predicting whether a building has high emissions.
glm_model <- glm(
high_emissions ~ property_gfa_total + year_built + building_type,
data = analysis_df,
family = binomial
)
summary(glm_model)
##
## Call:
## glm(formula = high_emissions ~ property_gfa_total + year_built +
## building_type, family = binomial, data = analysis_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.997e+01 8.297e-01 24.072 < 2e-16 ***
## property_gfa_total 1.672e-05 3.150e-07 53.076 < 2e-16 ***
## year_built -9.765e-03 4.082e-04 -23.920 < 2e-16 ***
## building_typeMultifamily HR (10+) -1.420e+00 2.369e-01 -5.995 2.04e-09 ***
## building_typeMultifamily LR (1-4) -2.675e+00 2.208e-01 -12.117 < 2e-16 ***
## building_typeMultifamily MR (5-9) -1.845e+00 2.211e-01 -8.343 < 2e-16 ***
## building_typeNonResidential -1.688e+00 2.204e-01 -7.659 1.88e-14 ***
## building_typeNonresidential COS -4.025e-01 2.407e-01 -1.672 0.09455 .
## building_typeNonresidential WA -6.770e-01 3.106e-01 -2.180 0.02929 *
## building_typeSPS-District K-12 -6.781e-01 2.346e-01 -2.890 0.00385 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 46918 on 33843 degrees of freedom
## Residual deviance: 36052 on 33834 degrees of freedom
## AIC: 36072
##
## Number of Fisher Scoring iterations: 7
In logistic regression, the coefficients are in log-odds units.
building_type, each
coefficient is compared to a reference category.The coefficient for property_gfa_total tells us how the
log-odds of being a high-emission building change as building size
increases.
The coefficient for year_built tells us whether newer or
older buildings are more likely to be high-emission buildings.
The coefficients for building_type tell us how each
building category compares to the reference building type in terms of
the odds of being high-emission.
Odds ratios are easier to interpret than raw log-odds coefficients.
exp(coef(glm_model))
## (Intercept) property_gfa_total
## 4.720742e+08 1.000017e+00
## year_built building_typeMultifamily HR (10+)
## 9.902829e-01 2.416024e-01
## building_typeMultifamily LR (1-4) building_typeMultifamily MR (5-9)
## 6.887645e-02 1.580396e-01
## building_typeNonResidential building_typeNonresidential COS
## 1.849460e-01 6.686565e-01
## building_typeNonresidential WA building_typeSPS-District K-12
## 5.081610e-01 5.075959e-01
An odds ratio greater than 1 indicates increased odds of being a high-emission building, while an odds ratio less than 1 indicates decreased odds.
For example, if the odds ratio for property_gfa_total is
greater than 1, then larger buildings are associated with higher odds of
falling into the high-emissions category.
I will now construct a 95% confidence interval for the coefficient of
property_gfa_total using the standard error from the model
summary.
coef_summary <- summary(glm_model)$coefficients
coef_summary
## Estimate Std. Error z value
## (Intercept) 1.997265e+01 8.296909e-01 24.072394
## property_gfa_total 1.671738e-05 3.149693e-07 53.076218
## year_built -9.764584e-03 4.082180e-04 -23.920021
## building_typeMultifamily HR (10+) -1.420462e+00 2.369463e-01 -5.994869
## building_typeMultifamily LR (1-4) -2.675441e+00 2.207929e-01 -12.117421
## building_typeMultifamily MR (5-9) -1.844910e+00 2.211248e-01 -8.343295
## building_typeNonResidential -1.687692e+00 2.203677e-01 -7.658524
## building_typeNonresidential COS -4.024848e-01 2.407375e-01 -1.671882
## building_typeNonresidential WA -6.769569e-01 3.106004e-01 -2.179511
## building_typeSPS-District K-12 -6.780696e-01 2.346126e-01 -2.890168
## Pr(>|z|)
## (Intercept) 4.865817e-128
## property_gfa_total 0.000000e+00
## year_built 1.896029e-126
## building_typeMultifamily HR (10+) 2.036496e-09
## building_typeMultifamily LR (1-4) 8.540408e-34
## building_typeMultifamily MR (5-9) 7.224842e-17
## building_typeNonResidential 1.880827e-14
## building_typeNonresidential COS 9.454752e-02
## building_typeNonresidential WA 2.929375e-02
## building_typeSPS-District K-12 3.850365e-03
beta_hat <- coef_summary["property_gfa_total", "Estimate"]
se_beta <- coef_summary["property_gfa_total", "Std. Error"]
lower_ci <- beta_hat - 1.96 * se_beta
upper_ci <- beta_hat + 1.96 * se_beta
c(lower_ci, upper_ci)
## [1] 1.610004e-05 1.733472e-05
This confidence interval provides a plausible range for the true effect of gross floor area (property_gfa_total) on the log-odds of a building being classified as high-emission.
If the interval does not include 0, it indicates that the effect of building size is statistically significant at the 95% confidence level, meaning there is strong evidence that gross floor area is associated with the probability of being a high-emission building.
To make the interpretation more intuitive, I exponentiate the interval endpoints to obtain a confidence interval for the odds ratio, which describes how the odds of being a high-emission building change with increases in building size.
exp(c(lower_ci, upper_ci))
## [1] 1.000016 1.000017
This odds-ratio confidence interval tells us the range of multiplicative effects that a one-unit increase in gross floor area has on the odds of being a high-emission building.
I also generated predicted probabilities from the logistic regression model.
analysis_df$predicted_prob <- predict(glm_model, type = "response")
head(analysis_df)
## # A tibble: 6 Ă— 5
## high_emissions property_gfa_total year_built building_type predicted_prob
## <fct> <dbl> <dbl> <fct> <dbl>
## 1 1 88434 1927 NonResidential 0.721
## 2 1 103566 1996 NonResidential 0.629
## 3 1 956110 1969 NonResidential 1.000
## 4 1 61320 1926 NonResidential 0.623
## 5 1 175580 1980 NonResidential 0.868
## 6 1 97288 1999 Nonresidential COS 0.843
ggplot(analysis_df, aes(x = property_gfa_total, y = predicted_prob)) +
geom_point(alpha = 0.3) +
scale_x_log10() +
theme_minimal() +
labs(
title = "Predicted Probability of High Emissions vs Building Size",
x = "Property Gross Floor Area (log scale)",
y = "Predicted Probability of High Emissions"
)
This plot illustrates how the predicted probability of a building being classified as high-emission changes with building size, highlighting a clear upward trend as gross floor area increases.
The visualizations and logistic regression results together suggest that building characteristics such as size, age, and type play an important role in determining whether a building falls into the high-emissions category.
From the model and supporting plots, there is clear evidence that larger buildings are associated with a higher probability of being high-emission buildings. This is consistent with the expectation that larger buildings require more energy for heating, cooling, lighting, and operations. The visualization of predicted probabilities further reinforces this pattern, showing an upward trend as gross floor area increases.
Building type also appears to influence emissions behavior. The distribution plots indicate that some building categories have a higher proportion of high-emission buildings than others. This likely reflects differences in usage patterns, occupancy levels, and energy demands across building types such as offices, hospitals, or residential buildings.
These findings are significant because they suggest that emissions are not randomly distributed across buildings, but are instead influenced by structural and operational characteristics. As a result, targeted sustainability strategies—such as prioritizing larger buildings or focusing on specific building categories—may be more effective than applying uniform policies across all buildings.
This analysis raises several questions that could be explored in future work:
In this data dive, I constructed a binary response variable to classify buildings as high or low greenhouse gas emitters and applied a logistic regression model to examine how building characteristics influence this outcome.
The results demonstrate that variables such as building size, construction year, and building type are meaningfully associated with the probability of a building being classified as high-emission. In particular, the model highlights how structural and operational characteristics can help explain differences in emissions across buildings.
This analysis is significant because it moves beyond simple description and provides a probabilistic framework for identifying high-emission buildings. By estimating the likelihood of high emissions, the model offers practical value for decision-making, such as prioritizing buildings for energy efficiency improvements, policy interventions, or further investigation.
Overall, the generalized linear modeling approach provides a more nuanced understanding of emissions patterns and serves as a foundation for developing more targeted and effective sustainability strategies.