1 Introduction

In this data dive, I analyze the Seattle Building Energy Benchmarking dataset to study which factors may influence building emissions. The assignment requirements are as follows:

Choose a continuous response variable that is valuable in the context of the data.
Choose a categorical explanatory variable and use ANOVA to test whether group means differ.
Choose one continuous explanatory variable and build a simple linear regression model.
Interpret the findings, explain why they matter, and suggest further questions.

The response variable I selected is total_ghg_emissions, which represents the total greenhouse gas emissions produced by a building. I chose this variable because greenhouse gas emissions are highly meaningful in the context of sustainability, climate impact, and building performance. If a city, company, or property owner wants to improve environmental outcomes, this is one of the most important measures to examine.

2 Load Packages and Data

2.0.1 Load data

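# Packages inferred from the functions used in this report
library(tidyverse)  # read_csv(), dplyr verbs, ggplot2
library(janitor)    # clean_names()
library(plotly)     # ggplotly()
library(broom)      # tidy(), glance()

# Fall back to an interactive file picker if the CSV is not in the working directory
data_path <- "Building_Energy_Benchmarking_Data__2015-Present.csv"
if (!file.exists(data_path)) data_path <- file.choose()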

data <- read_csv(data_path, show_col_types = FALSE) %>%
  clean_names()

glimpse(data)

## Rows: 34,699
## Columns: 46
## $ ose_building_id                      <dbl> 1, 2, 3, 5, 8, 9, 10, 11, 12, 13,…
## $ data_year                            <dbl> 2024, 2024, 2024, 2024, 2024, 202…
## $ building_name                        <chr> "MAYFLOWER PARK HOTEL", "PARAMOUN…
## $ building_type                        <chr> "NonResidential", "NonResidential…
## $ tax_parcel_identification_number     <chr> "659000030", "659000220", "659000…
## $ address                              <chr> "405 OLIVE WAY", "724 PINE ST", "…
## $ city                                 <chr> "SEATTLE", "SEATTLE", "SEATTLE", …
## $ state                                <chr> "WA", "WA", "WA", "WA", "WA", "WA…
## $ zip_code                             <dbl> 98101, 98101, 98101, 98101, 98121…
## $ latitude                             <dbl> 47.61220, 47.61307, 47.61367, 47.…
## $ longitude                            <dbl> -122.3380, -122.3336, -122.3382, …
## $ neighborhood                         <chr> "DOWNTOWN", "DOWNTOWN", "DOWNTOWN…
## $ council_district_code                <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 1, 1, 7, …
## $ year_built                           <dbl> 1927, 1996, 1969, 1926, 1980, 199…
## $ numberof_floors                      <dbl> 12, 11, 41, 10, 18, 2, 11, 8, 15,…
## $ numberof_buildings                   <dbl> 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ property_gfa_total                   <dbl> 88434, 103566, 956110, 61320, 175…
## $ property_gfa_buildings               <dbl> 88434, 88502, 759392, 61320, 1135…
## $ property_gfa_parking                 <dbl> 0, 15064, 196718, 0, 62000, 37198…
## $ self_report_gfa_total                <dbl> 115387, 103566, 947059, 61320, 20…
## $ self_report_gfa_buildings            <dbl> 115387, 88502, 827566, 61320, 123…
## $ self_report_parking                  <dbl> 0, 15064, 119493, 0, 80497, 40971…
## $ energystar_score                     <dbl> 59, 85, 71, 50, 87, NA, 10, NA, 5…
## $ site_euiwn_k_btu_sf                  <dbl> 62.2, 71.9, 82.0, 87.2, 97.6, 168…
## $ site_eui_k_btu_sf                    <dbl> 61.7, 71.5, 81.7, 86.0, 97.1, 167…
## $ site_energy_use_k_btu                <dbl> 7113958, 6330664, 67613264, 52739…
## $ site_energy_use_wn_k_btu             <dbl> 7172158, 6362478, 67852608, 53463…
## $ source_euiwn_k_btu_sf                <dbl> 122.9, 128.7, 171.8, 174.7, 167.6…
## $ source_eui_k_btu_sf                  <dbl> 121.4, 128.3, 171.5, 171.4, 167.2…
## $ epa_property_type                    <chr> "Hotel", "Hotel", "Hotel", "Hotel…
## $ largest_property_use_type            <chr> "Hotel", "Hotel", "Hotel", "Hotel…
## $ largest_property_use_type_gfa        <dbl> 115387, 88502, 827566, 61320, 123…
## $ second_largest_property_use_type     <chr> NA, "Parking", "Parking", NA, "Pa…
## $ second_largest_property_use_type_gfa <dbl> NA, 15064, 117783, NA, 68009, 409…
## $ third_largest_property_use_type      <chr> NA, NA, "Swimming Pool", NA, "Swi…
## $ third_largest_property_use_type_gfa  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ electricity_k_wh                     <dbl> 1045040, 787838, 11279080, 796976…
## $ steam_use_k_btu                      <dbl> 1949686, NA, 23256386, 1389935, N…
## $ natural_gas_therms                   <dbl> 15986, 36426, 58726, 11648, 73811…
## $ compliance_status                    <chr> "Not Compliant", "Compliant", "Co…
## $ compliance_issue                     <chr> "Default Data", "No Issue", "No I…
## $ electricity_k_btu                    <dbl> 3565676, 2688104, 38484221, 27192…
## $ natural_gas_k_btu                    <dbl> 1598590, 3642560, 5872650, 116476…
## $ total_ghg_emissions                  <dbl> 263.3, 208.6, 2418.2, 190.1, 417.…
## $ ghg_emissions_intensity              <dbl> 2.98, 2.36, 3.18, 3.10, 3.68, 2.8…
## $ demolished                           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…

3 Data Preparation

I have used the following variables for this analysis:

total_ghg_emissions = response variable
building_type = categorical explanatory variable for ANOVA
property_gfa_total = continuous explanatory variable for regression

I removed rows with missing values in any of these variables and kept only rows with positive square footage and non-negative emissions. Because the dataset contains some extremely large outliers, I also trimmed the top 1% of total_ghg_emissions and of property_gfa_total so that the plots and regression model better represent the general pattern among typical buildings. After filtering, 33,282 of the original 34,699 rows remain.

analysis_df <- data %>%
  select(total_ghg_emissions, building_type, property_gfa_total) %>%
  filter(
    !is.na(total_ghg_emissions),
    !is.na(building_type),
    !is.na(property_gfa_total),
    property_gfa_total > 0,
    total_ghg_emissions >= 0
  )

q_ghg <- quantile(analysis_df$total_ghg_emissions, 0.99, na.rm = TRUE)
q_gfa <- quantile(analysis_df$property_gfa_total, 0.99, na.rm = TRUE)

analysis_df <- analysis_df %>%
  filter(
    total_ghg_emissions <= q_ghg,
    property_gfa_total <= q_gfa
  )

summary(analysis_df)

##  total_ghg_emissions building_type      property_gfa_total
##  Min.   :   0.10     Length:33282       Min.   : 18481    
##  1st Qu.:   7.70     Class :character   1st Qu.: 29380    
##  Median :  31.30     Mode  :character   Median : 45960    
##  Mean   :  81.29                        Mean   : 88622    
##  3rd Qu.:  88.10                        3rd Qu.: 94909    
##  Max.   :1427.80                        Max.   :872409

nrow(analysis_df)

## [1] 33282

4 Response Variable

My response variable is total_ghg_emissions.

This variable is valuable because it directly measures the environmental impact of a building. In a building energy dataset, people may be interested in energy use, efficiency, or carbon impact. Greenhouse gas emissions are especially important because they connect building operations to sustainability and climate-related decision-making.

5 Part 1: ANOVA

5.1 Categorical Explanatory Variable

For the ANOVA portion, I selected building_type as the categorical explanatory variable.

This is a good choice because different types of buildings operate differently. For example, a campus, school, office, or multifamily building may have very different usage patterns, equipment needs, occupancy schedules, and heating/cooling demands. Because of that, I expect building type to influence greenhouse gas emissions.

building_type has only 8 categories, fewer than 10, so I do not need to consolidate categories before running ANOVA. The group sizes are very unequal, though, ranging from 158 to 12,995 rows.

analysis_df %>%
  count(building_type, sort = TRUE)

## # A tibble: 8 × 2
##   building_type            n
##   <chr>                <int>
## 1 NonResidential       12995
## 2 Multifamily LR (1-4) 10249
## 3 Multifamily MR (5-9)  6799
## 4 Multifamily HR (10+)  1217
## 5 SPS-District K-12      928
## 6 Nonresidential COS     641
## 7 Campus                 295
## 8 Nonresidential WA      158

5.2 Null and Alternative Hypotheses

For the ANOVA test:

Null hypothesis (H0): The mean total_ghg_emissions is the same across all building types.
Alternative hypothesis (H1): At least one building type has a different mean total_ghg_emissions.
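
In symbols, the same hypotheses can be written as follows. The notation is introduced here only for illustration: \(\mu_k\) stands for the mean total_ghg_emissions of building type \(k\), with 8 building types in the analysis data.

\[ H_0: \mu_1 = \mu_2 = \cdots = \mu_8 \qquad H_1: \mu_i \neq \mu_j \text{ for at least one pair } i \neq j \]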

5.3 Visualize the Groups

p_box <- ggplot(analysis_df, aes(x = building_type, y = total_ghg_emissions, fill = building_type)) +
  geom_boxplot(alpha = 0.8) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 35, hjust = 1)) +
  labs(
    title = "Total GHG Emissions by Building Type",
    x = "Building Type",
    y = "Total GHG Emissions"
  ) +
  guides(fill = "none")

ggplotly(p_box)

The boxplot suggests that emissions differ across building types. Some categories appear to have much higher centers and wider spreads than others.

5.4 Run the ANOVA Test

anova_model <- aov(total_ghg_emissions ~ building_type, data = analysis_df)
summary(anova_model)

##                  Df    Sum Sq  Mean Sq F value Pr(>F)    
## building_type     7  84527529 12075361   701.9 <2e-16 ***
## Residuals     33274 572431870    17204                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

5.5 Group Means

analysis_df %>%
  group_by(building_type) %>%
  summarise(
    n = n(),
    mean_ghg = mean(total_ghg_emissions),
    median_ghg = median(total_ghg_emissions)
  ) %>%
  arrange(desc(mean_ghg))

## # A tibble: 8 × 4
##   building_type            n mean_ghg median_ghg
##   <chr>                <int>    <dbl>      <dbl>
## 1 Campus                 295    259.       147. 
## 2 Multifamily HR (10+)  1217    234.       160. 
## 3 Nonresidential WA      158    207.       132. 
## 4 Nonresidential COS     641    151.        99.6
## 5 NonResidential       12995    107.        43.5
## 6 SPS-District K-12      928    106.        67.2
## 7 Multifamily MR (5-9)  6799     65.8       36.5
## 8 Multifamily LR (1-4) 10249     26.9        7.7

5.6 ANOVA Interpretation

The ANOVA tests whether average greenhouse gas emissions are the same across all building types. I reject the null hypothesis when the p-value is below 0.05. Here the F statistic is 701.9 on 7 and 33,274 degrees of freedom, and the p-value is below 2e-16.

Based on these results, there is strong evidence that mean greenhouse gas emissions differ by building type. The test shows only that at least one group mean differs; it does not identify which groups differ or by how much.

The group means table shows the size of the gaps: Campus (259) and Multifamily HR (234) have the highest mean emissions, while Multifamily LR (26.9) has the lowest. This makes sense in context because different building types have different sizes, equipment demands, hours of operation, and occupancy patterns.
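
A significant F test does not say how much of the variation building type accounts for. As a rough sketch, the effect size (eta-squared) and the F statistic can be recomputed from the sums of squares printed in the ANOVA table above. The variable names below are my own, and the numbers are copied from that table rather than taken from the model object.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table above
ss_between <- 84527529    # building_type
ss_within  <- 572431870   # Residuals
df_between <- 7
df_within  <- 33274

# Eta-squared: share of total variation associated with building type
eta_sq <- ss_between / (ss_between + ss_within)
round(eta_sq, 3)

# F statistic: ratio of the between-group and within-group mean squares
f_stat <- (ss_between / df_between) / (ss_within / df_within)
round(f_stat, 1)
```

Eta-squared comes out to roughly 0.13, so building type is associated with about 13% of the variation in emissions. The difference in means is clear, but most of the variation lies within building types.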

5.6.1 Why this matters

This result is meaningful for city planners, sustainability teams, and property managers. It suggests that emissions reduction strategies should not assume that all buildings behave the same way. Different building categories may require different energy policies, incentives, or retrofit priorities.

For example, if one category consistently has higher average emissions, that category may be a stronger target for building upgrades, efficiency programs, or deeper investigation.
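
To find out which categories actually differ from each other, one common follow-up is a post-hoc pairwise comparison such as Tukey's HSD. This was not run in this report. The sketch below uses a small simulated data frame (the `toy` data, group labels, and means are all hypothetical) just to show the shape of the call, which is available in base R as TukeyHSD().

```r
# Simulated toy data: three made-up groups with different means (not the Seattle data)
set.seed(42)
toy <- data.frame(
  building_type = factor(rep(c("A", "B", "C"), each = 30)),
  total_ghg_emissions = c(rnorm(30, mean = 30, sd = 10),
                          rnorm(30, mean = 60, sd = 10),
                          rnorm(30, mean = 100, sd = 10))
)

# One-way ANOVA followed by Tukey's honest significant difference test
toy_model <- aov(total_ghg_emissions ~ building_type, data = toy)
tukey <- TukeyHSD(toy_model)

# One row per pair of groups: difference in means, interval bounds, adjusted p-value
tukey$building_type
```

On the real data, the equivalent call would presumably be TukeyHSD(anova_model), which would produce one row for each of the 28 pairs of building types.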

6 Part 2: Simple Linear Regression

6.1 Continuous Explanatory Variable

For the regression model, I selected property_gfa_total as the explanatory variable.

I chose this variable because building size is likely related to total greenhouse gas emissions. In general, larger buildings may require more lighting, heating, cooling, and equipment use, which can increase emissions. A roughly positive relationship is expected.

6.2 Visualize the Relationship

p <- ggplot(analysis_df,
            aes(x = property_gfa_total,
                y = total_ghg_emissions,
                color = building_type)) +
  geom_point(alpha = 0.6, size = 2) +
  geom_smooth(method = "lm", se = TRUE, color = "black") +
  theme_minimal() +
  labs(
    title = "Total GHG Emissions vs Gross Floor Area",
    x = "Property Gross Floor Area",
    y = "Total GHG Emissions",
    color = "Building Type"
  )
ggplotly(p)

The scatterplot shows a positive upward trend, which suggests that larger buildings tend to produce more greenhouse gas emissions. The pattern is not perfect, but it is reasonable enough for a simple linear regression model.

6.3 Build the Regression Model

lm_model <- lm(total_ghg_emissions ~ property_gfa_total, data = analysis_df)
summary(lm_model)

## 
## Call:
## lm(formula = total_ghg_emissions ~ property_gfa_total, data = analysis_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -581.84  -42.38  -29.39   10.49 1322.72 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.043e+01  8.343e-01   24.49   <2e-16 ***
## property_gfa_total 6.867e-04  5.908e-06  116.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 118.5 on 33280 degrees of freedom
## Multiple R-squared:  0.2887, Adjusted R-squared:  0.2887 
## F-statistic: 1.351e+04 on 1 and 33280 DF,  p-value: < 2.2e-16

6.4 Coefficients and Model Fit

tidy(lm_model)

## # A tibble: 2 × 5
##   term                estimate  std.error statistic   p.value
##   <chr>                  <dbl>      <dbl>     <dbl>     <dbl>
## 1 (Intercept)        20.4      0.834           24.5 2.53e-131
## 2 property_gfa_total  0.000687 0.00000591     116.  0

glance(lm_model)

## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df   logLik     AIC     BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>    <dbl>   <dbl>   <dbl>
## 1     0.289         0.289  118.    13510.       0     1 -206141. 412287. 412313.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

6.5 Regression Interpretation

The regression model estimates greenhouse gas emissions using gross floor area.

The estimated coefficients of the regression equation are:

coef(lm_model)

##        (Intercept) property_gfa_total 
##         20.4339469          0.0006867

This can be written in the form:

\[ \hat{y} = b_0 + b_1 x \]

where:

\(\hat{y}\) is the predicted total greenhouse gas emissions
\(b_0\) is the intercept
\(b_1\) is the slope for property_gfa_total

6.5.1 Interpreting the slope

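The slope tells us how much the predicted greenhouse gas emissions change for each additional square foot of gross floor area. The estimated slope is 0.000687, so each additional 1,000 square feet is associated with about 0.69 more units of predicted emissions.

Because the slope is positive and statistically significant (p < 2e-16), the model suggests that larger buildings tend to have higher total greenhouse gas emissions.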
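
As a small worked example, the fitted line can be applied by hand using the coefficients printed in the regression output. The helper `predict_ghg` and the 50,000 square foot building are hypothetical choices for illustration, and the interval uses a normal approximation rather than the exact t quantile.

```r
# Coefficients copied from the regression output above
b0    <- 20.4339469   # intercept
b1    <- 0.0006867    # slope per square foot of gross floor area
se_b1 <- 5.908e-06    # standard error of the slope

# Hypothetical helper: predicted emissions for a given gross floor area
predict_ghg <- function(gfa_sqft) b0 + b1 * gfa_sqft

predict_ghg(50000)   # illustrative 50,000 sq ft building
b1 * 1000            # predicted change per additional 1,000 sq ft

# Approximate 95% confidence interval for the slope (normal approximation)
slope_ci <- b1 + c(-1, 1) * qnorm(0.975) * se_b1
slope_ci
```

The prediction for the illustrative building is about 54.8. With the fitted model object available, predict(lm_model, newdata = data.frame(property_gfa_total = 50000)) and confint(lm_model) give the same quantities directly.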

6.5.2 Interpreting the intercept

The intercept, about 20.4, is the predicted emissions when gross floor area is 0. A building cannot have 0 square feet, and the smallest property in the analysis data is 18,481 square feet, so the intercept is an extrapolation that mainly anchors the fitted line and is not very meaningful on its own in this context.

6.5.3 Interpreting R-squared

The R-squared value tells us how much of the variation in greenhouse gas emissions is explained by gross floor area alone. Here it is 0.289, so building size accounts for about 29% of the variation.

This value shows that building size is a useful but incomplete predictor by itself. The relationship is statistically significant, but roughly 71% of the variation remains unexplained, so other factors such as building type, age, equipment, usage patterns, or energy source also play important roles. The residual standard error of 118.5 is also large relative to the median emissions of 31.3, which means individual predictions can be far off.
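
For a model with a single predictor, R-squared can be cross-checked against the F-statistic and the residual degrees of freedom reported in the regression summary. The sketch below does that with numbers copied from the output above; the variable names are my own.

```r
# Values copied from the regression summary above
f_stat   <- 13510   # F-statistic (printed as 1.351e+04)
df_resid <- 33280   # residual degrees of freedom

# With one predictor, R-squared = F / (F + residual df)
r_squared <- f_stat / (f_stat + df_resid)
round(r_squared, 4)

# Correlation between size and emissions (positive root, since the slope is positive)
r <- sqrt(r_squared)
round(r, 3)

# Share of variation left unexplained by gross floor area
round(1 - r_squared, 3)
```

The implied correlation of roughly 0.54 is consistent with the moderate upward trend in the scatterplot.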

7 What Insight Was Gathered?

This analysis produced two important insights.

7.1 Insight 1: Building type matters

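The ANOVA results show that average greenhouse gas emissions differ across building types. This means emissions vary systematically with building category rather than being spread evenly across categories. Because this is observational data, the result shows an association; building types also differ in size, which may account for part of the gap.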

This is significant because it suggests that building-specific strategies may be more effective than one general solution for all properties.

7.2 Insight 2: Bigger buildings tend to emit more

The regression results show a positive relationship between gross floor area and greenhouse gas emissions. In other words, as building size increases, emissions tend to increase as well.

This is significant because it suggests that larger buildings may be good candidates for targeted sustainability efforts. If a city or organization wants to reduce emissions efficiently, it may make sense to focus on large buildings first.

8 Recommendations in Context

Based on this analysis, I would make the following practical recommendation:

Buildings should not all be treated the same when designing sustainability programs. Since emissions differ by building type, decision-makers should consider category-specific interventions. Also, because larger buildings tend to emit more, large properties may offer the greatest opportunity for emissions reductions through upgrades, retrofits, and operational improvements.

9 Further Questions

This analysis answered some useful questions, but it also raises new ones.

Some additional questions worth investigating are:

Does building age affect greenhouse gas emissions?
Do ENERGY STAR scores help explain emissions differences?
Would a multiple regression model using building size, building type, year built, and energy score perform much better than a simple model?
Are there particular neighborhoods with unusually high emissions even after accounting for building size?
Do some building categories have more variability because of differences in operating hours or equipment type?

These questions would help develop a more complete understanding of what drives building emissions.
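
As a starting point for the multiple regression question, the sketch below compares a size-only model with one that also includes building type. It runs on simulated data, not the Seattle dataset: the `toy` data frame, the three type labels, and every coefficient are made up for illustration.

```r
# Simulated toy data (hypothetical values, not the Seattle data)
set.seed(1)
n <- 300
toy <- data.frame(
  property_gfa_total = runif(n, 20000, 500000),
  building_type = factor(sample(c("Type A", "Type B", "Type C"), n, replace = TRUE))
)
type_shift <- c("Type A" = 0, "Type B" = 50, "Type C" = 150)
toy$total_ghg_emissions <- 20 + 0.0007 * toy$property_gfa_total +
  unname(type_shift[as.character(toy$building_type)]) + rnorm(n, sd = 40)

# Simple model (size only) versus a model that also includes building type
simple_fit <- lm(total_ghg_emissions ~ property_gfa_total, data = toy)
multi_fit  <- lm(total_ghg_emissions ~ property_gfa_total + building_type, data = toy)

# Compare adjusted R-squared, then test whether the extra terms improve the fit
summary(simple_fit)$adj.r.squared
summary(multi_fit)$adj.r.squared
anova(simple_fit, multi_fit)
```

On the real data, the same comparison could be made by adding building_type, year_built, and energystar_score to the formula, keeping in mind that energystar_score contains missing values that would reduce the sample.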

10 Conclusion

In this data dive, I used the Seattle Building Energy Benchmarking dataset to study factors associated with total greenhouse gas emissions.

First, I used ANOVA to test whether emissions differed across building types. The results showed strong evidence that average emissions are not the same across all categories.

Second, I used simple linear regression to model emissions using gross floor area. The model showed a positive relationship, meaning larger buildings tend to produce more greenhouse gas emissions.

Overall, the results suggest that both building type and building size are useful for understanding emissions. This has real-world importance for environmental planning, energy policy, and building management.

Week 8 Data Dive: Regression Modeling

Divya Kapoor

2026-03-09