In this data dive, I analyze the Seattle Building Energy Benchmarking dataset to study what factors may influence building emissions. For this assignment, the requirements are as follows:
The response variable I selected is
total_ghg_emissions, which represents the
total greenhouse gas emissions produced by a building. I chose this
variable because greenhouse gas emissions are highly meaningful in the
context of sustainability, climate impact, and building performance. If
a city, company, or property owner wants to improve environmental
outcomes, this is one of the most important measures to examine.
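# Packages: tidyverse (read_csv, dplyr, ggplot2), janitor (clean_names), broom (tidy, glance), plotly (ggplotly)
library(tidyverse)
library(janitor)
library(broom)
library(plotly)
data_path <- "Building_Energy_Benchmarking_Data__2015-Present.csv"
if (!file.exists(data_path)) data_path <- file.choose()
data <- read_csv(data_path, show_col_types = FALSE) %>%
  clean_names()
glimpse(data)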
## Rows: 34,699
## Columns: 46
## $ ose_building_id <dbl> 1, 2, 3, 5, 8, 9, 10, 11, 12, 13,…
## $ data_year <dbl> 2024, 2024, 2024, 2024, 2024, 202…
## $ building_name <chr> "MAYFLOWER PARK HOTEL", "PARAMOUN…
## $ building_type <chr> "NonResidential", "NonResidential…
## $ tax_parcel_identification_number <chr> "659000030", "659000220", "659000…
## $ address <chr> "405 OLIVE WAY", "724 PINE ST", "…
## $ city <chr> "SEATTLE", "SEATTLE", "SEATTLE", …
## $ state <chr> "WA", "WA", "WA", "WA", "WA", "WA…
## $ zip_code <dbl> 98101, 98101, 98101, 98101, 98121…
## $ latitude <dbl> 47.61220, 47.61307, 47.61367, 47.…
## $ longitude <dbl> -122.3380, -122.3336, -122.3382, …
## $ neighborhood <chr> "DOWNTOWN", "DOWNTOWN", "DOWNTOWN…
## $ council_district_code <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 1, 1, 7, …
## $ year_built <dbl> 1927, 1996, 1969, 1926, 1980, 199…
## $ numberof_floors <dbl> 12, 11, 41, 10, 18, 2, 11, 8, 15,…
## $ numberof_buildings <dbl> 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ property_gfa_total <dbl> 88434, 103566, 956110, 61320, 175…
## $ property_gfa_buildings <dbl> 88434, 88502, 759392, 61320, 1135…
## $ property_gfa_parking <dbl> 0, 15064, 196718, 0, 62000, 37198…
## $ self_report_gfa_total <dbl> 115387, 103566, 947059, 61320, 20…
## $ self_report_gfa_buildings <dbl> 115387, 88502, 827566, 61320, 123…
## $ self_report_parking <dbl> 0, 15064, 119493, 0, 80497, 40971…
## $ energystar_score <dbl> 59, 85, 71, 50, 87, NA, 10, NA, 5…
## $ site_euiwn_k_btu_sf <dbl> 62.2, 71.9, 82.0, 87.2, 97.6, 168…
## $ site_eui_k_btu_sf <dbl> 61.7, 71.5, 81.7, 86.0, 97.1, 167…
## $ site_energy_use_k_btu <dbl> 7113958, 6330664, 67613264, 52739…
## $ site_energy_use_wn_k_btu <dbl> 7172158, 6362478, 67852608, 53463…
## $ source_euiwn_k_btu_sf <dbl> 122.9, 128.7, 171.8, 174.7, 167.6…
## $ source_eui_k_btu_sf <dbl> 121.4, 128.3, 171.5, 171.4, 167.2…
## $ epa_property_type <chr> "Hotel", "Hotel", "Hotel", "Hotel…
## $ largest_property_use_type <chr> "Hotel", "Hotel", "Hotel", "Hotel…
## $ largest_property_use_type_gfa <dbl> 115387, 88502, 827566, 61320, 123…
## $ second_largest_property_use_type <chr> NA, "Parking", "Parking", NA, "Pa…
## $ second_largest_property_use_type_gfa <dbl> NA, 15064, 117783, NA, 68009, 409…
## $ third_largest_property_use_type <chr> NA, NA, "Swimming Pool", NA, "Swi…
## $ third_largest_property_use_type_gfa <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ electricity_k_wh <dbl> 1045040, 787838, 11279080, 796976…
## $ steam_use_k_btu <dbl> 1949686, NA, 23256386, 1389935, N…
## $ natural_gas_therms <dbl> 15986, 36426, 58726, 11648, 73811…
## $ compliance_status <chr> "Not Compliant", "Compliant", "Co…
## $ compliance_issue <chr> "Default Data", "No Issue", "No I…
## $ electricity_k_btu <dbl> 3565676, 2688104, 38484221, 27192…
## $ natural_gas_k_btu <dbl> 1598590, 3642560, 5872650, 116476…
## $ total_ghg_emissions <dbl> 263.3, 208.6, 2418.2, 190.1, 417.…
## $ ghg_emissions_intensity <dbl> 2.98, 2.36, 3.18, 3.10, 3.68, 2.8…
## $ demolished <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
I used the following variables for this analysis:
total_ghg_emissions = response variable
building_type = categorical explanatory variable for ANOVA
property_gfa_total = continuous explanatory variable for regression
I also removed missing values and kept only rows with positive square footage and non-negative emissions. Because the dataset contains some extremely large outliers, I trimmed the top 1% of total_ghg_emissions and property_gfa_total so the plots and regression model better represent the general pattern among typical buildings.
# Keep only the modeling variables and drop missing or invalid rows
analysis_df <- data %>%
  select(total_ghg_emissions, building_type, property_gfa_total) %>%
  filter(
    !is.na(total_ghg_emissions),
    !is.na(building_type),
    !is.na(property_gfa_total),
    property_gfa_total > 0,
    total_ghg_emissions >= 0
  )
# 99th-percentile cutoffs used to trim extreme outliers
q_ghg <- quantile(analysis_df$total_ghg_emissions, 0.99, na.rm = TRUE)
q_gfa <- quantile(analysis_df$property_gfa_total, 0.99, na.rm = TRUE)
# Drop rows above either cutoff
analysis_df <- analysis_df %>%
  filter(
    total_ghg_emissions <= q_ghg,
    property_gfa_total <= q_gfa
  )
summary(analysis_df)
## total_ghg_emissions building_type property_gfa_total
## Min. : 0.10 Length:33282 Min. : 18481
## 1st Qu.: 7.70 Class :character 1st Qu.: 29380
## Median : 31.30 Mode :character Median : 45960
## Mean : 81.29 Mean : 88622
## 3rd Qu.: 88.10 3rd Qu.: 94909
## Max. :1427.80 Max. :872409
nrow(analysis_df)
## [1] 33282
My response variable is
total_ghg_emissions.
This variable is valuable because it directly measures the environmental impact of a building. In a building energy dataset, people may be interested in energy use, efficiency, or carbon impact. Greenhouse gas emissions are especially important because they connect building operations to sustainability and climate-related decision-making.
For the ANOVA portion, I selected
building_type as the categorical
explanatory variable.
This is a good choice because different types of buildings operate differently. For example, a campus, school, office, or multifamily building may have very different usage patterns, equipment needs, occupancy schedules, and heating/cooling demands. Because of that, I expect building type to influence greenhouse gas emissions.
There are only 8 categories in building_type (fewer than 10), so
I do not need to consolidate categories before running ANOVA.
analysis_df %>%
count(building_type, sort = TRUE)
## # A tibble: 8 × 2
## building_type n
## <chr> <int>
## 1 NonResidential 12995
## 2 Multifamily LR (1-4) 10249
## 3 Multifamily MR (5-9) 6799
## 4 Multifamily HR (10+) 1217
## 5 SPS-District K-12 928
## 6 Nonresidential COS 641
## 7 Campus 295
## 8 Nonresidential WA 158
For the ANOVA test:
Null hypothesis: the mean total_ghg_emissions is the same across all building types.
Alternative hypothesis: at least one building type has a different mean total_ghg_emissions.
p_box <- ggplot(analysis_df, aes(x = building_type, y = total_ghg_emissions, fill = building_type)) +
  geom_boxplot(alpha = 0.8) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 35, hjust = 1)) +
  labs(
    title = "Total GHG Emissions by Building Type",
    x = "Building Type",
    y = "Total GHG Emissions"
  ) +
  guides(fill = "none")
ggplotly(p_box)
The boxplot suggests that emissions differ across building types. Some categories appear to have much higher centers and wider spreads than others.
anova_model <- aov(total_ghg_emissions ~ building_type, data = analysis_df)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## building_type 7 84527529 12075361 701.9 <2e-16 ***
## Residuals 33274 572431870 17204
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
analysis_df %>%
group_by(building_type) %>%
summarise(
n = n(),
mean_ghg = mean(total_ghg_emissions),
median_ghg = median(total_ghg_emissions)
) %>%
arrange(desc(mean_ghg))
## # A tibble: 8 × 4
## building_type n mean_ghg median_ghg
## <chr> <int> <dbl> <dbl>
## 1 Campus 295 259. 147.
## 2 Multifamily HR (10+) 1217 234. 160.
## 3 Nonresidential WA 158 207. 132.
## 4 Nonresidential COS 641 151. 99.6
## 5 NonResidential 12995 107. 43.5
## 6 SPS-District K-12 928 106. 67.2
## 7 Multifamily MR (5-9) 6799 65.8 36.5
## 8 Multifamily LR (1-4) 10249 26.9 7.7
The ANOVA output tests whether average greenhouse gas emissions are the same across all building types. If the p-value is very small (less than 0.05), then I reject the null hypothesis.
Based on the ANOVA results (F = 701.9 on 7 and 33,274 degrees of freedom, p < 2e-16), there is strong evidence that mean greenhouse gas emissions differ for at least one building type. This means building type appears to matter when explaining how much a building emits.
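As a rough check on the ANOVA table above, the F statistic and an effect size can be recomputed from the reported sums of squares. The sketch below hard-codes the values printed by summary(anova_model). The variable names are my own, and eta-squared is an extra summary that the original output does not report.

```r
# Values copied from the summary(anova_model) output above
ss_between <- 84527529   # Sum Sq for building_type
ss_within  <- 572431870  # Sum Sq for Residuals
df_between <- 7
df_within  <- 33274

# F = mean square between groups / mean square within groups
f_stat <- (ss_between / df_between) / (ss_within / df_within)

# Upper-tail p-value from the F distribution
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

# Eta-squared: share of total variation attributable to building type
eta_sq <- ss_between / (ss_between + ss_within)

round(f_stat, 1)   # about 701.9, matching the table
round(eta_sq, 3)   # about 0.129
```

Under these numbers, building type accounts for roughly 13% of the variation in emissions, so the group differences are statistically clear but leave most of the variation unexplained.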
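From the summary table and boxplot, some building types tend to have much higher emissions than others. Campus (mean about 259) and Multifamily HR (10+) (mean about 234) have the highest average emissions, while Multifamily LR (1-4) has the lowest (mean about 26.9). This makes sense in context because different building types have different sizes, equipment demands, hours of operation, and occupancy patterns.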
This result is meaningful for city planners, sustainability teams, and property managers. It suggests that emissions reduction strategies should not assume that all buildings behave the same way. Different building categories may require different energy policies, incentives, or retrofit priorities.
For example, if one category consistently has higher average emissions, that category may be a stronger target for building upgrades, efficiency programs, or deeper investigation.
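A significant ANOVA does not say which categories differ from which. One common follow-up, which this analysis does not run, is Tukey's HSD via base R's TukeyHSD(). The sketch below uses simulated toy data with three made-up groups (TypeA, TypeB, TypeC are hypothetical and are not Seattle categories) only to show the shape of the call and its output.

```r
set.seed(42)

# Toy data: three hypothetical building types with different true means
toy_df <- data.frame(
  building_type = factor(rep(c("TypeA", "TypeB", "TypeC"), each = 30)),
  total_ghg_emissions = c(
    rnorm(30, mean = 30,  sd = 10),
    rnorm(30, mean = 60,  sd = 10),
    rnorm(30, mean = 200, sd = 10)
  )
)

toy_anova <- aov(total_ghg_emissions ~ building_type, data = toy_df)

# Pairwise differences in group means with family-wise adjusted p-values
toy_tukey <- TukeyHSD(toy_anova)

# One row per pair of groups: diff, lwr, upr, p adj
toy_tukey$building_type
```

On the real data the analogous call would be TukeyHSD(anova_model). Since building_type is read in as character, it may be safest to convert it to a factor before fitting.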
For the regression model, I selected
property_gfa_total as the explanatory
variable.
I chose this variable because building size is likely related to total greenhouse gas emissions. In general, larger buildings may require more lighting, heating, cooling, and equipment use, which can increase emissions. A roughly positive relationship is expected.
p <- ggplot(analysis_df,
            aes(x = property_gfa_total,
                y = total_ghg_emissions,
                color = building_type)) +
  geom_point(alpha = 0.6, size = 2) +
  geom_smooth(method = "lm", se = TRUE, color = "black") +
  theme_minimal() +
  labs(
    title = "Total GHG Emissions vs Gross Floor Area",
    x = "Property Gross Floor Area",
    y = "Total GHG Emissions",
    color = "Building Type"
  )
ggplotly(p)
The scatterplot shows an upward trend, which suggests that larger buildings tend to produce more greenhouse gas emissions. The points are widely scattered around the line, but the overall pattern is reasonable enough for a simple linear regression model.
lm_model <- lm(total_ghg_emissions ~ property_gfa_total, data = analysis_df)
summary(lm_model)
##
## Call:
## lm(formula = total_ghg_emissions ~ property_gfa_total, data = analysis_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -581.84 -42.38 -29.39 10.49 1322.72
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.043e+01 8.343e-01 24.49 <2e-16 ***
## property_gfa_total 6.867e-04 5.908e-06 116.23 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 118.5 on 33280 degrees of freedom
## Multiple R-squared: 0.2887, Adjusted R-squared: 0.2887
## F-statistic: 1.351e+04 on 1 and 33280 DF, p-value: < 2.2e-16
tidy(lm_model)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 20.4 0.834 24.5 2.53e-131
## 2 property_gfa_total 0.000687 0.00000591 116. 0
glance(lm_model)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.289 0.289 118. 13510. 0 1 -206141. 412287. 412313.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
The regression model estimates greenhouse gas emissions using gross floor area.
The regression equation is:
coef(lm_model)
## (Intercept) property_gfa_total
## 20.4339469 0.0006867
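This can be written in the form:
\[ \hat{y} = b_0 + b_1 x \]
where:
\(\hat{y}\) = predicted total_ghg_emissions
\(x\) = property_gfa_total
\(b_0\) = intercept
\(b_1\) = slope
The slope tells us how much the predicted greenhouse gas emissions increase for each one-unit increase in gross floor area. Here the slope is about 0.000687, so each additional 1,000 square feet of floor area is associated with roughly 0.69 more units of predicted emissions.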
Because the slope is positive, the model suggests that larger buildings tend to have higher total greenhouse gas emissions.
The intercept is the predicted emissions when gross floor area is 0. In practice, a building cannot really have 0 square feet, so the intercept is mainly part of the mathematical model and is not very meaningful on its own in this context.
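To make the slope and intercept concrete, the fitted equation can be applied by hand. This is a minimal sketch that hard-codes the coefficients shown by coef(lm_model). The helper predict_ghg() and the 100,000 square foot building are hypothetical, chosen only for illustration.

```r
# Coefficients copied from the coef(lm_model) output above
b0 <- 20.4339469   # intercept
b1 <- 0.0006867    # slope per square foot (rounded as displayed)

# Hypothetical helper: predicted emissions for a given gross floor area
predict_ghg <- function(gfa) b0 + b1 * gfa

# Predicted emissions for a hypothetical 100,000 sq ft building
predict_ghg(100000)

# Change in predicted emissions for 100,000 additional sq ft
predict_ghg(200000) - predict_ghg(100000)
```

With the fitted model object available, predict(lm_model, newdata = data.frame(property_gfa_total = 100000)) should give essentially the same prediction without copying rounded coefficients.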
The R-squared value tells us how much of the variation in greenhouse gas emissions is explained by gross floor area alone. Here it is about 0.289, so building size accounts for roughly 29% of the variation.
This value is helpful because it shows whether building size is a strong predictor by itself. The relationship is statistically significant, but a moderate R-squared like this one means that other factors also play important roles, such as building type, age, equipment, usage patterns, or energy source.
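Two identities that hold for simple linear regression help tie the reported numbers together: the correlation between the two variables is the square root of R-squared (taking the sign of the slope), and the model F statistic equals the square of the slope's t statistic. The sketch below checks both using values copied from the regression output; the variable names are my own.

```r
r_squared <- 0.2887   # Multiple R-squared from summary(lm_model)
t_slope   <- 116.23   # t value for property_gfa_total
f_model   <- 13510    # F-statistic (shown as 1.351e+04)

# With one predictor and a positive slope, correlation = sqrt(R-squared)
r <- sqrt(r_squared)

# With one predictor, F equals the slope's t statistic squared
t_squared <- t_slope^2

round(r, 3)        # about 0.537
round(t_squared)   # close to f_model, up to rounding in the printed output
```

A correlation of about 0.54 is consistent with the scatterplot: a clear upward trend with a lot of spread around the line.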
This analysis produced two important insights.
The ANOVA results show that average greenhouse gas emissions differ across building types. This means emissions are not randomly distributed across categories. Instead, the type of building appears to influence environmental impact.
This is significant because it suggests that building-specific strategies may be more effective than one general solution for all properties.
The regression results show a positive relationship between gross floor area and greenhouse gas emissions. In other words, as building size increases, emissions tend to increase as well.
This is significant because it suggests that larger buildings may be good candidates for targeted sustainability efforts. If a city or organization wants to reduce emissions efficiently, it may make sense to focus on large buildings first.
Based on this analysis, I would make the following practical recommendation:
Buildings should not all be treated the same when designing sustainability programs. Since emissions differ by building type, decision-makers should consider category-specific interventions. Also, because larger buildings tend to emit more, large properties may offer the greatest opportunity for emissions reductions through upgrades, retrofits, and operational improvements.
This analysis answered some useful questions, but it also raises new ones.
Some additional questions worth investigating are:
These questions would help develop a more complete understanding of what drives building emissions.
In this data dive, I used the Seattle Building Energy Benchmarking dataset to study factors associated with total greenhouse gas emissions.
First, I used ANOVA to test whether emissions differed across building types. The results showed strong evidence that average emissions are not the same across all categories.
Second, I used simple linear regression to model emissions using gross floor area. The model showed a positive relationship, meaning larger buildings tend to produce more greenhouse gas emissions.
Overall, the results suggest that both building type and building size are useful for understanding emissions. This has real-world importance for environmental planning, energy policy, and building management.