This data dive uses the Building Energy Benchmarking Data dataset to explore uncertainty in estimates using confidence intervals and to evaluate relationships between variables using visualization and correlation. I have focused on (1) how building age relates to Site Energy Use Intensity (Site EUI) and (2) how GHG emissions intensity relates to an energy–emissions ratio. Confidence intervals help estimate plausible ranges for population means, while correlation and plots help scrutinize whether relationships are stable or driven by outliers.
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(janitor)
library(scales)
library(forcats)
For RPubs, it is easiest if the CSV is in the same folder as this Rmd. If not found, you will be prompted to select it.
data_path <- "Building_Energy_Benchmarking_Data__2015-Present.csv"
if (!file.exists(data_path)) data_path <- file.choose()
data <- read_csv(data_path, show_col_types = FALSE) %>%
clean_names()
glimpse(data)
## Rows: 34,699
## Columns: 46
## $ ose_building_id <dbl> 1, 2, 3, 5, 8, 9, 10, 11, 12, 13,…
## $ data_year <dbl> 2024, 2024, 2024, 2024, 2024, 202…
## $ building_name <chr> "MAYFLOWER PARK HOTEL", "PARAMOUN…
## $ building_type <chr> "NonResidential", "NonResidential…
## $ tax_parcel_identification_number <chr> "659000030", "659000220", "659000…
## $ address <chr> "405 OLIVE WAY", "724 PINE ST", "…
## $ city <chr> "SEATTLE", "SEATTLE", "SEATTLE", …
## $ state <chr> "WA", "WA", "WA", "WA", "WA", "WA…
## $ zip_code <dbl> 98101, 98101, 98101, 98101, 98121…
## $ latitude <dbl> 47.61220, 47.61307, 47.61367, 47.…
## $ longitude <dbl> -122.3380, -122.3336, -122.3382, …
## $ neighborhood <chr> "DOWNTOWN", "DOWNTOWN", "DOWNTOWN…
## $ council_district_code <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 1, 1, 7, …
## $ year_built <dbl> 1927, 1996, 1969, 1926, 1980, 199…
## $ numberof_floors <dbl> 12, 11, 41, 10, 18, 2, 11, 8, 15,…
## $ numberof_buildings <dbl> 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ property_gfa_total <dbl> 88434, 103566, 956110, 61320, 175…
## $ property_gfa_buildings <dbl> 88434, 88502, 759392, 61320, 1135…
## $ property_gfa_parking <dbl> 0, 15064, 196718, 0, 62000, 37198…
## $ self_report_gfa_total <dbl> 115387, 103566, 947059, 61320, 20…
## $ self_report_gfa_buildings <dbl> 115387, 88502, 827566, 61320, 123…
## $ self_report_parking <dbl> 0, 15064, 119493, 0, 80497, 40971…
## $ energystar_score <dbl> 59, 85, 71, 50, 87, NA, 10, NA, 5…
## $ site_euiwn_k_btu_sf <dbl> 62.2, 71.9, 82.0, 87.2, 97.6, 168…
## $ site_eui_k_btu_sf <dbl> 61.7, 71.5, 81.7, 86.0, 97.1, 167…
## $ site_energy_use_k_btu <dbl> 7113958, 6330664, 67613264, 52739…
## $ site_energy_use_wn_k_btu <dbl> 7172158, 6362478, 67852608, 53463…
## $ source_euiwn_k_btu_sf <dbl> 122.9, 128.7, 171.8, 174.7, 167.6…
## $ source_eui_k_btu_sf <dbl> 121.4, 128.3, 171.5, 171.4, 167.2…
## $ epa_property_type <chr> "Hotel", "Hotel", "Hotel", "Hotel…
## $ largest_property_use_type <chr> "Hotel", "Hotel", "Hotel", "Hotel…
## $ largest_property_use_type_gfa <dbl> 115387, 88502, 827566, 61320, 123…
## $ second_largest_property_use_type <chr> NA, "Parking", "Parking", NA, "Pa…
## $ second_largest_property_use_type_gfa <dbl> NA, 15064, 117783, NA, 68009, 409…
## $ third_largest_property_use_type <chr> NA, NA, "Swimming Pool", NA, "Swi…
## $ third_largest_property_use_type_gfa <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ electricity_k_wh <dbl> 1045040, 787838, 11279080, 796976…
## $ steam_use_k_btu <dbl> 1949686, NA, 23256386, 1389935, N…
## $ natural_gas_therms <dbl> 15986, 36426, 58726, 11648, 73811…
## $ compliance_status <chr> "Not Compliant", "Compliant", "Co…
## $ compliance_issue <chr> "Default Data", "No Issue", "No I…
## $ electricity_k_btu <dbl> 3565676, 2688104, 38484221, 27192…
## $ natural_gas_k_btu <dbl> 1598590, 3642560, 5872650, 116476…
## $ total_ghg_emissions <dbl> 263.3, 208.6, 2418.2, 190.1, 417.…
## $ ghg_emissions_intensity <dbl> 2.98, 2.36, 3.18, 3.10, 3.68, 2.8…
## $ demolished <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
df1 <- data %>%
mutate(
data_year = as.integer(data_year),
year_built = as.integer(year_built),
site_eui_k_btu_sf = as.numeric(site_eui_k_btu_sf),
ghg_emissions_intensity = as.numeric(ghg_emissions_intensity),
compliance_status = as.factor(compliance_status),
largest_property_use_type = as.factor(largest_property_use_type)
)
site_eui_k_btu_sfbuilding_age = data_year - year_builtCreating building_age improves interpretability (it
expresses time in years instead of raw construction year).
df_age_eui <- df1 %>%
mutate(
building_age = data_year - year_built
) %>%
filter(
!is.na(site_eui_k_btu_sf),
!is.na(building_age),
building_age >= 0
)
r_val <- round(cor(df_age_eui$building_age, df_age_eui$site_eui_k_btu_sf, use = "complete.obs"), 3)
ggplot(df_age_eui, aes(x = building_age, y = site_eui_k_btu_sf)) +
geom_point(alpha = 0.08, size = 0.9, color = "#A65EB6") +
geom_smooth(method = "lm", se = TRUE, color = "#6279B8") +
scale_y_log10(labels = comma) +
labs(
title = "Site EUI vs Building Age",
subtitle = paste("Linear trend with 95% CI band; r =", r_val),
x = "Building Age (years)",
y = "Site EUI (kBtu/sf, log10)"
) +
theme_minimal(base_size = 12)
## `geom_smooth()` using formula = 'y ~ x'
The plot shows how energy use intensity changes as buildings get older. If the trend line slopes upward, it suggests older buildings tend to have higher Site EUI on average (less energy-efficient), but the wide vertical spread indicates that age alone does not explain energy intensity. Outliers and variability imply that other factors—such as building type, size, or system upgrades likely play a major role. A useful follow-up question is whether the age–EUI relationship becomes stronger within specific building types (e.g., offices only) or after controlling for building size.
cor_age_eui <- cor(df_age_eui$building_age, df_age_eui$site_eui_k_btu_sf, use = "complete.obs")
cor_age_eui
## [1] 0.01223329
Interpretation: The correlation coefficient summarizes the direction and strength of a linear relationship. A small correlation supports what the scatterplot shows: there may be a trend, but it is not strong without considering additional variables.
ci_site_eui <- t.test(df_age_eui$site_eui_k_btu_sf)
ci_site_eui
##
## One Sample t-test
##
## data: df_age_eui$site_eui_k_btu_sf
## t = 34.643, df = 33421, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 52.72357 59.04745
## sample estimates:
## mean of x
## 55.88551
Interpretation: This confidence interval provides a plausible range for the population mean Site EUI. If the interval is wide, the mean estimate is less precise, often due to high variability and outliers.
ghg_emissions_intensityeui_per_emission = site_eui_k_btu_sf / ghg_emissions_intensityBecause ratios can explode when the denominator is near zero, I filter non-finite values and trim extreme ratios to reduce distortion.
df_ratio_ghg <- df1 %>%
mutate(
eui_per_emission = site_eui_k_btu_sf / ghg_emissions_intensity
) %>%
filter(
!is.na(ghg_emissions_intensity),
!is.na(site_eui_k_btu_sf),
!is.na(eui_per_emission),
is.finite(eui_per_emission),
ghg_emissions_intensity > 0
)
# Optional trimming of extreme ratio values to reduce the impact of outliers
ratio_hi <- quantile(df_ratio_ghg$eui_per_emission, 0.99, na.rm = TRUE)
df_ratio_ghg_trim <- df_ratio_ghg %>%
filter(eui_per_emission <= ratio_hi)
r_val2 <- round(cor(df_ratio_ghg_trim$eui_per_emission,
df_ratio_ghg_trim$ghg_emissions_intensity,
use = "complete.obs"), 3)
ggplot(df_ratio_ghg_trim, aes(x = eui_per_emission,
y = ghg_emissions_intensity)) +
geom_point(alpha = 0.25, size = 0.9, color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, color = "darkred") +
scale_x_log10(labels = comma) +
scale_y_log10(labels = comma) +
labs(
title = "GHG Emissions Intensity vs Energy–Emissions Ratio",
subtitle = paste("Log–log scale with 95% CI band; r =", r_val2),
x = "EUI per Emission (log scale)",
y = "GHG Emissions Intensity (log scale)"
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold"),
axis.title = element_text(face = "bold")
)
## `geom_smooth()` using formula = 'y ~ x'
This plot examines whether emissions intensity changes systematically as the energy–emissions ratio changes. Because the ratio includes emissions in the denominator, the relationship can be sensitive to very small emissions values and it can also create mathematical coupling which may inflate or distort correlation. The trimming step helps prevent a small number of extreme ratios from dominating the pattern.
A practical next step is to repeat this analysis within a single property category (e.g., offices only) and compare whether the relationship is consistent across building uses.
cor_ratio_ghg <- cor(df_ratio_ghg_trim$eui_per_emission, df_ratio_ghg_trim$ghg_emissions_intensity, use = "complete.obs")
cor_ratio_ghg
## [1] -0.06338687
ci_ghg <- t.test(df_ratio_ghg_trim$ghg_emissions_intensity)
ci_ghg
##
## One Sample t-test
##
## data: df_ratio_ghg_trim$ghg_emissions_intensity
## t = 14.149, df = 33062, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 1.229067 1.624338
## sample estimates:
## mean of x
## 1.426703
Interpretation: This interval estimates the population mean GHG emissions intensity for buildings with valid energy and emissions measurements (after trimming). It quantifies uncertainty and should be discussed alongside the distribution shape and outliers.
Across both pairs, confidence intervals help translate sample summaries into population-level statements with uncertainty, while correlation and plots reveal how stable (or unstable) relationships are. Building age may show a weak association with energy intensity, but large variability suggests other variables matter. The emissions analysis highlights why careful data preparation is important when using ratios, since extreme values can distort both correlations and visual trends.