Week 6 Data Dive — Confidence Intervals

1 Introduction
2 Load and clean data
- 2.1 Minimal cleaning (consistent with Week 3–5 style)
3 Pair 1 (Response & Explanatory): Site EUI vs Building Age
4 Pair 2: GHG Emissions Intensity vs an Energy–Emissions Ratio
5 Conclusion

1 Introduction

This data dive uses the Building Energy Benchmarking Data dataset to explore uncertainty in estimates using confidence intervals and to evaluate relationships between variables using visualization and correlation. I have focused on (1) how building age relates to Site Energy Use Intensity (Site EUI) and (2) how GHG emissions intensity relates to an energy–emissions ratio. Confidence intervals help estimate plausible ranges for population means, while correlation and plots help scrutinize whether relationships are stable or driven by outliers.

knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)
library(janitor)
library(scales)
library(forcats)

2 Load and clean data

For RPubs, it is easiest if the CSV is in the same folder as this Rmd. If not found, you will be prompted to select it.

data_path <- "Building_Energy_Benchmarking_Data__2015-Present.csv"
if (!file.exists(data_path)) data_path <- file.choose()

data <- read_csv(data_path, show_col_types = FALSE) %>%
  clean_names()

glimpse(data)

## Rows: 34,699
## Columns: 46
## $ ose_building_id                      <dbl> 1, 2, 3, 5, 8, 9, 10, 11, 12, 13,…
## $ data_year                            <dbl> 2024, 2024, 2024, 2024, 2024, 202…
## $ building_name                        <chr> "MAYFLOWER PARK HOTEL", "PARAMOUN…
## $ building_type                        <chr> "NonResidential", "NonResidential…
## $ tax_parcel_identification_number     <chr> "659000030", "659000220", "659000…
## $ address                              <chr> "405 OLIVE WAY", "724 PINE ST", "…
## $ city                                 <chr> "SEATTLE", "SEATTLE", "SEATTLE", …
## $ state                                <chr> "WA", "WA", "WA", "WA", "WA", "WA…
## $ zip_code                             <dbl> 98101, 98101, 98101, 98101, 98121…
## $ latitude                             <dbl> 47.61220, 47.61307, 47.61367, 47.…
## $ longitude                            <dbl> -122.3380, -122.3336, -122.3382, …
## $ neighborhood                         <chr> "DOWNTOWN", "DOWNTOWN", "DOWNTOWN…
## $ council_district_code                <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 1, 1, 7, …
## $ year_built                           <dbl> 1927, 1996, 1969, 1926, 1980, 199…
## $ numberof_floors                      <dbl> 12, 11, 41, 10, 18, 2, 11, 8, 15,…
## $ numberof_buildings                   <dbl> 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ property_gfa_total                   <dbl> 88434, 103566, 956110, 61320, 175…
## $ property_gfa_buildings               <dbl> 88434, 88502, 759392, 61320, 1135…
## $ property_gfa_parking                 <dbl> 0, 15064, 196718, 0, 62000, 37198…
## $ self_report_gfa_total                <dbl> 115387, 103566, 947059, 61320, 20…
## $ self_report_gfa_buildings            <dbl> 115387, 88502, 827566, 61320, 123…
## $ self_report_parking                  <dbl> 0, 15064, 119493, 0, 80497, 40971…
## $ energystar_score                     <dbl> 59, 85, 71, 50, 87, NA, 10, NA, 5…
## $ site_euiwn_k_btu_sf                  <dbl> 62.2, 71.9, 82.0, 87.2, 97.6, 168…
## $ site_eui_k_btu_sf                    <dbl> 61.7, 71.5, 81.7, 86.0, 97.1, 167…
## $ site_energy_use_k_btu                <dbl> 7113958, 6330664, 67613264, 52739…
## $ site_energy_use_wn_k_btu             <dbl> 7172158, 6362478, 67852608, 53463…
## $ source_euiwn_k_btu_sf                <dbl> 122.9, 128.7, 171.8, 174.7, 167.6…
## $ source_eui_k_btu_sf                  <dbl> 121.4, 128.3, 171.5, 171.4, 167.2…
## $ epa_property_type                    <chr> "Hotel", "Hotel", "Hotel", "Hotel…
## $ largest_property_use_type            <chr> "Hotel", "Hotel", "Hotel", "Hotel…
## $ largest_property_use_type_gfa        <dbl> 115387, 88502, 827566, 61320, 123…
## $ second_largest_property_use_type     <chr> NA, "Parking", "Parking", NA, "Pa…
## $ second_largest_property_use_type_gfa <dbl> NA, 15064, 117783, NA, 68009, 409…
## $ third_largest_property_use_type      <chr> NA, NA, "Swimming Pool", NA, "Swi…
## $ third_largest_property_use_type_gfa  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ electricity_k_wh                     <dbl> 1045040, 787838, 11279080, 796976…
## $ steam_use_k_btu                      <dbl> 1949686, NA, 23256386, 1389935, N…
## $ natural_gas_therms                   <dbl> 15986, 36426, 58726, 11648, 73811…
## $ compliance_status                    <chr> "Not Compliant", "Compliant", "Co…
## $ compliance_issue                     <chr> "Default Data", "No Issue", "No I…
## $ electricity_k_btu                    <dbl> 3565676, 2688104, 38484221, 27192…
## $ natural_gas_k_btu                    <dbl> 1598590, 3642560, 5872650, 116476…
## $ total_ghg_emissions                  <dbl> 263.3, 208.6, 2418.2, 190.1, 417.…
## $ ghg_emissions_intensity              <dbl> 2.98, 2.36, 3.18, 3.10, 3.68, 2.8…
## $ demolished                           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…

2.1 Minimal cleaning (consistent with Week 3–5 style)

df1 <- data %>%
  mutate(
    data_year = as.integer(data_year),
    year_built = as.integer(year_built),
    site_eui_k_btu_sf = as.numeric(site_eui_k_btu_sf),
    ghg_emissions_intensity = as.numeric(ghg_emissions_intensity),
    compliance_status = as.factor(compliance_status),
    largest_property_use_type = as.factor(largest_property_use_type)
  )

3 Pair 1 (Response & Explanatory): Site EUI vs Building Age

3.1 Variables

Response variable (original): site_eui_k_btu_sf
Explanatory variable (mutated): building_age = data_year - year_built

Creating building_age improves interpretability (it expresses time in years instead of raw construction year).

df_age_eui <- df1 %>%
  mutate(
    building_age = data_year - year_built
  ) %>%
  filter(
    !is.na(site_eui_k_btu_sf),
    !is.na(building_age),
    building_age >= 0
  )

3.2 Visualization

r_val <- round(cor(df_age_eui$building_age, df_age_eui$site_eui_k_btu_sf, use = "complete.obs"), 3)

ggplot(df_age_eui, aes(x = building_age, y = site_eui_k_btu_sf)) +
  geom_point(alpha = 0.08, size = 0.9, color = "#A65EB6") +
  geom_smooth(method = "lm", se = TRUE, color = "#6279B8") +
  scale_y_log10(labels = comma) +
  labs(
    title = "Site EUI vs Building Age",
    subtitle = paste("Linear trend with 95% CI band; r =", r_val),
    x = "Building Age (years)",
    y = "Site EUI (kBtu/sf, log10)"
  ) +
  theme_minimal(base_size = 12)

## `geom_smooth()` using formula = 'y ~ x'

3.3 Scrutinize (Insight, significance & further questions)

The plot shows how energy use intensity changes as buildings get older. If the trend line slopes upward, it suggests older buildings tend to have higher Site EUI on average (less energy-efficient), but the wide vertical spread indicates that age alone does not explain energy intensity. Outliers and variability imply that other factors—such as building type, size, or system upgrades likely play a major role. A useful follow-up question is whether the age–EUI relationship becomes stronger within specific building types (e.g., offices only) or after controlling for building size.

3.4 Correlation

cor_age_eui <- cor(df_age_eui$building_age, df_age_eui$site_eui_k_btu_sf, use = "complete.obs")
cor_age_eui

## [1] 0.01223329

Interpretation: The correlation coefficient summarizes the direction and strength of a linear relationship. A small correlation supports what the scatterplot shows: there may be a trend, but it is not strong without considering additional variables.

3.5 95% Confidence interval for the mean Site EUI

ci_site_eui <- t.test(df_age_eui$site_eui_k_btu_sf)
ci_site_eui

## 
##  One Sample t-test
## 
## data:  df_age_eui$site_eui_k_btu_sf
## t = 34.643, df = 33421, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  52.72357 59.04745
## sample estimates:
## mean of x 
##  55.88551

Interpretation: This confidence interval provides a plausible range for the population mean Site EUI. If the interval is wide, the mean estimate is less precise, often due to high variability and outliers.

4 Pair 2: GHG Emissions Intensity vs an Energy–Emissions Ratio

4.1 Variables

Response variable (original): ghg_emissions_intensity
Mutated variable: eui_per_emission = site_eui_k_btu_sf / ghg_emissions_intensity

Because ratios can explode when the denominator is near zero, I filter non-finite values and trim extreme ratios to reduce distortion.

df_ratio_ghg <- df1 %>%
  mutate(
    eui_per_emission = site_eui_k_btu_sf / ghg_emissions_intensity
  ) %>%
  filter(
    !is.na(ghg_emissions_intensity),
    !is.na(site_eui_k_btu_sf),
    !is.na(eui_per_emission),
    is.finite(eui_per_emission),
    ghg_emissions_intensity > 0
  )

# Optional trimming of extreme ratio values to reduce the impact of outliers
ratio_hi <- quantile(df_ratio_ghg$eui_per_emission, 0.99, na.rm = TRUE)
df_ratio_ghg_trim <- df_ratio_ghg %>%
  filter(eui_per_emission <= ratio_hi)

4.2 Visualization

r_val2 <- round(cor(df_ratio_ghg_trim$eui_per_emission,
                    df_ratio_ghg_trim$ghg_emissions_intensity,
                    use = "complete.obs"), 3)

ggplot(df_ratio_ghg_trim, aes(x = eui_per_emission, 
                              y = ghg_emissions_intensity)) +
  geom_point(alpha = 0.25, size = 0.9, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "darkred") +
  scale_x_log10(labels = comma) +
  scale_y_log10(labels = comma) +
  labs(
    title = "GHG Emissions Intensity vs Energy–Emissions Ratio",
    subtitle = paste("Log–log scale with 95% CI band; r =", r_val2),
    x = "EUI per Emission (log scale)",
    y = "GHG Emissions Intensity (log scale)"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold"),
    axis.title = element_text(face = "bold")
  )

## `geom_smooth()` using formula = 'y ~ x'

4.3 Scrutinize (Insight, significance & further questions)

This plot examines whether emissions intensity changes systematically as the energy–emissions ratio changes. Because the ratio includes emissions in the denominator, the relationship can be sensitive to very small emissions values and it can also create mathematical coupling which may inflate or distort correlation. The trimming step helps prevent a small number of extreme ratios from dominating the pattern.

A practical next step is to repeat this analysis within a single property category (e.g., offices only) and compare whether the relationship is consistent across building uses.

4.4 Correlation

cor_ratio_ghg <- cor(df_ratio_ghg_trim$eui_per_emission, df_ratio_ghg_trim$ghg_emissions_intensity, use = "complete.obs")
cor_ratio_ghg

## [1] -0.06338687

4.5 95% Confidence interval for the mean GHG emissions intensity

ci_ghg <- t.test(df_ratio_ghg_trim$ghg_emissions_intensity)
ci_ghg

## 
##  One Sample t-test
## 
## data:  df_ratio_ghg_trim$ghg_emissions_intensity
## t = 14.149, df = 33062, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  1.229067 1.624338
## sample estimates:
## mean of x 
##  1.426703

Interpretation: This interval estimates the population mean GHG emissions intensity for buildings with valid energy and emissions measurements (after trimming). It quantifies uncertainty and should be discussed alongside the distribution shape and outliers.

5 Conclusion

Across both pairs, confidence intervals help translate sample summaries into population-level statements with uncertainty, while correlation and plots reveal how stable (or unstable) relationships are. Building age may show a weak association with energy intensity, but large variability suggests other variables matter. The emissions analysis highlights why careful data preparation is important when using ratios, since extreme values can distort both correlations and visual trends.