Week 7 Data Dive — Hypothesis Testing

1 Introduction
- 1.1 Load data
- 1.2 Minimal cleaning
2 Hypothesis Test 1 — Neyman–Pearson Framework
3 Hypothesis Test 2 — Fisher’s Significance Testing Framework
4 Conclusion

1 Introduction

This report applies hypothesis testing to the Building Energy Benchmarking dataset to assess whether observed group differences are statistically significant or could plausibly arise from random sampling variation. Building on earlier exploratory analysis, it examines two comparisons: Site Energy Use Intensity (Site EUI) across compliance status groups, and GHG emissions intensity between older and newer buildings. Each comparison uses a different statistical framework: the Neyman–Pearson framework (decision-based with a chosen \(\alpha\)) and Fisher’s significance testing (evidence via a p-value). Formal tests give the conclusions a firmer statistical basis than visual comparison alone.

knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)
library(janitor)
library(scales)
library(forcats)

1.1 Load data

data_path <- "Building_Energy_Benchmarking_Data__2015-Present.csv"
if (!file.exists(data_path)) data_path <- file.choose()

data <- read_csv(data_path, show_col_types = FALSE) %>%
  clean_names()

glimpse(data)

## Rows: 34,699
## Columns: 46
## $ ose_building_id                      <dbl> 1, 2, 3, 5, 8, 9, 10, 11, 12, 13,…
## $ data_year                            <dbl> 2024, 2024, 2024, 2024, 2024, 202…
## $ building_name                        <chr> "MAYFLOWER PARK HOTEL", "PARAMOUN…
## $ building_type                        <chr> "NonResidential", "NonResidential…
## $ tax_parcel_identification_number     <chr> "659000030", "659000220", "659000…
## $ address                              <chr> "405 OLIVE WAY", "724 PINE ST", "…
## $ city                                 <chr> "SEATTLE", "SEATTLE", "SEATTLE", …
## $ state                                <chr> "WA", "WA", "WA", "WA", "WA", "WA…
## $ zip_code                             <dbl> 98101, 98101, 98101, 98101, 98121…
## $ latitude                             <dbl> 47.61220, 47.61307, 47.61367, 47.…
## $ longitude                            <dbl> -122.3380, -122.3336, -122.3382, …
## $ neighborhood                         <chr> "DOWNTOWN", "DOWNTOWN", "DOWNTOWN…
## $ council_district_code                <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 1, 1, 7, …
## $ year_built                           <dbl> 1927, 1996, 1969, 1926, 1980, 199…
## $ numberof_floors                      <dbl> 12, 11, 41, 10, 18, 2, 11, 8, 15,…
## $ numberof_buildings                   <dbl> 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ property_gfa_total                   <dbl> 88434, 103566, 956110, 61320, 175…
## $ property_gfa_buildings               <dbl> 88434, 88502, 759392, 61320, 1135…
## $ property_gfa_parking                 <dbl> 0, 15064, 196718, 0, 62000, 37198…
## $ self_report_gfa_total                <dbl> 115387, 103566, 947059, 61320, 20…
## $ self_report_gfa_buildings            <dbl> 115387, 88502, 827566, 61320, 123…
## $ self_report_parking                  <dbl> 0, 15064, 119493, 0, 80497, 40971…
## $ energystar_score                     <dbl> 59, 85, 71, 50, 87, NA, 10, NA, 5…
## $ site_euiwn_k_btu_sf                  <dbl> 62.2, 71.9, 82.0, 87.2, 97.6, 168…
## $ site_eui_k_btu_sf                    <dbl> 61.7, 71.5, 81.7, 86.0, 97.1, 167…
## $ site_energy_use_k_btu                <dbl> 7113958, 6330664, 67613264, 52739…
## $ site_energy_use_wn_k_btu             <dbl> 7172158, 6362478, 67852608, 53463…
## $ source_euiwn_k_btu_sf                <dbl> 122.9, 128.7, 171.8, 174.7, 167.6…
## $ source_eui_k_btu_sf                  <dbl> 121.4, 128.3, 171.5, 171.4, 167.2…
## $ epa_property_type                    <chr> "Hotel", "Hotel", "Hotel", "Hotel…
## $ largest_property_use_type            <chr> "Hotel", "Hotel", "Hotel", "Hotel…
## $ largest_property_use_type_gfa        <dbl> 115387, 88502, 827566, 61320, 123…
## $ second_largest_property_use_type     <chr> NA, "Parking", "Parking", NA, "Pa…
## $ second_largest_property_use_type_gfa <dbl> NA, 15064, 117783, NA, 68009, 409…
## $ third_largest_property_use_type      <chr> NA, NA, "Swimming Pool", NA, "Swi…
## $ third_largest_property_use_type_gfa  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ electricity_k_wh                     <dbl> 1045040, 787838, 11279080, 796976…
## $ steam_use_k_btu                      <dbl> 1949686, NA, 23256386, 1389935, N…
## $ natural_gas_therms                   <dbl> 15986, 36426, 58726, 11648, 73811…
## $ compliance_status                    <chr> "Not Compliant", "Compliant", "Co…
## $ compliance_issue                     <chr> "Default Data", "No Issue", "No I…
## $ electricity_k_btu                    <dbl> 3565676, 2688104, 38484221, 27192…
## $ natural_gas_k_btu                    <dbl> 1598590, 3642560, 5872650, 116476…
## $ total_ghg_emissions                  <dbl> 263.3, 208.6, 2418.2, 190.1, 417.…
## $ ghg_emissions_intensity              <dbl> 2.98, 2.36, 3.18, 3.10, 3.68, 2.8…
## $ demolished                           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…

1.2 Minimal cleaning

df1 <- data %>%
  mutate(
    compliance_status = as.factor(compliance_status),
    year_built = as.integer(year_built),
    site_eui_k_btu_sf = as.numeric(site_eui_k_btu_sf),
    ghg_emissions_intensity = as.numeric(ghg_emissions_intensity),

    # Define an age-group variable for hypothesis testing
    building_age_group = case_when(
      year_built < 1980 ~ "Older",
      year_built >= 1980 ~ "Newer",
      TRUE ~ NA_character_
    ) %>% factor(levels = c("Older", "Newer"))
  )

2 Hypothesis Test 1 — Neyman–Pearson Framework

2.1 Research question

Do compliant buildings have a different mean Site EUI than non-compliant buildings?

2.2 1) Hypotheses

Null hypothesis (H₀): \(\mu_{\text{Compliant}} = \mu_{\text{Non-compliant}}\)
Alternative hypothesis (H₁): \(\mu_{\text{Compliant}} \neq \mu_{\text{Non-compliant}}\)

2.3 2) Test choice and decision rule

Test: two-sample t-test (Welch’s t-test by default in R)
Significance level: \(\alpha = 0.05\)
Decision rule: Reject H₀ if p-value < 0.05

This matches the Neyman–Pearson framework because we set \(\alpha\) first and make a reject/fail-to-reject decision.
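
For reference, the statistic behind Welch’s test can be written out as below. This is the standard textbook form; the symbols \(\bar{x}_i\), \(s_i^2\), and \(n_i\) (sample mean, sample variance, and size of group \(i\)) and \(\nu\) (approximate degrees of freedom) are notation introduced here, not taken from the report.

```latex
\[
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx
\frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^{2}}
     {\dfrac{\left(s_1^2/n_1\right)^{2}}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^{2}}{n_2 - 1}}
\]
```

Because the group variances are not pooled, \(\nu\) is generally not an integer, which is why R reports fractional degrees of freedom for this test.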

2.4 3) Prepare test data

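test1_data <- df1 %>%
  filter(
    !is.na(site_eui_k_btu_sf),
    # NOTE: the raw labels are "Compliant" and "Not Compliant" (see the counts
    # in section 2.6), so "Non-Compliant" matches no rows and only one group
    # survives this filter.
    compliance_status %in% c("Compliant", "Non-Compliant")
  ) %>%
  mutate(compliance_status = droplevels(compliance_status))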

test1_data %>% count(compliance_status)

## # A tibble: 1 × 2
##   compliance_status     n
##   <fct>             <int>
## 1 Compliant         31938

2.5 Visualization

test1_plot_df <- test1_data %>%
  mutate(compliance_status = as.factor(compliance_status))

status_n <- test1_plot_df %>% count(compliance_status)

y_lim <- quantile(test1_plot_df$site_eui_k_btu_sf, c(0.02, 0.98), na.rm = TRUE)

ggplot(test1_plot_df, aes(x = compliance_status, y = site_eui_k_btu_sf, fill = compliance_status)) +
  geom_boxplot(width = 0.6, outlier.alpha = 0.15) +
  stat_summary(fun = median, geom = "point", size = 2, color = "black") +
  coord_cartesian(ylim = y_lim) +
  scale_x_discrete(labels = function(x) {
    n_map <- setNames(status_n$n, status_n$compliance_status)
    paste0(x, "\n(n=", n_map[x], ")")
  }) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Site EUI by Compliance Status",
    subtitle = "Boxplots compare distributions; black dots show medians",
    x = "Compliance Status",
    y = "Site EUI (kBtu/sf)"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

2.6 Run the test

df1 %>%
  count(compliance_status, sort = TRUE)

## # A tibble: 2 × 2
##   compliance_status     n
##   <fct>             <int>
## 1 Compliant         32454
## 2 Not Compliant      2245

df1 <- df1 %>%
  mutate(
    compliance_status_std = compliance_status %>%
      as.character() %>%
      stringr::str_trim() %>%         
      stringr::str_to_lower()          
  )

df1 %>% count(compliance_status_std, sort = TRUE)

## # A tibble: 2 × 2
##   compliance_status_std     n
##   <chr>                 <int>
## 1 compliant             32454
## 2 not compliant          2245

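# Second attempt: filter on the standardized labels.
# NOTE: the standardized label is "not compliant", so "non-compliant" matches
# no rows and this filter again keeps only the compliant group.
test1_data <- df1 %>%
  filter(
    compliance_status_std %in% c("compliant", "non-compliant"),
    !is.na(site_eui_k_btu_sf)
  ) %>%
  mutate(
    compliance_status_std = factor(compliance_status_std,
                                   levels = c("compliant", "non-compliant"))
  )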

df1 %>% 
  mutate(cs = compliance_status %>% as.character() %>% stringr::str_trim()) %>%
  count(cs, sort = TRUE)

## # A tibble: 2 × 2
##   cs                n
##   <chr>         <int>
## 1 Compliant     32454
## 2 Not Compliant  2245

df1 <- df1 %>%
  mutate(
    cs_raw = compliance_status %>% as.character() %>% stringr::str_trim() %>% stringr::str_to_lower(),
    # NOTE: the lower-cased label is "not compliant", so neither "non" pattern
    # matches; the plain "compliant" pattern then matches both labels and every
    # row is coded "compliant" (see the count below).
    compliance_2 = case_when(
      stringr::str_detect(cs_raw, "^non") ~ "non-compliant",
      stringr::str_detect(cs_raw, "non\\s*compliant") ~ "non-compliant",
      stringr::str_detect(cs_raw, "compliant") ~ "compliant",
      TRUE ~ NA_character_
    ) %>% factor(levels = c("compliant", "non-compliant"))
  )

df1 %>% count(compliance_2, sort = TRUE)

## # A tibble: 1 × 2
##   compliance_2     n
##   <fct>        <int>
## 1 compliant    34699
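
The count above places all 34,699 rows in the compliant group. Given the labels printed earlier ("Compliant" and "Not Compliant"), the likely cause is that no label starts with "non", while the plain "compliant" pattern matches both labels. The following is a minimal base-R sketch of a recode that matches the labels as they appear in the data; `recode_compliance` is a hypothetical helper and the input vector holds toy values, not rows from the dataset.

```r
# Toy labels mimicking the two values seen in the data (plus stray whitespace)
labels <- c("Compliant", "Not Compliant", "  not compliant ")

# Hypothetical helper: normalize case/whitespace, then match each label exactly
recode_compliance <- function(x) {
  cs <- tolower(trimws(x))
  out <- rep(NA_character_, length(cs))
  out[cs == "compliant"] <- "compliant"
  out[grepl("^not[[:space:]]+compliant$", cs)] <- "not compliant"
  factor(out, levels = c("compliant", "not compliant"))
}

# The "^non" pattern matches none of these labels
old_pattern_hits <- grepl("^non", tolower(trimws(labels)))
print(old_pattern_hits)

# The exact-match recode separates the two groups
recoded <- recode_compliance(labels)
print(table(recoded, useNA = "ifany"))
```

Matching whole labels rather than substrings avoids the situation where "compliant" also matches "not compliant". The `stringr` equivalents of the base functions used here are `str_trim()`, `str_to_lower()`, and `str_detect()`.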

test1_data <- df1 %>%
  filter(!is.na(site_eui_k_btu_sf), !is.na(compliance_2))

test1_data %>% count(compliance_2)

## # A tibble: 1 × 2
##   compliance_2     n
##   <fct>        <int>
## 1 compliant    33424

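# test1_data was last filtered on compliance_2, but it still carries
# compliance_status_std ("compliant" / "not compliant"), the grouping used here.
test1 <- t.test(site_eui_k_btu_sf ~ compliance_status_std, data = test1_data)
test1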

## 
##  Welch Two Sample t-test
## 
## data:  site_eui_k_btu_sf by compliance_status_std
## t = 1.9372, df = 6096.7, p-value = 0.05277
## alternative hypothesis: true difference in means between group compliant and group not compliant is not equal to 0
## 95 percent confidence interval:
##  -0.05415901  9.12107075
## sample estimates:
##     mean in group compliant mean in group not compliant 
##                    56.08494                    51.55148

2.7 Interpretation

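This test asks whether the difference in mean Site EUI between compliant and non-compliant buildings (labeled "Not Compliant" in the data) is larger than random sampling variation alone would plausibly produce. Under the Neyman–Pearson framework, H₀ is rejected only if the p-value falls below the pre-specified \(\alpha = 0.05\). Here p = 0.05277, which is above 0.05, so we fail to reject H₀; consistent with this, the 95% confidence interval for the difference in means (−0.05 to 9.12 kBtu/sf) includes 0. The sample means (56.08 for compliant versus 51.55 for not compliant) differ by about 4.5 kBtu/sf, but the data do not establish a difference at the chosen significance level. Failing to reject H₀ is not proof that the means are equal; it suggests that compliance status alone may not strongly separate energy intensity, and that other drivers such as building type, size, or year built may be more important.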
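
As a small illustration of the decision rule, the sketch below recomputes the two-sided p-value from the t statistic and degrees of freedom printed in the test output and applies the fixed \(\alpha\). The variable names are illustrative; the numbers are copied from the output above rather than recomputed from the data.

```r
# Values copied from the Welch t-test output above
t_stat <- 1.9372
df     <- 6096.7
alpha  <- 0.05   # chosen before looking at the result

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df)

# Equivalent critical-value form of the same rule
t_crit <- qt(1 - alpha / 2, df = df)

decision <- if (p_value < alpha) "reject H0" else "fail to reject H0"
cat(sprintf("p = %.5f, |t| = %.4f, critical t = %.4f: %s\n",
            p_value, abs(t_stat), t_crit, decision))
```

Comparing the p-value with \(\alpha\) and comparing |t| with the critical value are two forms of the same rule, so they always give the same decision.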

3 Hypothesis Test 2 — Fisher’s Significance Testing Framework

3.1 Research question

Do older buildings have a different mean GHG emissions intensity than newer buildings?

3.2 1) Hypotheses

Null hypothesis (H₀): \(\mu_{\text{Older}} = \mu_{\text{Newer}}\)
Alternative hypothesis (H₁): \(\mu_{\text{Older}} \neq \mu_{\text{Newer}}\)

3.3 2) Fisher framework note

In Fisher’s approach, the emphasis is on the p-value as strength of evidence against H₀. Smaller p-values indicate stronger evidence that the observed difference is inconsistent with the null model.

3.4 3) Prepare test data

test2_data <- df1 %>%
  filter(
    !is.na(ghg_emissions_intensity),
    !is.na(building_age_group)
  )

test2_data %>% count(building_age_group)

## # A tibble: 2 × 2
##   building_age_group     n
##   <fct>              <int>
## 1 Older              16990
## 2 Newer              16832

3.5 4) Visualization

library(dplyr)
library(forcats)
library(scales)
library(ggplot2)

# Ensure it’s a factor and set a logical order.
# The levels defined in section 1.2 are "Older" and "Newer"; fct_relevel()
# only reorders existing levels, so it must be given those exact names.
plot2_df <- test2_data %>%
  mutate(
    building_age_group = as.factor(building_age_group),
    building_age_group = fct_relevel(building_age_group, "Older", "Newer")
  )

# n per group for labeling
age_n <- plot2_df %>% count(building_age_group)

# Zoom to reduce extreme outlier compression (does NOT delete points)
y_lim <- quantile(plot2_df$ghg_emissions_intensity, c(0.02, 0.98), na.rm = TRUE)

ggplot(plot2_df, aes(x = building_age_group, y = ghg_emissions_intensity, fill = building_age_group)) +
  geom_boxplot(width = 0.6, outlier.alpha = 0.15) +
  stat_summary(fun = median, geom = "point", size = 2, color = "black") +
  coord_cartesian(ylim = y_lim) +
  scale_x_discrete(labels = function(x) {
    n_map <- setNames(age_n$n, age_n$building_age_group)
    paste0(x, "\n(n=", n_map[x], ")")
  }) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "GHG Emissions Intensity by Building Age Group",
    subtitle = "Boxplots compare distributions; black dots show medians (view zoomed to 2nd–98th percentile)",
    x = "Building Age Group",
    y = "GHG Emissions Intensity"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

5) Run the test

test2 <- t.test(ghg_emissions_intensity ~ building_age_group, data = test2_data)
test2

## 
##  Welch Two Sample t-test
## 
## data:  ghg_emissions_intensity by building_age_group
## t = 2.705, df = 32356, p-value = 0.006833
## alternative hypothesis: true difference in means between group Older and group Newer is not equal to 0
## 95 percent confidence interval:
##  0.1470064 0.9205283
## sample estimates:
## mean in group Older mean in group Newer 
##            1.666947            1.133180

3.6 Interpretation

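This test asks how surprising the observed difference in mean emissions intensity between older and newer buildings would be if the two groups truly had the same mean. In Fisher’s framework the p-value is read as a continuous measure of evidence against H₀: the smaller it is, the less compatible the data are with the null model. It is not the probability that H₀ is true. Here p = 0.006833, so a difference at least this large would be rare if H₀ held, which is fairly strong evidence against it. Older buildings average 1.67 versus 1.13 for newer buildings, and the 95% confidence interval for the difference runs from 0.15 to 0.92. This is evidence of an association between building age and emissions intensity, not proof that age itself causes the difference. The result matters in practice because older buildings reflect the building codes and mechanical systems of their era, and evidence of higher emissions intensity supports prioritizing them for retrofits and efficiency improvements.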
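
To show how the reported quantities fit together, the sketch below rebuilds the p-value and the 95% confidence interval from the summary numbers printed in the test output. It backs out the standard error from the rounded t statistic, so the results should agree with the output only up to rounding; the variable names are illustrative.

```r
# Values copied from the Welch t-test output above
t_stat   <- 2.705
df       <- 32356
mean_old <- 1.666947
mean_new <- 1.133180

# Two-sided p-value: Fisher's measure of evidence against H0
p_value <- 2 * pt(-abs(t_stat), df = df)

# Difference in means and the standard error implied by t = diff / se
diff_means <- mean_old - mean_new
se         <- diff_means / t_stat

# 95% confidence interval for the difference in means
ci <- diff_means + c(-1, 1) * qt(0.975, df = df) * se

cat(sprintf("p = %.6f, difference = %.4f, 95%% CI = [%.4f, %.4f]\n",
            p_value, diff_means, ci[1], ci[2]))
```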

4 Conclusion

Both analyses use hypothesis testing as a structured way to judge whether observed group differences reflect real patterns rather than sampling noise. The Neyman–Pearson test applied a fixed \(\alpha = 0.05\) decision rule to Site EUI by compliance status and did not reject H₀ (p = 0.05277), while Fisher’s significance testing found fairly strong evidence (p = 0.006833) that mean GHG emissions intensity differs between older and newer buildings. Future work could extend these analyses by (1) testing additional group definitions (e.g., by building type), (2) stratifying or controlling for confounders like floor area and use type, and (3) reporting effect sizes and confidence intervals to complement p-values.
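
As one possible starting point for the effect-size suggestion, the sketch below computes Cohen's d (a standardized mean difference using a pooled standard deviation). `cohens_d` is a hypothetical helper written for this illustration, and the two vectors are toy values, not data from the report.

```r
# Hypothetical helper: Cohen's d with a pooled standard deviation
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Toy values standing in for two groups (NOT taken from the dataset)
older <- c(2, 4, 6, 8)
newer <- c(1, 3, 5, 7)

d <- cohens_d(older, newer)
print(d)
```

On the real data, the same helper could be applied to `ghg_emissions_intensity` split by `building_age_group`. The pooled standard deviation assumes broadly similar group variances, so with very unequal spreads a different standardizer may be more appropriate.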