1 Introduction

This week’s data dive continues using the Building Energy Benchmarking Data dataset, but shifts focus from analysis results to documentation and interpretation. Documentation matters because many columns contain encoded categories, implied units, or missingness patterns that can change how we interpret summaries, trends, and outliers. In this write-up, I identify columns that are unclear without documentation, highlight an area that remains ambiguous even after documentation, and use visualizations to show why interpretation choices matter. I also examine categorical “missing/empty” groups and investigate outliers in a continuous energy metric.

knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)
library(janitor)
library(scales)
library(forcats)
library(skimr)

1.1 Load data

data <- read_csv("/Users/divya/Desktop/IU/Statistics R Prog/Labs/Assignments/Building_Energy_Benchmarking_Data__2015-Present.csv")

data <- clean_names(data)

1.2 Minimal cleaning

df1 <- data %>%
  mutate(
    data_year = as.integer(data_year),
    year_built = as.integer(year_built),
    compliance_status = as.factor(compliance_status),
    largest_property_use_type = as.factor(largest_property_use_type),
    building_type = as.factor(building_type),
    neighborhood = as.factor(neighborhood)
  )

2 1) Columns that are unclear without documentation

2.1 compliance_status

df1 %>% count(compliance_status, sort = TRUE)
## # A tibble: 2 × 2
##   compliance_status     n
##   <fct>             <int>
## 1 Compliant         32454
## 2 Not Compliant      2245

Why this needs documentation:
Without documentation, it is unclear what each compliance label means e.g., whether “non-compliant” indicates a regulatory violation, incomplete reporting, exemptions, timing issues, or data quality flags.

Why it matters for analysis:
If “non-compliant” partly reflects reporting status rather than building performance, then comparing energy metrics across compliance groups could mix regulatory/reporting issues with true efficiency differences.

Further question:
Are differences in energy metrics by compliance status still present after controlling for building type and size?


2.2 Example 2: site_eui_k_btu_sf

summary(df1$site_eui_k_btu_sf)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##     0.10    27.50    36.90    55.88    57.00 44197.90     1275

Why this needs documentation:
The name suggests units (kBtu/sf), but documentation is needed to confirm the unit meaning, whether values are annualized, and whether higher values indicate more energy use.

Why it matters for analysis:
If a reader misinterprets the direction thinking higher EUI is “better”, conclusions about which buildings are efficient would be reversed.

Further question:
Do certain property use types consistently show higher Site EUI, and are those differences stable over time?


3 2) Something unclear even after documentation

3.1 Why is energystar_score missing?

Even with documentation, it can remain unclear why some buildings lack ENERGY STAR scores. Missingness could be due to ineligibility, incomplete inputs, or reporting limitations.

df1 %>%
  summarise(
    total_rows = n(),
    missing_energystar = sum(is.na(energystar_score)),
    pct_missing = mean(is.na(energystar_score)) * 100
  )
## # A tibble: 1 × 3
##   total_rows missing_energystar pct_missing
##        <int>              <int>       <dbl>
## 1      34699               9285        26.8

Significance:
If missingness is systematic e.g., concentrated in certain building types, then analysis using ENERGY STAR scores may be biased toward the subset of buildings that receive scores.

Further question:
Is missing ENERGY STAR scoring more common in specific building types or compliance groups?


4 3) Visualizations showing why documentation/interpretation matters

4.1 Visualization 1: Missing ENERGY STAR scores by building type

This plot shows whether missing ENERGY STAR scores are evenly distributed across building types.

df1 %>%
  mutate(score_missing = is.na(energystar_score)) %>%
  filter(!is.na(building_type)) %>%
  count(building_type, score_missing) %>%
  group_by(building_type) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup() %>%
  ggplot(aes(x = fct_reorder(building_type, prop, .fun = max), y = prop, fill = score_missing)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "Proportion of Missing ENERGY STAR Scores by Building Type",
    x = "Building Type",
    y = "Proportion within building type",
    fill = "ENERGY STAR missing?"
  ) +
  theme_minimal(base_size = 12)

Why this matters:
If some building types have much higher missingness, then comparisons of ENERGY STAR score across types may reflect availability bias rather than true performance differences.


4.2 Visualization 2: Site EUI by ENERGY STAR availability

This plot compares Site EUI distributions for buildings with vs without an ENERGY STAR score.

df1 %>%
  filter(!is.na(site_eui_k_btu_sf)) %>%
  mutate(score_missing = is.na(energystar_score)) %>%
  ggplot(aes(x = score_missing, y = site_eui_k_btu_sf, fill = score_missing)) +
  geom_boxplot(outlier.alpha = 0.2) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Site EUI Distribution by ENERGY STAR Score Availability",
    x = "ENERGY STAR Score Missing",
    y = "Site EUI (kBtu/sf)",
    fill = "Score Missing"
  ) +
  theme_minimal(base_size = 12)

Why this matters:
If buildings without scores systematically have higher/lower Site EUI, then missingness is informative and should be discussed in conclusions.


5 4) Categorical columns: explicit missing, implicit missing, and empty groups

A key point from documentation is that group summaries only show groups that exist in the data. We cannot detect “empty groups” (0 counts) unless the factor levels are defined and you set .drop = FALSE.

5.1 compliance_status

# Explicit missing values (NA)
sum(is.na(df1$compliance_status))
## [1] 0
# Observed groups only
df1 %>% count(compliance_status)
## # A tibble: 2 × 2
##   compliance_status     n
##   <fct>             <int>
## 1 Compliant         32454
## 2 Not Compliant      2245
# Include empty factor levels (if any exist)
df1 %>%
  mutate(compliance_status = factor(compliance_status)) %>%
  count(compliance_status, .drop = FALSE)
## # A tibble: 2 × 2
##   compliance_status     n
##   <fct>             <int>
## 1 Compliant         32454
## 2 Not Compliant      2245

Interpretation:
- sum(is.na(...)) checks explicit missing values.
- count() shows only categories that appear.
- count(..., .drop = FALSE) can show empty levels, but only if those levels exist in the factor definition.


5.2 Example B: building_type

sum(is.na(df1$building_type))
## [1] 0
df1 %>% count(building_type, sort = TRUE)
## # A tibble: 8 × 2
##   building_type            n
##   <fct>                <int>
## 1 NonResidential       13680
## 2 Multifamily LR (1-4) 10542
## 3 Multifamily MR (5-9)  6975
## 4 Multifamily HR (10+)  1275
## 5 SPS-District K-12      938
## 6 Nonresidential COS     651
## 7 Campus                 423
## 8 Nonresidential WA      215
df1 %>% mutate(building_type = factor(building_type)) %>% count(building_type, .drop = FALSE)
## # A tibble: 8 × 2
##   building_type            n
##   <fct>                <int>
## 1 Campus                 423
## 2 Multifamily HR (10+)  1275
## 3 Multifamily LR (1-4) 10542
## 4 Multifamily MR (5-9)  6975
## 5 NonResidential       13680
## 6 Nonresidential COS     651
## 7 Nonresidential WA      215
## 8 SPS-District K-12      938

Significance:
Rare or missing categories matter because small group sizes produce unstable averages and can change conclusions across samples (a key lesson from Week 4).


6 5) Outliers in a continuous variable

6.1 Outliers in site_eui_k_btu_sf (IQR rule)

Q1 <- quantile(df1$site_eui_k_btu_sf, 0.25, na.rm = TRUE)
Q3 <- quantile(df1$site_eui_k_btu_sf, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1

lower_bound <- Q1 - 1.5 * IQR_val
upper_bound <- Q3 + 1.5 * IQR_val

df_outliers <- df1 %>%
  filter(!is.na(site_eui_k_btu_sf)) %>%
  filter(site_eui_k_btu_sf < lower_bound | site_eui_k_btu_sf > upper_bound)

tibble(
  lower_bound = lower_bound,
  upper_bound = upper_bound,
  outlier_rows = nrow(df_outliers)
)
## # A tibble: 1 × 3
##   lower_bound upper_bound outlier_rows
##         <dbl>       <dbl>        <int>
## 1       -16.8        101.         2711
df1 %>%
  filter(!is.na(site_eui_k_btu_sf)) %>%
  ggplot(aes(y = site_eui_k_btu_sf)) +
  geom_boxplot(outlier.alpha = 0.25) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Outliers in Site Energy Use Intensity (Site EUI)",
    y = "Site EUI (kBtu/sf)"
  ) +
  theme_minimal(base_size = 12)

Why outliers matter:
Outliers may represent truly energy-intensive buildings, data entry issues, or differences in building operations. Documentation can help interpret whether extreme values are indicate measurement/reporting issues.

Further question:
Are outliers concentrated in certain building types or compliance categories?

df_outliers %>%
  filter(!is.na(building_type)) %>%
  count(building_type, sort = TRUE) %>%
  slice_head(n = 10)
## # A tibble: 8 × 2
##   building_type            n
##   <fct>                <int>
## 1 NonResidential        2026
## 2 Nonresidential COS     203
## 3 Campus                 159
## 4 Multifamily LR (1-4)   107
## 5 Nonresidential WA       99
## 6 Multifamily MR (5-9)    66
## 7 Multifamily HR (10+)    49
## 8 SPS-District K-12        2

7 Conclusion

This analysis shows why careful interpretation is essential when working with real-world benchmarking data. Several columns e.g., compliance status and Site EUI require documentation to avoid incorrect assumptions about meaning and directionality. Even after documentation, some uncertainty remains especially around systematic missingness in ENERGY STAR scores which can bias comparisons. Categorical summaries can hide empty groups unless factor levels are defined, and outliers require both statistical rules and domain judgment. Going forward, I would document assumptions explicitly, report missingness alongside summaries, and use sensitivity checks to ensure conclusions are not driven by rare categories or extreme values.