This week’s data dive continues using the Building Energy Benchmarking Data dataset, but shifts focus from analysis results to documentation and interpretation. Documentation matters because many columns contain encoded categories, implied units, or missingness patterns that can change how we interpret summaries, trends, and outliers. In this write-up, I identify columns that are unclear without documentation, highlight an area that remains ambiguous even after documentation, and use visualizations to show why interpretation choices matter. I also examine categorical “missing/empty” groups and investigate outliers in a continuous energy metric.
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(janitor)
library(scales)
library(forcats)
library(skimr)
data <- read_csv("/Users/divya/Desktop/IU/Statistics R Prog/Labs/Assignments/Building_Energy_Benchmarking_Data__2015-Present.csv")
data <- clean_names(data)
df1 <- data %>%
mutate(
data_year = as.integer(data_year),
year_built = as.integer(year_built),
compliance_status = as.factor(compliance_status),
largest_property_use_type = as.factor(largest_property_use_type),
building_type = as.factor(building_type),
neighborhood = as.factor(neighborhood)
)
compliance_statusdf1 %>% count(compliance_status, sort = TRUE)
## # A tibble: 2 × 2
## compliance_status n
## <fct> <int>
## 1 Compliant 32454
## 2 Not Compliant 2245
Why this needs documentation:
Without documentation, it is unclear what each compliance label means
e.g., whether “non-compliant” indicates a regulatory violation,
incomplete reporting, exemptions, timing issues, or data quality
flags.
Why it matters for analysis:
If “non-compliant” partly reflects reporting status rather than building
performance, then comparing energy metrics across compliance groups
could mix regulatory/reporting issues with true efficiency
differences.
Further question:
Are differences in energy metrics by compliance status still present
after controlling for building type and size?
site_eui_k_btu_sfsummary(df1$site_eui_k_btu_sf)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.10 27.50 36.90 55.88 57.00 44197.90 1275
Why this needs documentation:
The name suggests units (kBtu/sf), but documentation is needed to
confirm the unit meaning, whether values are annualized, and whether
higher values indicate more energy use.
Why it matters for analysis:
If a reader misinterprets the direction thinking higher EUI is “better”,
conclusions about which buildings are efficient would be reversed.
Further question:
Do certain property use types consistently show higher Site EUI, and are
those differences stable over time?
energystar_score missing?Even with documentation, it can remain unclear why some buildings lack ENERGY STAR scores. Missingness could be due to ineligibility, incomplete inputs, or reporting limitations.
df1 %>%
summarise(
total_rows = n(),
missing_energystar = sum(is.na(energystar_score)),
pct_missing = mean(is.na(energystar_score)) * 100
)
## # A tibble: 1 × 3
## total_rows missing_energystar pct_missing
## <int> <int> <dbl>
## 1 34699 9285 26.8
Significance:
If missingness is systematic e.g., concentrated in certain building
types, then analysis using ENERGY STAR scores may be biased toward the
subset of buildings that receive scores.
Further question:
Is missing ENERGY STAR scoring more common in specific building types or
compliance groups?
This plot shows whether missing ENERGY STAR scores are evenly distributed across building types.
df1 %>%
mutate(score_missing = is.na(energystar_score)) %>%
filter(!is.na(building_type)) %>%
count(building_type, score_missing) %>%
group_by(building_type) %>%
mutate(prop = n / sum(n)) %>%
ungroup() %>%
ggplot(aes(x = fct_reorder(building_type, prop, .fun = max), y = prop, fill = score_missing)) +
geom_col() +
coord_flip() +
scale_y_continuous(labels = percent_format()) +
labs(
title = "Proportion of Missing ENERGY STAR Scores by Building Type",
x = "Building Type",
y = "Proportion within building type",
fill = "ENERGY STAR missing?"
) +
theme_minimal(base_size = 12)
Why this matters:
If some building types have much higher missingness, then comparisons of
ENERGY STAR score across types may reflect availability bias rather than
true performance differences.
This plot compares Site EUI distributions for buildings with vs without an ENERGY STAR score.
df1 %>%
filter(!is.na(site_eui_k_btu_sf)) %>%
mutate(score_missing = is.na(energystar_score)) %>%
ggplot(aes(x = score_missing, y = site_eui_k_btu_sf, fill = score_missing)) +
geom_boxplot(outlier.alpha = 0.2) +
scale_y_continuous(labels = comma) +
labs(
title = "Site EUI Distribution by ENERGY STAR Score Availability",
x = "ENERGY STAR Score Missing",
y = "Site EUI (kBtu/sf)",
fill = "Score Missing"
) +
theme_minimal(base_size = 12)
Why this matters:
If buildings without scores systematically have higher/lower Site EUI,
then missingness is informative and should be discussed in
conclusions.
A key point from documentation is that group summaries only show
groups that exist in the data. We cannot detect “empty groups” (0
counts) unless the factor levels are defined and you set
.drop = FALSE.
compliance_status# Explicit missing values (NA)
sum(is.na(df1$compliance_status))
## [1] 0
# Observed groups only
df1 %>% count(compliance_status)
## # A tibble: 2 × 2
## compliance_status n
## <fct> <int>
## 1 Compliant 32454
## 2 Not Compliant 2245
# Include empty factor levels (if any exist)
df1 %>%
mutate(compliance_status = factor(compliance_status)) %>%
count(compliance_status, .drop = FALSE)
## # A tibble: 2 × 2
## compliance_status n
## <fct> <int>
## 1 Compliant 32454
## 2 Not Compliant 2245
Interpretation:
- sum(is.na(...)) checks explicit missing values.
- count() shows only categories that appear.
- count(..., .drop = FALSE) can show empty levels, but only
if those levels exist in the factor definition.
building_typesum(is.na(df1$building_type))
## [1] 0
df1 %>% count(building_type, sort = TRUE)
## # A tibble: 8 × 2
## building_type n
## <fct> <int>
## 1 NonResidential 13680
## 2 Multifamily LR (1-4) 10542
## 3 Multifamily MR (5-9) 6975
## 4 Multifamily HR (10+) 1275
## 5 SPS-District K-12 938
## 6 Nonresidential COS 651
## 7 Campus 423
## 8 Nonresidential WA 215
df1 %>% mutate(building_type = factor(building_type)) %>% count(building_type, .drop = FALSE)
## # A tibble: 8 × 2
## building_type n
## <fct> <int>
## 1 Campus 423
## 2 Multifamily HR (10+) 1275
## 3 Multifamily LR (1-4) 10542
## 4 Multifamily MR (5-9) 6975
## 5 NonResidential 13680
## 6 Nonresidential COS 651
## 7 Nonresidential WA 215
## 8 SPS-District K-12 938
Significance:
Rare or missing categories matter because small group sizes produce
unstable averages and can change conclusions across samples (a key
lesson from Week 4).
site_eui_k_btu_sf (IQR rule)Q1 <- quantile(df1$site_eui_k_btu_sf, 0.25, na.rm = TRUE)
Q3 <- quantile(df1$site_eui_k_btu_sf, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_val
upper_bound <- Q3 + 1.5 * IQR_val
df_outliers <- df1 %>%
filter(!is.na(site_eui_k_btu_sf)) %>%
filter(site_eui_k_btu_sf < lower_bound | site_eui_k_btu_sf > upper_bound)
tibble(
lower_bound = lower_bound,
upper_bound = upper_bound,
outlier_rows = nrow(df_outliers)
)
## # A tibble: 1 × 3
## lower_bound upper_bound outlier_rows
## <dbl> <dbl> <int>
## 1 -16.8 101. 2711
df1 %>%
filter(!is.na(site_eui_k_btu_sf)) %>%
ggplot(aes(y = site_eui_k_btu_sf)) +
geom_boxplot(outlier.alpha = 0.25) +
scale_y_continuous(labels = comma) +
labs(
title = "Outliers in Site Energy Use Intensity (Site EUI)",
y = "Site EUI (kBtu/sf)"
) +
theme_minimal(base_size = 12)
Why outliers matter:
Outliers may represent truly energy-intensive buildings, data entry
issues, or differences in building operations. Documentation can help
interpret whether extreme values are indicate measurement/reporting
issues.
Further question:
Are outliers concentrated in certain building types or compliance
categories?
df_outliers %>%
filter(!is.na(building_type)) %>%
count(building_type, sort = TRUE) %>%
slice_head(n = 10)
## # A tibble: 8 × 2
## building_type n
## <fct> <int>
## 1 NonResidential 2026
## 2 Nonresidential COS 203
## 3 Campus 159
## 4 Multifamily LR (1-4) 107
## 5 Nonresidential WA 99
## 6 Multifamily MR (5-9) 66
## 7 Multifamily HR (10+) 49
## 8 SPS-District K-12 2
This analysis shows why careful interpretation is essential when working with real-world benchmarking data. Several columns e.g., compliance status and Site EUI require documentation to avoid incorrect assumptions about meaning and directionality. Even after documentation, some uncertainty remains especially around systematic missingness in ENERGY STAR scores which can bias comparisons. Categorical summaries can hide empty groups unless factor levels are defined, and outliers require both statistical rules and domain judgment. Going forward, I would document assumptions explicitly, report missingness alongside summaries, and use sensitivity checks to ensure conclusions are not driven by rare categories or extreme values.