Introduction

This analysis uses the Building Energy Benchmarking Data (2015–Present) dataset, which contains detailed information on building characteristics, energy usage, greenhouse gas emissions, and compliance status for benchmarked buildings. The dataset includes both categorical variables (such as building type, neighborhood, and compliance status) and numerical variables (such as ENERGY STAR score, Site EUI, and GHG emissions intensity). To support grouped analysis and probability calculations, the data were cleaned using standardized column names and minimal type conversions. A binned version of year_built was created to enable comparisons across construction eras rather than individual years.

library(tidyverse)
library(janitor)
library(scales)
df <- read_csv("/Users/divya/Desktop/IU/Statistics R Prog/Labs/Assignments/Building_Energy_Benchmarking_Data__2015-Present.csv")

df <- clean_names(df)

glimpse(df)

## Rows: 34,699
## Columns: 46
## $ ose_building_id                      <dbl> 1, 2, 3, 5, 8, 9, 10, 11, 12, 13,…
## $ data_year                            <dbl> 2024, 2024, 2024, 2024, 2024, 202…
## $ building_name                        <chr> "MAYFLOWER PARK HOTEL", "PARAMOUN…
## $ building_type                        <chr> "NonResidential", "NonResidential…
## $ tax_parcel_identification_number     <chr> "659000030", "659000220", "659000…
## $ address                              <chr> "405 OLIVE WAY", "724 PINE ST", "…
## $ city                                 <chr> "SEATTLE", "SEATTLE", "SEATTLE", …
## $ state                                <chr> "WA", "WA", "WA", "WA", "WA", "WA…
## $ zip_code                             <dbl> 98101, 98101, 98101, 98101, 98121…
## $ latitude                             <dbl> 47.61220, 47.61307, 47.61367, 47.…
## $ longitude                            <dbl> -122.3380, -122.3336, -122.3382, …
## $ neighborhood                         <chr> "DOWNTOWN", "DOWNTOWN", "DOWNTOWN…
## $ council_district_code                <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 1, 1, 7, …
## $ year_built                           <dbl> 1927, 1996, 1969, 1926, 1980, 199…
## $ numberof_floors                      <dbl> 12, 11, 41, 10, 18, 2, 11, 8, 15,…
## $ numberof_buildings                   <dbl> 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ property_gfa_total                   <dbl> 88434, 103566, 956110, 61320, 175…
## $ property_gfa_buildings               <dbl> 88434, 88502, 759392, 61320, 1135…
## $ property_gfa_parking                 <dbl> 0, 15064, 196718, 0, 62000, 37198…
## $ self_report_gfa_total                <dbl> 115387, 103566, 947059, 61320, 20…
## $ self_report_gfa_buildings            <dbl> 115387, 88502, 827566, 61320, 123…
## $ self_report_parking                  <dbl> 0, 15064, 119493, 0, 80497, 40971…
## $ energystar_score                     <dbl> 59, 85, 71, 50, 87, NA, 10, NA, 5…
## $ site_euiwn_k_btu_sf                  <dbl> 62.2, 71.9, 82.0, 87.2, 97.6, 168…
## $ site_eui_k_btu_sf                    <dbl> 61.7, 71.5, 81.7, 86.0, 97.1, 167…
## $ site_energy_use_k_btu                <dbl> 7113958, 6330664, 67613264, 52739…
## $ site_energy_use_wn_k_btu             <dbl> 7172158, 6362478, 67852608, 53463…
## $ source_euiwn_k_btu_sf                <dbl> 122.9, 128.7, 171.8, 174.7, 167.6…
## $ source_eui_k_btu_sf                  <dbl> 121.4, 128.3, 171.5, 171.4, 167.2…
## $ epa_property_type                    <chr> "Hotel", "Hotel", "Hotel", "Hotel…
## $ largest_property_use_type            <chr> "Hotel", "Hotel", "Hotel", "Hotel…
## $ largest_property_use_type_gfa        <dbl> 115387, 88502, 827566, 61320, 123…
## $ second_largest_property_use_type     <chr> NA, "Parking", "Parking", NA, "Pa…
## $ second_largest_property_use_type_gfa <dbl> NA, 15064, 117783, NA, 68009, 409…
## $ third_largest_property_use_type      <chr> NA, NA, "Swimming Pool", NA, "Swi…
## $ third_largest_property_use_type_gfa  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ electricity_k_wh                     <dbl> 1045040, 787838, 11279080, 796976…
## $ steam_use_k_btu                      <dbl> 1949686, NA, 23256386, 1389935, N…
## $ natural_gas_therms                   <dbl> 15986, 36426, 58726, 11648, 73811…
## $ compliance_status                    <chr> "Not Compliant", "Compliant", "Co…
## $ compliance_issue                     <chr> "Default Data", "No Issue", "No I…
## $ electricity_k_btu                    <dbl> 3565676, 2688104, 38484221, 27192…
## $ natural_gas_k_btu                    <dbl> 1598590, 3642560, 5872650, 116476…
## $ total_ghg_emissions                  <dbl> 263.3, 208.6, 2418.2, 190.1, 417.…
## $ ghg_emissions_intensity              <dbl> 2.98, 2.36, 3.18, 3.10, 3.68, 2.8…
## $ demolished                           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…

skimr::skim(df)

Data summary
Name	df
Number of rows	34699
Number of columns	46
_______________________
Column type frequency:
character	13
logical	1
numeric	32
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
building_name	0	1.00	3	70	3785
building_type	0	1.00	6	20	8
tax_parcel_identification_number	0	1.00	1	11	3691
address	0	1.00	9	33	3759
city	0	1.00	7	7	2
state	0	1.00	2	2	1
neighborhood	0	1.00	4	22	13
epa_property_type	21	1.00	5	52	71
largest_property_use_type	21	1.00	5	52	69
second_largest_property_use_type	14465	0.58	5	52	66
third_largest_property_use_type	26623	0.23	5	52	57
compliance_status	0	1.00	9	13	2
compliance_issue	459	0.99	8	61	12

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
demolished	0	1	0.01	FAL: 34370, TRU: 329

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ose_building_id	0	1.00	23064.46	13665.32	1.00	20224.00	23454.00	26671.00	5.204600e+04	▂▂▇▁▂
data_year	0	1.00	2019.65	2.86	2015.00	2017.00	2020.00	2022.00	2.024000e+03	▇▇▇▇▇
zip_code	0	1.00	98116.54	16.74	98101.00	98105.00	98112.00	98122.00	9.819900e+04	▇▃▁▁▁
latitude	0	1.00	47.62	0.05	47.50	47.60	47.62	47.66	4.773000e+01	▁▂▇▃▂
longitude	0	1.00	-122.33	0.03	-122.41	-122.35	-122.33	-122.32	-1.222600e+02	▁▃▇▃▁
council_district_code	0	1.00	4.22	2.24	1.00	2.00	4.00	7.00	7.000000e+00	▆▃▂▂▇
year_built	0	1.00	1972.26	34.26	1900.00	1952.00	1979.00	2001.00	2.023000e+03	▃▃▆▇▇
numberof_floors	0	1.00	5.03	5.66	0.00	3.00	4.00	6.00	7.600000e+01	▇▁▁▁▁
numberof_buildings	374	0.99	1.27	5.74	1.00	1.00	1.00	1.00	3.000000e+02	▇▁▁▁▁
property_gfa_total	0	1.00	106983.36	300172.30	18481.00	29522.00	46527.00	98220.00	1.521647e+07	▇▁▁▁▁
property_gfa_buildings	0	1.00	90968.40	282479.55	12806.00	27500.00	43228.00	87513.00	1.521647e+07	▇▁▁▁▁
property_gfa_parking	0	1.00	14767.03	46665.34	0.00	0.00	0.00	5571.00	6.867500e+05	▇▁▁▁▁
self_report_gfa_total	0	1.00	108377.80	302851.80	0.00	29351.00	46677.00	100536.00	1.521647e+07	▇▁▁▁▁
self_report_gfa_buildings	0	1.00	90885.16	282718.77	0.00	27200.00	42952.00	87588.00	1.521647e+07	▇▁▁▁▁
self_report_parking	0	1.00	17492.64	49046.64	0.00	0.00	0.00	12000.00	7.867710e+05	▇▁▁▁▁
energystar_score	9285	0.73	72.36	25.49	1.00	59.00	80.00	93.00	1.000000e+02	▁▁▂▃▇
site_euiwn_k_btu_sf	1642	0.95	56.58	299.82	0.10	27.90	37.50	58.00	4.419800e+04	▇▁▁▁▁
site_eui_k_btu_sf	1275	0.96	55.88	294.91	0.10	27.50	36.90	57.00	4.419790e+04	▇▁▁▁▁
site_energy_use_k_btu	1268	0.96	9330387.13	562328180.72	433.00	945454.50	1845118.00	4290153.00	1.019950e+11	▇▁▁▁▁
site_energy_use_wn_k_btu	1634	0.95	9325697.04	565374694.75	433.00	956923.00	1869044.00	4326666.00	1.020130e+11	▇▁▁▁▁
source_euiwn_k_btu_sf	1642	0.95	123.20	345.69	-2.10	69.50	89.00	125.20	4.646800e+04	▇▁▁▁▁
source_eui_k_btu_sf	1275	0.96	121.96	339.82	-2.00	68.70	87.60	123.80	4.646780e+04	▇▁▁▁▁
largest_property_use_type_gfa	21	1.00	85121.83	245489.12	5100.00	25880.00	41256.00	81526.00	1.521647e+07	▇▁▁▁▁
second_largest_property_use_type_gfa	16191	0.53	33818.68	58935.77	1.00	6499.00	13252.00	34164.00	7.521630e+05	▇▁▁▁▁
third_largest_property_use_type_gfa	27474	0.21	14288.24	30452.70	35.00	3000.00	6071.00	12851.00	4.806250e+05	▇▁▁▁▁
electricity_k_wh	956	0.97	1080161.25	4784350.54	1.00	190533.00	349144.00	825645.00	2.807280e+08	▇▁▁▁▁
steam_use_k_btu	33504	0.03	26768064.20	318465114.72	1194.00	958647.50	2246909.00	5786526.00	7.060000e+09	▇▁▁▁▁
natural_gas_therms	13421	0.61	25253.70	192336.85	1.00	4154.25	9631.00	21425.00	2.392346e+07	▇▁▁▁▁
electricity_k_btu	956	0.97	3685510.20	16324204.03	3.00	650098.00	1191278.00	2817100.00	9.578439e+08	▇▁▁▁▁
natural_gas_k_btu	13414	0.61	2524538.99	19230576.90	4.00	415126.00	962770.00	2142000.00	2.392346e+09	▇▁▁▁▁
total_ghg_emissions	855	0.98	182.02	4986.97	0.10	7.90	32.50	93.60	5.816548e+05	▇▁▁▁▁
ghg_emissions_intensity	877	0.97	1.40	18.13	0.01	0.18	0.63	1.39	2.547820e+03	▇▁▁▁▁

2) Minimal cleaning

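Minimal data cleaning was applied to prepare the dataset for grouping and probability analysis. Key variables were converted to appropriate data types (integers for years, factors for grouping fields), and construction years were binned into 20-year intervals for interpretability. Records with missing values in the essential grouping fields (building_type, neighborhood, data_year) are filtered out to keep group counts reliable. The skim summary above reports no missing values in these three fields, so the filter acts as a safeguard and removes no rows here.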

df1 <- df %>%
  mutate(
    data_year = as.integer(data_year),
    year_built = as.integer(year_built),
    council_district_code = as.factor(council_district_code),
    demolished = as.factor(demolished),
    compliance_status = as.factor(compliance_status),
    building_type = as.factor(building_type),
    neighborhood = as.factor(neighborhood),
    epa_property_type = as.factor(epa_property_type),
    # bin construction year into 20-year-wide intervals with ggplot2::cut_interval()
    year_built_bin = cut_interval(year_built, length = 20)
  ) %>%
  filter(!is.na(building_type), !is.na(neighborhood), !is.na(data_year))
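
The bin labels produced by this step print in scientific notation later in the report (for example `(1.92e+03,1.94e+03]`). A minimal sketch of one possible fix, assuming `cut_interval()` forwards extra arguments such as `dig.lab` to `base::cut()` as its documentation describes. The `years` vector is illustrative only and is not taken from the dataset.

```r
library(ggplot2)

# Illustrative construction years only, chosen to span the 1900-2023 range
years <- c(1900L, 1927L, 1969L, 1996L, 2023L)

# Default label formatting (dig.lab = 3) versus four significant digits
default_bins  <- cut_interval(years, length = 20)
readable_bins <- cut_interval(years, length = 20, dig.lab = 4)

levels(default_bins)
levels(readable_bins)
```

With `dig.lab = 4` the first level should read `[1900,1920]` rather than `[1.9e+03,1.92e+03]`. The bin boundaries themselves are unchanged, so the grouped counts would stay the same.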

Helper Function for Probability Calculation

The helper function divides each group’s count by the total count to obtain that group’s probability, tags the tag_n smallest groups (all tied groups included) as LOWEST_PROB_GROUP, and sorts the result in ascending order of count so that the rarest group appears first.

library(tidyverse)
add_prob_and_tag <- function(group_df, n_col = "n", tag_n = 1) {
  group_df %>%
    mutate(
      prob = .data[[n_col]] / sum(.data[[n_col]]),
      rarity_tag = if_else(rank(.data[[n_col]], ties.method = "min") <= tag_n,
                           "LOWEST_PROB_GROUP", "other")
    ) %>%
    arrange(.data[[n_col]])
}
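
To make the helper's behavior concrete, here is a small sketch that applies it to a made-up table of counts. The `toy_counts` data are hypothetical, and the function body is repeated only so the block runs on its own.

```r
library(dplyr)

# Same helper as above, repeated so this block is self-contained
add_prob_and_tag <- function(group_df, n_col = "n", tag_n = 1) {
  group_df %>%
    mutate(
      prob = .data[[n_col]] / sum(.data[[n_col]]),
      rarity_tag = if_else(rank(.data[[n_col]], ties.method = "min") <= tag_n,
                           "LOWEST_PROB_GROUP", "other")
    ) %>%
    arrange(.data[[n_col]])
}

# Hypothetical group counts (10 records in total)
toy_counts <- tibble(group = c("a", "b", "c"), n = c(6L, 1L, 3L))

toy_result <- add_prob_and_tag(toy_counts)
toy_result
```

Group `b` has 1 of 10 records, so it gets `prob = 0.1` and the `LOWEST_PROB_GROUP` tag, and it is sorted to the top. Raising `tag_n` to 2 would also tag group `c`.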

4) Group By #1

This grouping examines how ENERGY STAR scores vary across different building types. The results show that building types differ substantially in both the number of benchmarked records and their average energy performance.

4A) Build the grouped summary

If a single record is selected at random, the probability that it belongs to a given building type equals that type’s count divided by the total count. Because the summary below first drops records without an ENERGY STAR score, these are probabilities among scored records, not among all records in the dataset. The building type with the smallest count is the lowest-probability group, meaning it is the rarest building type among scored records.

gb1 <- df1 %>%
  filter(!is.na(energystar_score)) %>%
  group_by(building_type) %>%
  summarise(
    n = n(),
    mean_es = mean(energystar_score, na.rm = TRUE),
    median_es = median(energystar_score, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  add_prob_and_tag(tag_n = 1)

gb1

## # A tibble: 8 × 6
##   building_type            n mean_es median_es    prob rarity_tag       
##   <fct>                <int>   <dbl>     <dbl>   <dbl> <chr>            
## 1 Nonresidential COS      95    51.2        54 0.00374 LOWEST_PROB_GROUP
## 2 Nonresidential WA      106    51.3        55 0.00417 other            
## 3 Campus                 159    64.5        69 0.00626 other            
## 4 SPS-District K-12      905    75.1        79 0.0356  other            
## 5 Multifamily HR (10+)  1096    61.7        68 0.0431  other            
## 6 Multifamily MR (5-9)  5638    83.3        92 0.222   other            
## 7 Multifamily LR (1-4)  8515    75.1        81 0.335   other            
## 8 NonResidential        8900    64.4        72 0.350   other

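Probability Interpretation

In the gb1 output, Nonresidential COS is tagged as the lowest-probability group: 95 of the 25,414 records with an ENERGY STAR score, or about 0.37%. NonResidential is the most likely group at about 35%, followed by Multifamily LR (1-4) at about 33.5%.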
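
As a rough check on how the ENERGY STAR filter affects these probabilities, the sketch below recomputes them from counts printed in this report. `n_scored` copies the n column of the gb1 output, and `n_all` adds the Compliant and Not Compliant counts per building type from the combination table in section 7B. The variable names are illustrative.

```r
# Records with an ENERGY STAR score, copied from the gb1 output
n_scored <- c(
  "Nonresidential COS" = 95, "Nonresidential WA" = 106, "Campus" = 159,
  "SPS-District K-12" = 905, "Multifamily HR (10+)" = 1096,
  "Multifamily MR (5-9)" = 5638, "Multifamily LR (1-4)" = 8515,
  "NonResidential" = 8900
)

# All records per type: Compliant + Not Compliant counts from section 7B
n_all <- c(
  "Nonresidential COS" = 641 + 10, "Nonresidential WA" = 207 + 8,
  "Campus" = 405 + 18, "SPS-District K-12" = 825 + 113,
  "Multifamily HR (10+)" = 1189 + 86, "Multifamily MR (5-9)" = 6574 + 401,
  "Multifamily LR (1-4)" = 9920 + 622, "NonResidential" = 12693 + 987
)

prob_scored <- n_scored / sum(n_scored)  # probability among scored records
prob_all    <- n_all / sum(n_all)        # probability among all records

round(cbind(prob_scored, prob_all), 4)
names(which.min(n_scored))
names(which.min(n_all))
```

With these counts, Nonresidential COS is the rarest type among scored records, while Nonresidential WA (215 records) is the rarest across all 34,699 records. The rarity tag therefore depends on which records the filter keeps.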

Significance

Knowing which building types are rare is important because averages based on small samples are less stable, and rarity may reflect structural differences in the city’s building stock or in benchmarking requirements. Differences in ENERGY STAR performance by building type also highlight opportunities for targeted energy efficiency improvements.

Testable Hypothesis

Building types with fewer benchmarked records will have greater variability in ENERGY STAR scores than building types with large sample sizes.
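
One way this hypothesis might be examined is to compute a spread measure per building type next to its record count. The sketch below uses made-up scores in `toy_scores` (not values from the benchmarking data). On the real data, the same `group_by()`/`summarise()` step would be applied to df1 after dropping missing scores.

```r
library(dplyr)

# Made-up scores for illustration only; not values from the benchmarking data
toy_scores <- tibble(
  building_type = c(rep("Type A", 6), rep("Type B", 3)),
  energystar_score = c(70, 72, 75, 78, 80, 81, 20, 55, 95)
)

# Record count plus two spread measures per building type
spread_by_type <- toy_scores %>%
  group_by(building_type) %>%
  summarise(
    n = n(),
    sd_es = sd(energystar_score),
    iqr_es = IQR(energystar_score),
    .groups = "drop"
  ) %>%
  arrange(n)

spread_by_type
```

In this toy table the smaller group (Type B) has the larger standard deviation, which is the pattern the hypothesis predicts. With only eight building types in the real data, any relationship between n and spread would be suggestive at best.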

4C) Visualization

ggplot(gb1, aes(x = reorder(building_type, n), y = n)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Counts by Building Type (rarity = lowest probability group)",
    x = "Building type",
    y = "Number of records"
  )

5) Group By #2

This analysis groups buildings by neighborhood to examine differences in average greenhouse gas emissions intensity. The results indicate that neighborhoods vary widely in both the number of benchmarked buildings and their emissions profiles.

Probability Interpretation

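The probability that a randomly selected record comes from a particular neighborhood equals that neighborhood’s count divided by the total count. The summary below keeps only the 33,822 records with a reported GHG emissions intensity, so the probabilities are among those records. The neighborhood with the smallest count has the lowest probability of selection: DELRIDGE NEIGHBORHOODS, with 802 records (about 2.4%), is the rarest neighborhood group, and DOWNTOWN is the most common at about 16.8%.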

5A) Build the grouped summary

gb2 <- df1 %>%
  filter(!is.na(ghg_emissions_intensity)) %>%
  group_by(neighborhood) %>%
  summarise(
    n = n(),
    mean_ghg_int = mean(ghg_emissions_intensity, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  add_prob_and_tag(tag_n = 1)

gb2

## # A tibble: 13 × 5
##    neighborhood               n mean_ghg_int   prob rarity_tag       
##    <fct>                  <int>        <dbl>  <dbl> <chr>            
##  1 DELRIDGE NEIGHBORHOODS   802        2.16  0.0237 LOWEST_PROB_GROUP
##  2 SOUTHEAST                992        1.04  0.0293 other            
##  3 BALLARD                 1361        1.09  0.0402 other            
##  4 CENTRAL                 1367        1.14  0.0404 other            
##  5 NORTH                   1420        0.747 0.0420 other            
##  6 SOUTHWEST               1686        1.13  0.0498 other            
##  7 NORTHWEST               2526        1.07  0.0747 other            
##  8 LAKE UNION              2791        1.20  0.0825 other            
##  9 NORTHEAST               2862        1.25  0.0846 other            
## 10 GREATER DUWAMISH        3439        2.01  0.102  other            
## 11 MAGNOLIA / QUEEN ANNE   4312        1.02  0.127  other            
## 12 EAST                    4566        1.54  0.135  other            
## 13 DOWNTOWN                5698        1.87  0.168  other

Significance

Neighborhood-level differences in emissions intensity may reflect variations in building age, usage patterns, infrastructure, or zoning policies. Identifying neighborhoods with higher emissions intensity can inform targeted sustainability initiatives.

Testable Hypothesis

Neighborhoods with fewer benchmarked buildings have lower median total gross floor area than neighborhoods with many benchmarked buildings.

5C) Visualization

ggplot(gb2, aes(x = reorder(neighborhood, mean_ghg_int), y = mean_ghg_int)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Average GHG Emissions Intensity by Neighborhood",
    x = "Neighborhood",
    y = "Mean GHG emissions intensity"
  )

6) Group By #3

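Grouping buildings by construction era shows how mean Site Energy Use Intensity (EUI) differs with building age, although the relationship is not strictly monotonic. Mean Site EUI peaks in the 1920–1940 bin (71.5 kBtu/sf) and is lowest in the post-2020 bin (36.1 kBtu/sf, based on only 151 records), while the oldest bin (1900–1920) averages 52.4 kBtu/sf, below several later bins. Because Site EUI has extreme values (the maximum in the skim summary is about 44,198 kBtu/sf against a median of 36.9), bin means can be pulled upward by a few records.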

Probability Interpretation

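Each year-built bin represents a proportion of the records with a reported Site EUI. The bin with the smallest number of records corresponds to the lowest probability of selecting a building from that construction era at random. Here that is the post-2020 bin, with 151 of 33,424 records (about 0.45%). This bin is small partly because construction years in the data end at 2023, so it covers only a few years of its nominal 20-year width.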

6A) Build the grouped summary

gb3 <- df1 %>%
  filter(!is.na(site_eui_k_btu_sf), !is.na(year_built_bin)) %>%
  group_by(year_built_bin) %>%
  summarise(
    n = n(),
    mean_site_eui = mean(site_eui_k_btu_sf, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  add_prob_and_tag(tag_n = 1)

gb3

## # A tibble: 7 × 5
##   year_built_bin          n mean_site_eui    prob rarity_tag       
##   <fct>               <int>         <dbl>   <dbl> <chr>            
## 1 (2.02e+03,2.04e+03]   151          36.1 0.00452 LOWEST_PROB_GROUP
## 2 (1.94e+03,1.96e+03]  3039          59.2 0.0909  other            
## 3 (1.92e+03,1.94e+03]  3304          71.5 0.0989  other            
## 4 [1.9e+03,1.92e+03]   3816          52.4 0.114   other            
## 5 (1.96e+03,1.98e+03]  7074          56.3 0.212   other            
## 6 (1.98e+03,2e+03]     7808          56.7 0.234   other            
## 7 (2e+03,2.02e+03]     8232          49.3 0.246   other

Significance

The differences across bins suggest that building age may play a role in energy efficiency, plausibly through differences in building codes, materials, and mechanical systems, although the means here are not adjusted for building type or for outliers. Binning smooths year-to-year variation so that era-level differences are easier to see.

Testable Hypothesis

Buildings constructed before 1950 have a higher mean Site EUI than buildings constructed after 2000.
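
The 20-year bins do not align with the 1950 cutoff in this hypothesis (1950 falls inside the 1940–1960 bin), so a test would need to work from year_built directly. A minimal sketch on made-up records follows. `toy_buildings` and the `era` labels are hypothetical, and the Welch t-test is one reasonable choice rather than the report's method.

```r
library(dplyr)

# Made-up records for illustration only; not values from the benchmarking data
toy_buildings <- tibble(
  year_built = c(1910, 1925, 1938, 1947, 1949, 2003, 2008, 2012, 2016, 2021),
  site_eui_k_btu_sf = c(60, 75, 58, 90, 66, 40, 52, 47, 38, 55)
)

# Keep only the two eras named in the hypothesis and label them
era_df <- toy_buildings %>%
  filter(!is.na(site_eui_k_btu_sf), year_built < 1950 | year_built > 2000) %>%
  mutate(era = if_else(year_built < 1950, "pre-1950", "post-2000"))

# Means and medians per era; medians are less sensitive to extreme EUI values
era_summary <- era_df %>%
  group_by(era) %>%
  summarise(
    n = n(),
    mean_eui = mean(site_eui_k_btu_sf),
    median_eui = median(site_eui_k_btu_sf),
    .groups = "drop"
  )

# Welch two-sample t-test (unequal variances), two-sided by default
era_test <- t.test(site_eui_k_btu_sf ~ era, data = era_df)

era_summary
era_test
```

On the real data the same steps would start from df1 instead of `toy_buildings`. Given the extreme Site EUI values in the skim summary, comparing medians or trimming outliers first may give a more robust picture than the raw means.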

gb3 %>% arrange(year_built_bin)

## # A tibble: 7 × 5
##   year_built_bin          n mean_site_eui    prob rarity_tag       
##   <fct>               <int>         <dbl>   <dbl> <chr>            
## 1 [1.9e+03,1.92e+03]   3816          52.4 0.114   other            
## 2 (1.92e+03,1.94e+03]  3304          71.5 0.0989  other            
## 3 (1.94e+03,1.96e+03]  3039          59.2 0.0909  other            
## 4 (1.96e+03,1.98e+03]  7074          56.3 0.212   other            
## 5 (1.98e+03,2e+03]     7808          56.7 0.234   other            
## 6 (2e+03,2.02e+03]     8232          49.3 0.246   other            
## 7 (2.02e+03,2.04e+03]   151          36.1 0.00452 LOWEST_PROB_GROUP

6C) Visualization

ggplot(gb3, aes(x = year_built_bin, y = mean_site_eui, group = 1)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Mean Site EUI by Year Built Bin",
    x = "Year built bin",
    y = "Mean Site EUI (kBtu/sf)"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

7) Two categorical variables

Analyzing combinations of building type and compliance status shows how regulatory compliance varies across categories of buildings. Some combinations occur frequently, while others are rare.

Missing Combinations

A missing building type–compliance status combination could indicate that a compliance category is not applicable to a specific building type or that regulatory requirements differ across categories. In this dataset the check in 7A returns zero rows, so every building type appears with both compliance statuses.

Most and Least Common Combinations

The most common combinations represent typical building types that are frequently benchmarked and compliant, while the least common combinations represent edge cases or rare scenarios within the dataset.

7A) Build “all combinations” and find missing

pairs <- df1 %>%
  filter(!is.na(building_type), !is.na(compliance_status)) %>%
  distinct(building_type, compliance_status)

all_pairs <- df1 %>%
  distinct(building_type) %>%
  crossing(df1 %>% distinct(compliance_status))

missing_pairs <- all_pairs %>%
  anti_join(pairs, by = c("building_type", "compliance_status"))

missing_pairs

## # A tibble: 0 × 2
## # ℹ 2 variables: building_type <fct>, compliance_status <fct>

Significance

Understanding which combinations are rare or missing helps contextualize compliance patterns and may point to structural or policy-driven differences rather than data errors.

Testable Hypothesis

Non-compliant records are more prevalent among certain building types than others, even after accounting for differences in total record counts.

7B) Most/least common combinations

combo_counts <- df1 %>%
  filter(!is.na(building_type), !is.na(compliance_status)) %>%
  count(building_type, compliance_status, sort = TRUE) %>%
  mutate(prob = n / sum(n))

combo_counts %>% slice_head(n = 10)   # most common

## # A tibble: 10 × 4
##    building_type        compliance_status     n   prob
##    <fct>                <fct>             <int>  <dbl>
##  1 NonResidential       Compliant         12693 0.366 
##  2 Multifamily LR (1-4) Compliant          9920 0.286 
##  3 Multifamily MR (5-9) Compliant          6574 0.189 
##  4 Multifamily HR (10+) Compliant          1189 0.0343
##  5 NonResidential       Not Compliant       987 0.0284
##  6 SPS-District K-12    Compliant           825 0.0238
##  7 Nonresidential COS   Compliant           641 0.0185
##  8 Multifamily LR (1-4) Not Compliant       622 0.0179
##  9 Campus               Compliant           405 0.0117
## 10 Multifamily MR (5-9) Not Compliant       401 0.0116

combo_counts %>% slice_tail(n = 10)   # least common

## # A tibble: 10 × 4
##    building_type        compliance_status     n     prob
##    <fct>                <fct>             <int>    <dbl>
##  1 Nonresidential COS   Compliant           641 0.0185  
##  2 Multifamily LR (1-4) Not Compliant       622 0.0179  
##  3 Campus               Compliant           405 0.0117  
##  4 Multifamily MR (5-9) Not Compliant       401 0.0116  
##  5 Nonresidential WA    Compliant           207 0.00597 
##  6 SPS-District K-12    Not Compliant       113 0.00326 
##  7 Multifamily HR (10+) Not Compliant        86 0.00248 
##  8 Campus               Not Compliant        18 0.000519
##  9 Nonresidential COS   Not Compliant        10 0.000288
## 10 Nonresidential WA    Not Compliant         8 0.000231
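
The hypothesis stated under 7A can be explored directly from the 16 counts printed above. The sketch below rebuilds the building type by compliance status table from those counts, computes the non-compliant share within each building type, and runs a chi-squared test of independence. The object names are illustrative, and the test is one possible approach rather than part of the original analysis.

```r
# Counts copied from the combo_counts output above (Compliant, Not Compliant)
compliance_tab <- matrix(
  c(12693, 987,
     9920, 622,
     6574, 401,
     1189,  86,
      825, 113,
      641,  10,
      405,  18,
      207,   8),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    building_type = c(
      "NonResidential", "Multifamily LR (1-4)", "Multifamily MR (5-9)",
      "Multifamily HR (10+)", "SPS-District K-12", "Nonresidential COS",
      "Campus", "Nonresidential WA"
    ),
    compliance_status = c("Compliant", "Not Compliant")
  )
)

# Share of non-compliant records within each building type
noncompliant_share <- compliance_tab[, "Not Compliant"] / rowSums(compliance_tab)
round(sort(noncompliant_share, decreasing = TRUE), 3)

# Chi-squared test of independence between building type and compliance status
chisq.test(compliance_tab)
```

Dividing by each type's own total accounts for the differences in record counts. By this measure SPS-District K-12 has the highest non-compliant share (113 of 938 records) and Nonresidential COS the lowest (10 of 651). The records are building-year observations, so the same building can appear in several years. That weakens the independence assumption behind the test, and its result should be read with caution.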

7C) Visualize at least one combination

A clean, readable option is a heatmap tile plot:

ggplot(combo_counts, aes(x = building_type, y = compliance_status, fill = n)) +
  geom_tile() +
  labs(
    title = "Counts of Building Type × Compliance Status",
    x = "Building type",
    y = "Compliance status"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Week 3

Divya Kapoor

2026-02-01
