This analysis uses the Building Energy Benchmarking Data (2015–Present) dataset, which contains detailed information on building characteristics, energy usage, greenhouse gas emissions, and compliance status for benchmarked buildings. The dataset includes both categorical variables (such as building type, neighborhood, and compliance status) and numerical variables (such as ENERGY STAR score, Site EUI, and GHG emissions intensity). To support grouped analysis and probability calculations, the data were cleaned using standardized column names and minimal type conversions. A binned version of year_built was created to enable comparisons across construction eras rather than individual years.
library(tidyverse)
library(janitor)
library(scales)
df <- read_csv("/Users/divya/Desktop/IU/Statistics R Prog/Labs/Assignments/Building_Energy_Benchmarking_Data__2015-Present.csv")
df <- clean_names(df)
glimpse(df)
## Rows: 34,699
## Columns: 46
## $ ose_building_id <dbl> 1, 2, 3, 5, 8, 9, 10, 11, 12, 13,…
## $ data_year <dbl> 2024, 2024, 2024, 2024, 2024, 202…
## $ building_name <chr> "MAYFLOWER PARK HOTEL", "PARAMOUN…
## $ building_type <chr> "NonResidential", "NonResidential…
## $ tax_parcel_identification_number <chr> "659000030", "659000220", "659000…
## $ address <chr> "405 OLIVE WAY", "724 PINE ST", "…
## $ city <chr> "SEATTLE", "SEATTLE", "SEATTLE", …
## $ state <chr> "WA", "WA", "WA", "WA", "WA", "WA…
## $ zip_code <dbl> 98101, 98101, 98101, 98101, 98121…
## $ latitude <dbl> 47.61220, 47.61307, 47.61367, 47.…
## $ longitude <dbl> -122.3380, -122.3336, -122.3382, …
## $ neighborhood <chr> "DOWNTOWN", "DOWNTOWN", "DOWNTOWN…
## $ council_district_code <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 1, 1, 7, …
## $ year_built <dbl> 1927, 1996, 1969, 1926, 1980, 199…
## $ numberof_floors <dbl> 12, 11, 41, 10, 18, 2, 11, 8, 15,…
## $ numberof_buildings <dbl> 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ property_gfa_total <dbl> 88434, 103566, 956110, 61320, 175…
## $ property_gfa_buildings <dbl> 88434, 88502, 759392, 61320, 1135…
## $ property_gfa_parking <dbl> 0, 15064, 196718, 0, 62000, 37198…
## $ self_report_gfa_total <dbl> 115387, 103566, 947059, 61320, 20…
## $ self_report_gfa_buildings <dbl> 115387, 88502, 827566, 61320, 123…
## $ self_report_parking <dbl> 0, 15064, 119493, 0, 80497, 40971…
## $ energystar_score <dbl> 59, 85, 71, 50, 87, NA, 10, NA, 5…
## $ site_euiwn_k_btu_sf <dbl> 62.2, 71.9, 82.0, 87.2, 97.6, 168…
## $ site_eui_k_btu_sf <dbl> 61.7, 71.5, 81.7, 86.0, 97.1, 167…
## $ site_energy_use_k_btu <dbl> 7113958, 6330664, 67613264, 52739…
## $ site_energy_use_wn_k_btu <dbl> 7172158, 6362478, 67852608, 53463…
## $ source_euiwn_k_btu_sf <dbl> 122.9, 128.7, 171.8, 174.7, 167.6…
## $ source_eui_k_btu_sf <dbl> 121.4, 128.3, 171.5, 171.4, 167.2…
## $ epa_property_type <chr> "Hotel", "Hotel", "Hotel", "Hotel…
## $ largest_property_use_type <chr> "Hotel", "Hotel", "Hotel", "Hotel…
## $ largest_property_use_type_gfa <dbl> 115387, 88502, 827566, 61320, 123…
## $ second_largest_property_use_type <chr> NA, "Parking", "Parking", NA, "Pa…
## $ second_largest_property_use_type_gfa <dbl> NA, 15064, 117783, NA, 68009, 409…
## $ third_largest_property_use_type <chr> NA, NA, "Swimming Pool", NA, "Swi…
## $ third_largest_property_use_type_gfa <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ electricity_k_wh <dbl> 1045040, 787838, 11279080, 796976…
## $ steam_use_k_btu <dbl> 1949686, NA, 23256386, 1389935, N…
## $ natural_gas_therms <dbl> 15986, 36426, 58726, 11648, 73811…
## $ compliance_status <chr> "Not Compliant", "Compliant", "Co…
## $ compliance_issue <chr> "Default Data", "No Issue", "No I…
## $ electricity_k_btu <dbl> 3565676, 2688104, 38484221, 27192…
## $ natural_gas_k_btu <dbl> 1598590, 3642560, 5872650, 116476…
## $ total_ghg_emissions <dbl> 263.3, 208.6, 2418.2, 190.1, 417.…
## $ ghg_emissions_intensity <dbl> 2.98, 2.36, 3.18, 3.10, 3.68, 2.8…
## $ demolished <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
skimr::skim(df)
| Name | df |
| Number of rows | 34699 |
| Number of columns | 46 |
| _______________________ | |
| Column type frequency: | |
| character | 13 |
| logical | 1 |
| numeric | 32 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| building_name | 0 | 1.00 | 3 | 70 | 0 | 3785 | 0 |
| building_type | 0 | 1.00 | 6 | 20 | 0 | 8 | 0 |
| tax_parcel_identification_number | 0 | 1.00 | 1 | 11 | 0 | 3691 | 0 |
| address | 0 | 1.00 | 9 | 33 | 0 | 3759 | 0 |
| city | 0 | 1.00 | 7 | 7 | 0 | 2 | 0 |
| state | 0 | 1.00 | 2 | 2 | 0 | 1 | 0 |
| neighborhood | 0 | 1.00 | 4 | 22 | 0 | 13 | 0 |
| epa_property_type | 21 | 1.00 | 5 | 52 | 0 | 71 | 0 |
| largest_property_use_type | 21 | 1.00 | 5 | 52 | 0 | 69 | 0 |
| second_largest_property_use_type | 14465 | 0.58 | 5 | 52 | 0 | 66 | 0 |
| third_largest_property_use_type | 26623 | 0.23 | 5 | 52 | 0 | 57 | 0 |
| compliance_status | 0 | 1.00 | 9 | 13 | 0 | 2 | 0 |
| compliance_issue | 459 | 0.99 | 8 | 61 | 0 | 12 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| demolished | 0 | 1 | 0.01 | FAL: 34370, TRU: 329 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ose_building_id | 0 | 1.00 | 23064.46 | 13665.32 | 1.00 | 20224.00 | 23454.00 | 26671.00 | 5.204600e+04 | ▂▂▇▁▂ |
| data_year | 0 | 1.00 | 2019.65 | 2.86 | 2015.00 | 2017.00 | 2020.00 | 2022.00 | 2.024000e+03 | ▇▇▇▇▇ |
| zip_code | 0 | 1.00 | 98116.54 | 16.74 | 98101.00 | 98105.00 | 98112.00 | 98122.00 | 9.819900e+04 | ▇▃▁▁▁ |
| latitude | 0 | 1.00 | 47.62 | 0.05 | 47.50 | 47.60 | 47.62 | 47.66 | 4.773000e+01 | ▁▂▇▃▂ |
| longitude | 0 | 1.00 | -122.33 | 0.03 | -122.41 | -122.35 | -122.33 | -122.32 | -1.222600e+02 | ▁▃▇▃▁ |
| council_district_code | 0 | 1.00 | 4.22 | 2.24 | 1.00 | 2.00 | 4.00 | 7.00 | 7.000000e+00 | ▆▃▂▂▇ |
| year_built | 0 | 1.00 | 1972.26 | 34.26 | 1900.00 | 1952.00 | 1979.00 | 2001.00 | 2.023000e+03 | ▃▃▆▇▇ |
| numberof_floors | 0 | 1.00 | 5.03 | 5.66 | 0.00 | 3.00 | 4.00 | 6.00 | 7.600000e+01 | ▇▁▁▁▁ |
| numberof_buildings | 374 | 0.99 | 1.27 | 5.74 | 1.00 | 1.00 | 1.00 | 1.00 | 3.000000e+02 | ▇▁▁▁▁ |
| property_gfa_total | 0 | 1.00 | 106983.36 | 300172.30 | 18481.00 | 29522.00 | 46527.00 | 98220.00 | 1.521647e+07 | ▇▁▁▁▁ |
| property_gfa_buildings | 0 | 1.00 | 90968.40 | 282479.55 | 12806.00 | 27500.00 | 43228.00 | 87513.00 | 1.521647e+07 | ▇▁▁▁▁ |
| property_gfa_parking | 0 | 1.00 | 14767.03 | 46665.34 | 0.00 | 0.00 | 0.00 | 5571.00 | 6.867500e+05 | ▇▁▁▁▁ |
| self_report_gfa_total | 0 | 1.00 | 108377.80 | 302851.80 | 0.00 | 29351.00 | 46677.00 | 100536.00 | 1.521647e+07 | ▇▁▁▁▁ |
| self_report_gfa_buildings | 0 | 1.00 | 90885.16 | 282718.77 | 0.00 | 27200.00 | 42952.00 | 87588.00 | 1.521647e+07 | ▇▁▁▁▁ |
| self_report_parking | 0 | 1.00 | 17492.64 | 49046.64 | 0.00 | 0.00 | 0.00 | 12000.00 | 7.867710e+05 | ▇▁▁▁▁ |
| energystar_score | 9285 | 0.73 | 72.36 | 25.49 | 1.00 | 59.00 | 80.00 | 93.00 | 1.000000e+02 | ▁▁▂▃▇ |
| site_euiwn_k_btu_sf | 1642 | 0.95 | 56.58 | 299.82 | 0.10 | 27.90 | 37.50 | 58.00 | 4.419800e+04 | ▇▁▁▁▁ |
| site_eui_k_btu_sf | 1275 | 0.96 | 55.88 | 294.91 | 0.10 | 27.50 | 36.90 | 57.00 | 4.419790e+04 | ▇▁▁▁▁ |
| site_energy_use_k_btu | 1268 | 0.96 | 9330387.13 | 562328180.72 | 433.00 | 945454.50 | 1845118.00 | 4290153.00 | 1.019950e+11 | ▇▁▁▁▁ |
| site_energy_use_wn_k_btu | 1634 | 0.95 | 9325697.04 | 565374694.75 | 433.00 | 956923.00 | 1869044.00 | 4326666.00 | 1.020130e+11 | ▇▁▁▁▁ |
| source_euiwn_k_btu_sf | 1642 | 0.95 | 123.20 | 345.69 | -2.10 | 69.50 | 89.00 | 125.20 | 4.646800e+04 | ▇▁▁▁▁ |
| source_eui_k_btu_sf | 1275 | 0.96 | 121.96 | 339.82 | -2.00 | 68.70 | 87.60 | 123.80 | 4.646780e+04 | ▇▁▁▁▁ |
| largest_property_use_type_gfa | 21 | 1.00 | 85121.83 | 245489.12 | 5100.00 | 25880.00 | 41256.00 | 81526.00 | 1.521647e+07 | ▇▁▁▁▁ |
| second_largest_property_use_type_gfa | 16191 | 0.53 | 33818.68 | 58935.77 | 1.00 | 6499.00 | 13252.00 | 34164.00 | 7.521630e+05 | ▇▁▁▁▁ |
| third_largest_property_use_type_gfa | 27474 | 0.21 | 14288.24 | 30452.70 | 35.00 | 3000.00 | 6071.00 | 12851.00 | 4.806250e+05 | ▇▁▁▁▁ |
| electricity_k_wh | 956 | 0.97 | 1080161.25 | 4784350.54 | 1.00 | 190533.00 | 349144.00 | 825645.00 | 2.807280e+08 | ▇▁▁▁▁ |
| steam_use_k_btu | 33504 | 0.03 | 26768064.20 | 318465114.72 | 1194.00 | 958647.50 | 2246909.00 | 5786526.00 | 7.060000e+09 | ▇▁▁▁▁ |
| natural_gas_therms | 13421 | 0.61 | 25253.70 | 192336.85 | 1.00 | 4154.25 | 9631.00 | 21425.00 | 2.392346e+07 | ▇▁▁▁▁ |
| electricity_k_btu | 956 | 0.97 | 3685510.20 | 16324204.03 | 3.00 | 650098.00 | 1191278.00 | 2817100.00 | 9.578439e+08 | ▇▁▁▁▁ |
| natural_gas_k_btu | 13414 | 0.61 | 2524538.99 | 19230576.90 | 4.00 | 415126.00 | 962770.00 | 2142000.00 | 2.392346e+09 | ▇▁▁▁▁ |
| total_ghg_emissions | 855 | 0.98 | 182.02 | 4986.97 | 0.10 | 7.90 | 32.50 | 93.60 | 5.816548e+05 | ▇▁▁▁▁ |
| ghg_emissions_intensity | 877 | 0.97 | 1.40 | 18.13 | 0.01 | 0.18 | 0.63 | 1.39 | 2.547820e+03 | ▇▁▁▁▁ |
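Minimal data cleaning was applied to prepare the dataset for grouping and probability analysis. Key variables were converted to appropriate data types, construction years were binned into 20-year intervals for interpretability, and records with missing values in the essential grouping fields (building_type, neighborhood, data_year) were filtered out to ensure reliable group counts. The skim summary above reports no missing values in these three fields, so the filter acts as a safeguard and does not drop any rows from this extract.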
df1 <- df %>%
mutate(
data_year = as.integer(data_year),
year_built = as.integer(year_built),
council_district_code = as.factor(council_district_code),
demolished = as.factor(demolished),
compliance_status = as.factor(compliance_status),
building_type = as.factor(building_type),
neighborhood = as.factor(neighborhood),
epa_property_type = as.factor(epa_property_type),
# bin construction years into 20-year intervals with ggplot2::cut_interval()
year_built_bin = cut_interval(year_built, length = 20)
) %>%
filter(!is.na(building_type), !is.na(neighborhood), !is.na(data_year))
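With the default settings, `cut()` abbreviates four-digit break points, so the bin labels print in scientific notation (for example `(1.92e+03,1.94e+03]`). The following is a minimal sketch of one way to get plain-year labels. It assumes `ggplot2` is installed and that `cut_interval()` forwards `dig.lab` to `base::cut()`; the `years` vector is a hypothetical stand-in for `year_built`.

```r
library(ggplot2)

# Hypothetical construction years standing in for df$year_built
years <- c(1900L, 1927L, 1969L, 1996L, 2023L)

# dig.lab = 4 lets cut() use up to four digits when formatting break points,
# so a break such as 1920 is printed in full instead of as 1.92e+03
bins <- cut_interval(years, length = 20, dig.lab = 4)

levels(bins)
```

Only the labels change under this option; the bin boundaries and the counts per bin stay the same.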
The helper function below converts group counts into probabilities by dividing each count by the total across groups, tags the group (or tied groups) with the smallest count as the lowest-probability group, and sorts the result from rarest to most common.
library(tidyverse)
add_prob_and_tag <- function(group_df, n_col = "n", tag_n = 1) {
group_df %>%
mutate(
prob = .data[[n_col]] / sum(.data[[n_col]]),
rarity_tag = if_else(rank(.data[[n_col]], ties.method = "min") <= tag_n,
"LOWEST_PROB_GROUP", "other")
) %>%
arrange(.data[[n_col]])
}
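To make the helper's behavior concrete, the following sketch applies the same two steps (count divided by total, then a minimum-rank tag) to a small table. The `toy` counts are invented for illustration and are not taken from the benchmarking data. The example also shows that with `ties.method = "min"`, groups tied for the smallest count are all tagged.

```r
library(dplyr)

# Hypothetical counts, invented for illustration (not from the benchmarking data)
toy <- tibble(group = c("A", "B", "C", "D"), n = c(2L, 2L, 6L, 10L))

toy_tagged <- toy %>%
  mutate(
    prob = n / sum(n),                   # each group's share of all records
    rk = rank(n, ties.method = "min"),   # tied counts share the lowest rank
    rarity_tag = if_else(rk <= 1, "LOWEST_PROB_GROUP", "other")
  ) %>%
  arrange(n)

toy_tagged
```

Groups A and B each hold 2 of the 20 records, so both receive probability 0.1 and both are tagged.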
This grouping examines how ENERGY STAR scores vary across building types. Record counts range from 95 (Nonresidential COS) to 8,900 (NonResidential), and mean scores range from 51.2 (Nonresidential COS) to 83.3 (Multifamily MR (5-9)).
Because records without an ENERGY STAR score are filtered out first, the probabilities are computed over the 25,414 scored records rather than all 34,699 rows. If a single scored record is selected at random, the probability that it belongs to a given building type equals that type's count divided by the total. The building type with the smallest count is the lowest-probability group, that is, the rarest building type among scored records.
gb1 <- df1 %>%
filter(!is.na(energystar_score)) %>%
group_by(building_type) %>%
summarise(
n = n(),
mean_es = mean(energystar_score, na.rm = TRUE),
median_es = median(energystar_score, na.rm = TRUE),
.groups = "drop"
) %>%
add_prob_and_tag(tag_n = 1)
gb1
## # A tibble: 8 × 6
## building_type n mean_es median_es prob rarity_tag
## <fct> <int> <dbl> <dbl> <dbl> <chr>
## 1 Nonresidential COS 95 51.2 54 0.00374 LOWEST_PROB_GROUP
## 2 Nonresidential WA 106 51.3 55 0.00417 other
## 3 Campus 159 64.5 69 0.00626 other
## 4 SPS-District K-12 905 75.1 79 0.0356 other
## 5 Multifamily HR (10+) 1096 61.7 68 0.0431 other
## 6 Multifamily MR (5-9) 5638 83.3 92 0.222 other
## 7 Multifamily LR (1-4) 8515 75.1 81 0.335 other
## 8 NonResidential 8900 64.4 72 0.350 other
Understanding which building types are rare is important because small sample sizes make group averages less stable and may reflect structural differences in the city's building stock or benchmarking requirements. Differences in ENERGY STAR performance by building type also highlight opportunities for targeted energy efficiency improvements.
Testable hypothesis: building types with fewer benchmarked records will have greater variability in ENERGY STAR scores than building types with large sample sizes.
ggplot(gb1, aes(x = reorder(building_type, n), y = n)) +
geom_col() +
coord_flip() +
scale_y_continuous(labels = comma) +
labs(
title = "Counts by Building Type (rarity = lowest probability group)",
x = "Building type",
y = "Number of records"
)
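The variability hypothesis stated above is not tested by the counts alone. One way to examine it would be to add spread measures to the grouped summary. The sketch below does this on a hypothetical `toy` table (invented scores, not the benchmarking data); with the real data, `df1` filtered to non-missing `energystar_score` would take its place. Note that a small group does not necessarily have a larger standard deviation; what shrinks with group size is the standard error of the mean, `sd / sqrt(n)`.

```r
library(dplyr)

# Hypothetical scores, invented for illustration only
toy <- tibble(
  building_type = c(rep("Rare", 3), rep("Common", 6)),
  energystar_score = c(20, 55, 90, 70, 72, 75, 78, 80, 81)
)

spread_by_type <- toy %>%
  group_by(building_type) %>%
  summarise(
    n = n(),
    sd_es = sd(energystar_score),      # spread of scores within the group
    iqr_es = IQR(energystar_score),    # outlier-resistant spread
    se_mean = sd_es / sqrt(n),         # uncertainty of the group mean
    .groups = "drop"
  ) %>%
  arrange(n)

spread_by_type
```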
This analysis groups records by neighborhood to examine differences in average greenhouse gas emissions intensity. Mean intensity ranges from 0.747 (NORTH) to 2.16 (DELRIDGE NEIGHBORHOODS), and record counts range from 802 to 5,698.
The probability that a randomly selected record comes from a particular neighborhood equals that neighborhood's share of the 33,822 records with a reported emissions intensity. Because the dataset spans reporting years 2015 to 2024, a building can appear in several rows, so these counts are records rather than distinct buildings. The neighborhood with the smallest count has the lowest probability of selection, making it the rarest neighborhood group in the dataset.
gb2 <- df1 %>%
filter(!is.na(ghg_emissions_intensity)) %>%
group_by(neighborhood) %>%
summarise(
n = n(),
mean_ghg_int = mean(ghg_emissions_intensity, na.rm = TRUE),
.groups = "drop"
) %>%
add_prob_and_tag(tag_n = 1)
gb2
## # A tibble: 13 × 5
## neighborhood n mean_ghg_int prob rarity_tag
## <fct> <int> <dbl> <dbl> <chr>
## 1 DELRIDGE NEIGHBORHOODS 802 2.16 0.0237 LOWEST_PROB_GROUP
## 2 SOUTHEAST 992 1.04 0.0293 other
## 3 BALLARD 1361 1.09 0.0402 other
## 4 CENTRAL 1367 1.14 0.0404 other
## 5 NORTH 1420 0.747 0.0420 other
## 6 SOUTHWEST 1686 1.13 0.0498 other
## 7 NORTHWEST 2526 1.07 0.0747 other
## 8 LAKE UNION 2791 1.20 0.0825 other
## 9 NORTHEAST 2862 1.25 0.0846 other
## 10 GREATER DUWAMISH 3439 2.01 0.102 other
## 11 MAGNOLIA / QUEEN ANNE 4312 1.02 0.127 other
## 12 EAST 4566 1.54 0.135 other
## 13 DOWNTOWN 5698 1.87 0.168 other
Neighborhood-level differences in emissions intensity may reflect variations in building age, usage patterns, infrastructure, or zoning policies. Identifying neighborhoods with higher emissions intensity can inform targeted sustainability initiatives.
Testable hypothesis: neighborhoods with fewer benchmarked records have lower median total gross floor area than neighborhoods with many benchmarked records.
ggplot(gb2, aes(x = reorder(neighborhood, mean_ghg_int), y = mean_ghg_int)) +
geom_col() +
coord_flip() +
labs(
title = "Average GHG Emissions Intensity by Neighborhood",
x = "Neighborhood",
y = "Mean GHG emissions intensity"
)
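The skim summary reports a mean GHG emissions intensity of 1.40 against a median of 0.63 and a maximum above 2,500, so neighborhood means can be pulled upward by a few extreme records. A hedged sketch of reporting the median next to the mean follows; the `toy` values are hypothetical and chosen so that one outlier reverses the ranking.

```r
library(dplyr)

# Hypothetical intensities, invented for illustration; neighborhood "X" has one outlier
toy <- tibble(
  neighborhood = rep(c("X", "Y"), each = 4),
  ghg_emissions_intensity = c(0.5, 0.6, 0.7, 50, 0.9, 1.0, 1.1, 1.2)
)

ghg_summary <- toy %>%
  group_by(neighborhood) %>%
  summarise(
    n = n(),
    mean_ghg_int = mean(ghg_emissions_intensity),
    median_ghg_int = median(ghg_emissions_intensity),  # resistant to the outlier
    .groups = "drop"
  )

ghg_summary
```

In this toy table, X has the higher mean but the lower median, which is the kind of disagreement worth checking before ranking neighborhoods by mean intensity.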
Grouping records by construction era gives a rough view of how Site Energy Use Intensity (EUI) relates to building age. The relationship is not monotonic: mean Site EUI peaks in the 1920–1940 bin (71.5 kBtu/sf), the oldest bin (1900–1920) averages only 52.4, and the newest bins are lowest (49.3 for 2000–2020 and 36.1 for 2020–2040, the latter based on only 151 records).
Each year-built bin holds a share of the 33,424 records with a reported Site EUI. The bin with the fewest records is the construction era least likely to be drawn when a record is selected at random.
gb3 <- df1 %>%
filter(!is.na(site_eui_k_btu_sf), !is.na(year_built_bin)) %>%
group_by(year_built_bin) %>%
summarise(
n = n(),
mean_site_eui = mean(site_eui_k_btu_sf, na.rm = TRUE),
.groups = "drop"
) %>%
add_prob_and_tag(tag_n = 1)
gb3
## # A tibble: 7 × 5
## year_built_bin n mean_site_eui prob rarity_tag
## <fct> <int> <dbl> <dbl> <chr>
## 1 (2.02e+03,2.04e+03] 151 36.1 0.00452 LOWEST_PROB_GROUP
## 2 (1.94e+03,1.96e+03] 3039 59.2 0.0909 other
## 3 (1.92e+03,1.94e+03] 3304 71.5 0.0989 other
## 4 [1.9e+03,1.92e+03] 3816 52.4 0.114 other
## 5 (1.96e+03,1.98e+03] 7074 56.3 0.212 other
## 6 (1.98e+03,2e+03] 7808 56.7 0.234 other
## 7 (2e+03,2.02e+03] 8232 49.3 0.246 other
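These differences are consistent with building age playing a role in energy efficiency, possibly through changes in building codes, materials, and mechanical systems, but the binned means do not establish this on their own: Site EUI has extreme outliers (the maximum exceeds 44,000 kBtu/sf against a median of 36.9), and the mix of building types may differ across eras. Binning smooths year-to-year variation so that broad differences between eras are easier to see.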
Testable hypothesis: buildings constructed before 1950 have a higher mean Site EUI than buildings constructed after 2000.
gb3 %>% arrange(year_built_bin)
## # A tibble: 7 × 5
## year_built_bin n mean_site_eui prob rarity_tag
## <fct> <int> <dbl> <dbl> <chr>
## 1 [1.9e+03,1.92e+03] 3816 52.4 0.114 other
## 2 (1.92e+03,1.94e+03] 3304 71.5 0.0989 other
## 3 (1.94e+03,1.96e+03] 3039 59.2 0.0909 other
## 4 (1.96e+03,1.98e+03] 7074 56.3 0.212 other
## 5 (1.98e+03,2e+03] 7808 56.7 0.234 other
## 6 (2e+03,2.02e+03] 8232 49.3 0.246 other
## 7 (2.02e+03,2.04e+03] 151 36.1 0.00452 LOWEST_PROB_GROUP
ggplot(gb3, aes(x = year_built_bin, y = mean_site_eui, group = 1)) +
geom_line() +
geom_point() +
labs(
title = "Mean Site EUI by Year Built Bin",
x = "Year built bin",
y = "Mean Site EUI (kBtu/sf)"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
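The 1950 cutoff in the hypothesis above falls inside the (1940, 1960] bin, so the 20-year bins cannot test it directly. A minimal sketch of a two-era comparison built from `year_built` itself is shown below. The `toy` rows and the era labels are hypothetical; with the real data, `df1` filtered to non-missing `site_eui_k_btu_sf` would replace `toy`.

```r
library(dplyr)

# Hypothetical records, invented for illustration only
toy <- tibble(
  year_built = c(1910L, 1935L, 1949L, 1975L, 2005L, 2015L),
  site_eui_k_btu_sf = c(60, 80, 70, 55, 45, 35)
)

era_summary <- toy %>%
  mutate(
    era = case_when(
      year_built < 1950 ~ "pre-1950",
      year_built > 2000 ~ "post-2000",
      TRUE ~ "1950-2000"               # not part of the hypothesis
    )
  ) %>%
  filter(era != "1950-2000") %>%
  group_by(era) %>%
  summarise(
    n = n(),
    mean_site_eui = mean(site_eui_k_btu_sf),
    median_site_eui = median(site_eui_k_btu_sf),
    .groups = "drop"
  )

era_summary
```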
Analyzing combinations of building type and compliance status shows how compliance varies across categories of buildings. Some combinations occur frequently, while others are rare.
### Missing Combinations
A missing building type–compliance status combination could indicate that a compliance category does not apply to a specific building type or that regulatory requirements differ across categories. In this dataset no combination is missing: the check below returns an empty tibble, so all 16 possible pairs (8 building types × 2 compliance statuses) occur at least once.
The most common combination, compliant NonResidential records, accounts for 36.6% of all records, and the three largest compliant groups together cover about 84%. The least common combinations, such as non-compliant Nonresidential WA records (8 rows), represent edge cases within the dataset.
pairs <- df1 %>%
filter(!is.na(building_type), !is.na(compliance_status)) %>%
distinct(building_type, compliance_status)
all_pairs <- df1 %>%
distinct(building_type) %>%
crossing(df1 %>% distinct(compliance_status))
missing_pairs <- all_pairs %>%
anti_join(pairs, by = c("building_type", "compliance_status"))
missing_pairs
## # A tibble: 0 × 2
## # ℹ 2 variables: building_type <fct>, compliance_status <fct>
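An alternative way to surface absent combinations, sketched here on hypothetical data, is to count with `.drop = FALSE`, which keeps factor-level combinations that have no rows and reports them with `n = 0`. This relies on both grouping columns being factors, as they are in `df1`.

```r
library(dplyr)

# Hypothetical records, invented for illustration: type "B" is never "Not Compliant"
toy <- tibble(
  building_type = factor(c("A", "A", "B")),
  compliance_status = factor(c("Compliant", "Not Compliant", "Compliant"))
)

# .drop = FALSE keeps empty factor-level combinations as zero-count rows
pair_counts <- toy %>%
  count(building_type, compliance_status, .drop = FALSE)

zero_pairs <- pair_counts %>% filter(n == 0)
zero_pairs
```

Compared with the `crossing()` plus `anti_join()` approach, this keeps the zero-count pairs in the same table as the observed counts, which can be convenient for plotting.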
Understanding which combinations are rare or missing helps contextualize compliance patterns and may point to structural or policy-driven differences rather than data errors.
Testable hypothesis: non-compliant records are more prevalent among certain building types than others, even after accounting for differences in total record counts.
combo_counts <- df1 %>%
filter(!is.na(building_type), !is.na(compliance_status)) %>%
count(building_type, compliance_status, sort = TRUE) %>%
mutate(prob = n / sum(n))
combo_counts %>% slice_head(n = 10) # most common
## # A tibble: 10 × 4
## building_type compliance_status n prob
## <fct> <fct> <int> <dbl>
## 1 NonResidential Compliant 12693 0.366
## 2 Multifamily LR (1-4) Compliant 9920 0.286
## 3 Multifamily MR (5-9) Compliant 6574 0.189
## 4 Multifamily HR (10+) Compliant 1189 0.0343
## 5 NonResidential Not Compliant 987 0.0284
## 6 SPS-District K-12 Compliant 825 0.0238
## 7 Nonresidential COS Compliant 641 0.0185
## 8 Multifamily LR (1-4) Not Compliant 622 0.0179
## 9 Campus Compliant 405 0.0117
## 10 Multifamily MR (5-9) Not Compliant 401 0.0116
combo_counts %>% slice_tail(n = 10) # least common
## # A tibble: 10 × 4
## building_type compliance_status n prob
## <fct> <fct> <int> <dbl>
## 1 Nonresidential COS Compliant 641 0.0185
## 2 Multifamily LR (1-4) Not Compliant 622 0.0179
## 3 Campus Compliant 405 0.0117
## 4 Multifamily MR (5-9) Not Compliant 401 0.0116
## 5 Nonresidential WA Compliant 207 0.00597
## 6 SPS-District K-12 Not Compliant 113 0.00326
## 7 Multifamily HR (10+) Not Compliant 86 0.00248
## 8 Campus Not Compliant 18 0.000519
## 9 Nonresidential COS Not Compliant 10 0.000288
## 10 Nonresidential WA Not Compliant 8 0.000231
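The joint probabilities above are dominated by the size of each building type. To compare types on an equal footing, one option is the share of non-compliant records within each type, P(Not Compliant | building type). The sketch below re-enters the 16 counts printed above by hand (rather than reading the CSV) and computes that share; the object names `counts` and `noncompliance_by_type` are illustrative.

```r
library(dplyr)
library(tibble)

# Counts copied from the combo_counts output shown above
counts <- tribble(
  ~building_type,         ~compliance_status, ~n,
  "NonResidential",       "Compliant",        12693,
  "NonResidential",       "Not Compliant",      987,
  "Multifamily LR (1-4)", "Compliant",         9920,
  "Multifamily LR (1-4)", "Not Compliant",      622,
  "Multifamily MR (5-9)", "Compliant",         6574,
  "Multifamily MR (5-9)", "Not Compliant",      401,
  "Multifamily HR (10+)", "Compliant",         1189,
  "Multifamily HR (10+)", "Not Compliant",       86,
  "SPS-District K-12",    "Compliant",          825,
  "SPS-District K-12",    "Not Compliant",      113,
  "Nonresidential COS",   "Compliant",          641,
  "Nonresidential COS",   "Not Compliant",       10,
  "Campus",               "Compliant",          405,
  "Campus",               "Not Compliant",       18,
  "Nonresidential WA",    "Compliant",          207,
  "Nonresidential WA",    "Not Compliant",        8
)

noncompliance_by_type <- counts %>%
  group_by(building_type) %>%
  mutate(p_within_type = n / sum(n)) %>%   # conditional on building type
  ungroup() %>%
  filter(compliance_status == "Not Compliant") %>%
  arrange(desc(p_within_type))

noncompliance_by_type
```

On these counts, SPS-District K-12 has the highest within-type share of non-compliant records (113 of 938, about 12%) even though it is one of the smaller groups, while Nonresidential COS has the lowest (10 of 651, about 1.5%).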
A tile heatmap gives a compact view of the count for every building type and compliance status combination:
ggplot(combo_counts, aes(x = building_type, y = compliance_status, fill = n)) +
geom_tile() +
labs(
title = "Counts of Building Type × Compliance Status",
x = "Building type",
y = "Compliance status"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Across all groupings, probability-based reasoning highlights how unevenly building records are distributed across categories. Group size plays a critical role in interpretation, as rare groups may exhibit more variability or reflect structural characteristics of the city rather than random noise. These grouped analyses provide a foundation for future statistical testing and deeper modeling of energy efficiency and emissions patterns.