Introduction

This analysis continues exploring the Building Energy Benchmarking dataset by focusing on sampling and variability in conclusions. Instead of analyzing the full dataset once, multiple subsamples are generated using sampling with replacement to simulate repeated data collection. Comparing summary statistics, anomalies, and category composition across samples helps evaluate how stable findings about Site Energy Use Intensity and related variables are under different sampling conditions. This approach highlights the importance of sample size and variability when drawing statistical conclusions.

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(janitor)
library(scales)
library(forcats)

data <- read_csv("/Users/divya/Desktop/IU/Statistics R Prog/Labs/Assignments/Building_Energy_Benchmarking_Data__2015-Present.csv") %>%
  clean_names()

# Optional: reuse your Week 3 minimal cleaning idea
df1 <- data %>%
  mutate(
    data_year = as.integer(data_year),
    year_built = as.integer(year_built),
    compliance_status = as.factor(compliance_status),
    largest_property_use_type = as.factor(largest_property_use_type)
  )

Build 3–7 random samples (with replacement)

I created 5 subsamples using sampling with replacement, each containing 25% of the rows in the original dataset (8,674 rows per subsample). Each subsample represents a plausible version of what the benchmarking dataset might look like if the same data collection process were repeated. This approach helps evaluate whether conclusions about energy intensity are stable or sensitive to sampling variation.

set.seed(123)

sample_frac <- 0.25   # try 0.10, 0.25, 0.75 later
n_samples   <- 5      # any number from 3 to 7

df_samples <- tibble()

for (i in 1:n_samples) {
  df_i <- df1 %>%
    sample_n(size = floor(sample_frac * nrow(df1)), replace = TRUE) %>%
    mutate(sample_num = i)
  
  df_samples <- bind_rows(df_samples, df_i)
}

df_samples %>% count(sample_num)
## # A tibble: 5 × 2
##   sample_num     n
##        <int> <int>
## 1          1  8674
## 2          2  8674
## 3          3  8674
## 4          4  8674
## 5          5  8674

Quantitative anomaly detection

The anomaly cutoff (the 95th percentile of Site EUI within each subsample) varies across samples, from about 133 to 142 kBtu/sf. The anomaly rate, by contrast, stays near 5% in every sample by construction, because the threshold is recomputed from each sample's own distribution. The practical consequence is that a building flagged as unusually high in one sample (for example, one at 135 kBtu/sf) can fall below the cutoff in another, even under the same percentile-based definition. Identifying outliers or anomalies should therefore be tied to a consistent rule and interpreted with awareness of sampling variability. Further question to be asked: Would a robust approach (median/IQR) or a transformation provide more consistent identification of extreme energy users across repeated samples?

anomaly_summary <- df_samples %>%
  filter(!is.na(site_eui_k_btu_sf)) %>%
  group_by(sample_num) %>%
  mutate(p95 = quantile(site_eui_k_btu_sf, 0.95, na.rm = TRUE),
         is_anomaly = site_eui_k_btu_sf > p95) %>%
  summarise(
    n = n(),
    p95_cutoff = first(p95),
    anomaly_count = sum(is_anomaly),
    anomaly_rate = mean(is_anomaly),
    .groups = "drop"
  )

anomaly_summary
## # A tibble: 5 × 5
##   sample_num     n p95_cutoff anomaly_count anomaly_rate
##        <int> <int>      <dbl>         <int>        <dbl>
## 1          1  8361       139.           418       0.0500
## 2          2  8358       133.           418       0.0500
## 3          3  8369       134.           419       0.0501
## 4          4  8375       141.           418       0.0499
## 5          5  8335       142.           416       0.0499
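
As one possible answer to the robust-approach question above, the sketch below applies Tukey's upper fence (Q3 + 1.5 × IQR) to each sample. It runs on synthetic data rather than the benchmarking file: `toy`, `site_eui`, and the planted extreme value are illustrative assumptions, so the numbers say nothing about the real dataset.

```r
library(dplyr)

# Synthetic stand-in for Site EUI (assumption): right-skewed lognormal values.
set.seed(123)
toy <- tibble(
  sample_num = rep(1:3, each = 1000),
  site_eui   = rlnorm(3000, meanlog = 3.6, sdlog = 0.5)
)
toy$site_eui[1500] <- 10000  # plant one extreme record in sample 2

robust_summary <- toy %>%
  group_by(sample_num) %>%
  summarise(
    q1           = quantile(site_eui, 0.25, names = FALSE),
    q3           = quantile(site_eui, 0.75, names = FALSE),
    upper_fence  = q3 + 1.5 * (q3 - q1),        # Tukey's upper fence
    flagged_rate = mean(site_eui > upper_fence), # not fixed at 5%, unlike a p95 rule
    sd_eui       = sd(site_eui),                 # shown for contrast: sensitive to the outlier
    .groups = "drop"
  )

robust_summary
```

In this toy setup the standard deviation of sample 2 is inflated by the single planted record, while the quartile-based fence is essentially unaffected. Whether the same holds for the real Site EUI column would need to be checked directly.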

Sample-size comparison (10% vs 25% vs 75%)

In this run, the 10% subsamples show the widest spread in mean Site EUI (52.9 to 66.4), compared with 51.8 to 58.4 for the 25% subsamples and 54.2 to 59.3 for the 75% subsamples. The medians barely move at any sample size (36.6 to 38.2). The pattern is consistent with a core sampling principle: larger samples produce more stable estimates and reduce the impact of random variation, while smaller samples are more sensitive to which observations are drawn. With only five draws per fraction, though, the ranking is suggestive rather than conclusive, and the standard deviations show that a few extreme records still affect even the 75% samples. Further questions to be asked: what sample fraction is large enough for mean Site EUI to stabilize for decision making, and would using a more robust statistic (like the median) reduce sensitivity for small samples?

set.seed(123)

make_samples <- function(df, sample_frac, n_samples = 5) {
  bind_rows(lapply(1:n_samples, function(i) {
    df %>%
      sample_n(size = floor(sample_frac * nrow(df)), replace = TRUE) %>%
      mutate(sample_num = i, sample_frac = sample_frac)
  }))
}

summarise_eui <- function(df_samples) {
  df_samples %>%
    filter(!is.na(site_eui_k_btu_sf)) %>%
    group_by(sample_frac, sample_num) %>%
    summarise(
      mean_eui = mean(site_eui_k_btu_sf, na.rm = TRUE),
      median_eui = median(site_eui_k_btu_sf, na.rm = TRUE),
      sd_eui = sd(site_eui_k_btu_sf, na.rm = TRUE),
      .groups = "drop"
    )
}

df_s_10 <- make_samples(df1, 0.10, n_samples = 5)
df_s_25 <- make_samples(df1, 0.25, n_samples = 5)
df_s_75 <- make_samples(df1, 0.75, n_samples = 5)

eui_all <- bind_rows(df_s_10, df_s_25, df_s_75) %>%
  summarise_eui()

eui_all
## # A tibble: 15 × 5
##    sample_frac sample_num mean_eui median_eui sd_eui
##          <dbl>      <int>    <dbl>      <dbl>  <dbl>
##  1        0.1           1     53.6       37     57.4
##  2        0.1           2     52.9       38.2   65.8
##  3        0.1           3     56.2       36.8  173. 
##  4        0.1           4     53.0       36.8   93.4
##  5        0.1           5     66.4       36.9  769. 
##  6        0.25          1     53.6       37     88.1
##  7        0.25          2     58.4       37.2  486. 
##  8        0.25          3     53.6       36.6  102. 
##  9        0.25          4     51.8       36.6   58.1
## 10        0.25          5     55.2       36.6  226. 
## 11        0.75          1     59.3       36.9  516. 
## 12        0.75          2     56.0       36.8  315. 
## 13        0.75          3     55.5       36.7  314. 
## 14        0.75          4     55.9       36.9  199. 
## 15        0.75          5     54.2       36.9  151.
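
The principle behind this comparison is that the standard error of a sample mean shrinks roughly as 1/sqrt(n). The sketch below illustrates it on a synthetic right-skewed population, using 200 draws per fraction instead of five so the spread is estimated less noisily. `population`, `spread_of_means()`, and the lognormal parameters are illustrative assumptions and are not taken from the dataset.

```r
set.seed(123)

# Synthetic right-skewed "population" (assumption, not the real Site EUI values).
population <- rlnorm(5000, meanlog = 3.6, sdlog = 0.8)

# Standard deviation of the sample mean across many draws at one sample fraction.
spread_of_means <- function(x, frac, reps = 200) {
  n <- floor(frac * length(x))
  means <- replicate(reps, mean(sample(x, size = n, replace = TRUE)))
  sd(means)
}

fracs  <- c(0.10, 0.25, 0.75)
spread <- sapply(fracs, function(f) spread_of_means(population, f))

# Compare the simulated spread with the textbook approximation sd / sqrt(n).
data.frame(
  sample_frac    = fracs,
  sd_of_means    = spread,
  theoretical_se = sd(population) / sqrt(floor(fracs * length(population)))
)
```

The simulated and theoretical columns should be close to each other, and both should fall as the fraction grows. Heavier tails than the lognormal used here would make the simulated values noisier.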

Compare key numeric summaries across samples (Site EUI)

The mean and standard deviation of Site EUI differ noticeably across the samples, while the median is nearly constant. Even though all samples are drawn from the same dataset, means range from 53.2 to 59.7 and standard deviations from 60.4 to 501, whereas medians stay between 36.6 and 37.4. This demonstrates that summary statistics, especially those sensitive to extreme values, can shift depending on which sample is observed. Because Site EUI can be influenced by extreme values and composition differences, relying on a single sample could lead to different conclusions about overall energy intensity. Further question to be asked: Are the differences driven primarily by outliers or by changes in the mix of building categories included in each subsample?

site_eui_by_sample <- df_samples %>%
  filter(!is.na(site_eui_k_btu_sf)) %>%
  group_by(sample_num) %>%
  summarise(
    n = n(),
    mean_eui = mean(site_eui_k_btu_sf, na.rm = TRUE),
    median_eui = median(site_eui_k_btu_sf, na.rm = TRUE),
    sd_eui = sd(site_eui_k_btu_sf, na.rm = TRUE),
    .groups = "drop"
  )

site_eui_by_sample
## # A tibble: 5 × 5
##   sample_num     n mean_eui median_eui sd_eui
##        <int> <int>    <dbl>      <dbl>  <dbl>
## 1          1  8361     53.2       37.4   60.4
## 2          2  8358     59.7       36.9  501. 
## 3          3  8369     53.6       37     88.1
## 4          4  8375     58.4       37.2  486. 
## 5          5  8335     53.6       36.6  102.

Visual: stability of mean Site EUI by sample fraction


ggplot(eui_all, aes(x = factor(sample_frac), y = mean_eui)) +
  geom_boxplot() +
  labs(
    title = "Stability of Mean Site EUI Estimates by Sample Size",
    x = "Sample fraction",
    y = "Mean Site EUI (kBtu/sf)"
  ) +
  theme_minimal(base_size = 12)

Rare Property Types by Sample

The rarest property types are not identical across subsamples. Among the most common types retained by fct_lump_n(n = 12), with all other types pooled together, Medical Office is the smallest group in sample 1 (101 records) while Senior Living Community is the smallest in sample 2 (103 records). A building type that appears extremely rare in one sample may appear less rare in another. This indicates that conclusions about rare groups can be unstable at smaller sample sizes, because a small change in which records are drawn can change group counts noticeably. It also affects the interpretation of group-level summary statistics, since smaller groups produce less stable averages. Further question to be asked: Should future analysis stratify by property type to reduce instability in group representation?

prop_counts <- df_samples %>%
  filter(!is.na(largest_property_use_type)) %>%
  mutate(largest_property_use_type = fct_lump_n(largest_property_use_type, n = 12)) %>%
  count(sample_num, largest_property_use_type, sort = TRUE) %>%
  group_by(sample_num) %>%
  mutate(prob_in_sample = n / sum(n)) %>%
  ungroup()

prop_counts %>%
  arrange(sample_num, prob_in_sample) %>%
  group_by(sample_num) %>%
  slice_head(n = 5)
## # A tibble: 25 × 4
## # Groups:   sample_num [5]
##    sample_num largest_property_use_type     n prob_in_sample
##         <int> <fct>                     <int>          <dbl>
##  1          1 Medical Office              101         0.0116
##  2          1 Senior Living Community     112         0.0129
##  3          1 Self-Storage Facility       123         0.0142
##  4          1 Distribution Center         134         0.0155
##  5          1 Worship Facility            163         0.0188
##  6          2 Senior Living Community     103         0.0119
##  7          2 Self-Storage Facility       110         0.0127
##  8          2 Medical Office              119         0.0137
##  9          2 Distribution Center         124         0.0143
## 10          2 Worship Facility            164         0.0189
## # ℹ 15 more rows
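
On the stratification question raised above, one option is to sample within each property type so that every type keeps the same share of its records in each draw. The sketch below is a minimal illustration on synthetic data: `toy`, `property_type`, and the group sizes are assumptions chosen only to mimic one dominant type and one rare type.

```r
library(dplyr)

set.seed(123)

# Synthetic data (assumption): one dominant, one mid-sized, and one rare type.
toy <- tibble(
  property_type = rep(c("Multifamily Housing", "Office", "Medical Office"),
                      times = c(5000, 1200, 100)),
  site_eui      = rlnorm(6300, meanlog = 3.6, sdlog = 0.5)
)

# Stratified 25% sample: draw with replacement inside each property type.
stratified <- toy %>%
  group_by(property_type) %>%
  slice_sample(prop = 0.25, replace = TRUE) %>%
  ungroup()

strata_counts <- stratified %>% count(property_type)
strata_counts
```

Each type contributes exactly 25% of its own size, so group counts no longer fluctuate between draws. The values within a small group still vary, so group-level averages for rare types remain noisy.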

Consistency check: correlation across samples

Across the five subsamples, the correlation between Site EUI and GHG emissions intensity is always positive, but its size ranges from 0.447 to 0.997. The consistent sign increases confidence that the direction of the relationship is not driven by a single unusual draw. The size of the swing, however, means the strength of the association is not pinned down. Pearson correlation is sensitive to extreme values, and samples 2 and 4, which have the highest correlations, also have the largest Site EUI standard deviations in the summary above, so a few very large records may be driving the result. Further question to be asked: How much uncertainty exists around the correlation estimate, and would a confidence interval or bootstrap distribution of the correlation show meaningful overlap across different sample sizes?

corr_by_sample <- df_samples %>%
  filter(!is.na(site_eui_k_btu_sf), !is.na(ghg_emissions_intensity)) %>%
  group_by(sample_num) %>%
  summarise(corr = cor(site_eui_k_btu_sf, ghg_emissions_intensity, use = "complete.obs"),
            .groups = "drop")

corr_by_sample
## # A tibble: 5 × 2
##   sample_num  corr
##        <int> <dbl>
## 1          1 0.825
## 2          2 0.997
## 3          3 0.742
## 4          4 0.997
## 5          5 0.447
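
One way to approach the uncertainty question above is a percentile bootstrap interval for the correlation, with a rank-based (Spearman) correlation as a check that is less sensitive to outliers. The sketch uses synthetic paired data: `x`, `y`, and their parameters are illustrative assumptions. Unlike the 25% subsamples in this report, it resamples at the full sample size, which is the usual bootstrap convention.

```r
set.seed(123)

# Synthetic paired data (assumption): x stands in for Site EUI, y for GHG intensity.
n <- 500
x <- rlnorm(n, meanlog = 3.6, sdlog = 0.5)
y <- 0.02 * x + rnorm(n, sd = 0.3)

# Bootstrap: resample row indices so each (x, y) pair stays together.
boot_corr <- replicate(1000, {
  idx <- sample.int(n, size = n, replace = TRUE)
  cor(x[idx], y[idx])
})

point_estimate <- cor(x, y)
ci <- quantile(boot_corr, c(0.025, 0.975), names = FALSE)  # 95% percentile interval
spearman <- cor(x, y, method = "spearman")                 # rank-based alternative

c(pearson = point_estimate, lower = ci[1], upper = ci[2], spearman = spearman)
```

A wide interval, or a large gap between the Pearson and Spearman values, would suggest that a few extreme records are driving the estimate.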

Consistency check: top property types repeated across samples

The most common property use types repeat across the subsamples, even though their exact counts change from sample to sample. In all five samples the top three are Multifamily Housing, Office, and Non-Refrigerated Warehouse, in that order. This suggests that while proportions fluctuate, the dataset's dominant categories are stable. This stability helps explain why some high-level conclusions, such as the identity of the most common building types, remain consistent across resampled datasets. Further question to be asked: Do the dominant categories also show stable energy intensity patterns over time, or do they change depending on year or compliance status?

top_types_by_sample <- df_samples %>%
  filter(!is.na(largest_property_use_type)) %>%
  count(sample_num, largest_property_use_type, sort = TRUE) %>%
  group_by(sample_num) %>%
  slice_max(n, n = 3) %>%
  ungroup()

top_types_by_sample
## # A tibble: 15 × 3
##    sample_num largest_property_use_type      n
##         <int> <fct>                      <int>
##  1          1 Multifamily Housing         4516
##  2          1 Office                      1259
##  3          1 Non-Refrigerated Warehouse   405
##  4          2 Multifamily Housing         4609
##  5          2 Office                      1242
##  6          2 Non-Refrigerated Warehouse   396
##  7          3 Multifamily Housing         4644
##  8          3 Office                      1209
##  9          3 Non-Refrigerated Warehouse   427
## 10          4 Multifamily Housing         4582
## 11          4 Office                      1219
## 12          4 Non-Refrigerated Warehouse   415
## 13          5 Multifamily Housing         4613
## 14          5 Office                      1272
## 15          5 Non-Refrigerated Warehouse   425

Conclusion

Across the resampled datasets, estimates of mean Site EUI change moderately, with the greatest variability at the smallest sample size and somewhat more stability at larger sample sizes, while the median stays nearly constant. This shows that conclusions based on a single dataset snapshot can be sensitive to sampling variation, especially when outliers and shifting category composition are present. However, some patterns remain consistent, such as the dominant property categories and the positive direction of the association between Site EUI and GHG emissions intensity. Moving forward, I would be cautious about making strong claims from small samples or from groups with low counts, and I would rely more on robust statistics (median/IQR) and sensitivity checks. A next step would be to quantify uncertainty more explicitly and test whether conclusions hold when stratifying by key categories like property type.