Week 2

The purpose of this data dive is to gain an initial understanding of the Building Energy Benchmarking dataset by examining summary statistics and visualizations. This exploratory analysis helps identify patterns, variability, missing data, and relationships between building characteristics and energy performance. These insights inform future analysis and guide the formulation of research questions related to building energy efficiency.

Research Questions:
Based on the column summaries, data documentation, and project goals, the following questions were developed:

How does average energy use intensity vary across different property use types?
What is the distribution of ENERGY STAR scores across buildings, and how much data is missing for this metric?
Are certain property use types more likely to have missing energy performance data than others?

Numerical Analysis

Two key numeric variables were examined:

ENERGY STAR Score: A standardized efficiency score ranging from 1 to 100, where higher values indicate better energy performance.
Site Energy Use Intensity (Site EUI, kBtu/sf): Measures total energy use per square foot, where higher values indicate greater energy consumption intensity.
Summary statistics including minimum, maximum, mean, median, and quartiles were computed for both variables.

Insights Gained

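ENERGY STAR scores span a wide range (minimum 1, first quartile 59, median 80, mean 72.4), indicating substantial variation in building efficiency. The mean falling below the median suggests a left-skewed distribution.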
Site EUI values are right-skewed, with some buildings consuming significantly more energy per square foot than others.
The difference between quartiles suggests that energy consumption is not evenly distributed across buildings.

Significance

Understanding the distribution and spread of these variables is critical for identifying inefficient buildings and for contextualizing comparisons across building types.

library(conflicted)

conflicts_prefer(
  dplyr::filter,
  dplyr::lag
)

library(tidyverse)
library(lubridate)
library(skimr)
data <- read_csv("/Users/divya/Desktop/IU/Statistics R Prog/Labs/Assignments/Building_Energy_Benchmarking_Data__2015-Present.csv")

names(data)[stringr::str_detect(names(data), "STAR|star|EUI|eui")]

## [1] "ENERGYSTARScore"      "SiteEUIWN(kBtu/sf)"   "SiteEUI(kBtu/sf)"    
## [4] "SourceEUIWN(kBtu/sf)" "SourceEUI(kBtu/sf)"

data %>%
  summarise(
    energy_star_min = min(`ENERGYSTARScore`, na.rm = TRUE),
    energy_star_q1 = quantile(`ENERGYSTARScore`, 0.25, na.rm = TRUE),
    energy_star_median = median(`ENERGYSTARScore`, na.rm = TRUE),
    energy_star_mean = mean(`ENERGYSTARScore`, na.rm = TRUE),
    energy_star_q3 = quantile(`ENERGYSTARScore`, 0.75, na.rm = TRUE),
    energy_star_max = max(`ENERGYSTARScore`, na.rm = TRUE),

    eui_min = min(`SiteEUI(kBtu/sf)`, na.rm = TRUE),
    eui_q1 = quantile(`SiteEUI(kBtu/sf)`, 0.25, na.rm = TRUE),
    eui_median = median(`SiteEUI(kBtu/sf)`, na.rm = TRUE),
    eui_mean = mean(`SiteEUI(kBtu/sf)`, na.rm = TRUE),
    eui_q3 = quantile(`SiteEUI(kBtu/sf)`, 0.75, na.rm = TRUE),
    eui_max = max(`SiteEUI(kBtu/sf)`, na.rm = TRUE)
  )

## # A tibble: 1 × 12
##   energy_star_min energy_star_q1 energy_star_median energy_star_mean
##             <dbl>          <dbl>              <dbl>            <dbl>
## 1               1             59                 80             72.4
## # ℹ 8 more variables: energy_star_q3 <dbl>, energy_star_max <dbl>,
## #   eui_min <dbl>, eui_q1 <dbl>, eui_median <dbl>, eui_mean <dbl>,
## #   eui_q3 <dbl>, eui_max <dbl>
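Because the one-row tibble above is truncated to the console width, most of the twelve statistics are hidden. One possible way to see them all is to reshape the two columns to long format before summarising, so each variable becomes a row. This is a minimal sketch, not the original analysis: `toy` is a hypothetical stand-in for `data` with made-up values, and `summary_long` is an assumed name.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for the benchmarking file (values are made up).
toy <- data.frame(
  ENERGYSTARScore = c(1, 59, 80, 95, NA, 72),
  `SiteEUI(kBtu/sf)` = c(30.5, 48.2, NA, 95.1, 210.4, 61.0),
  check.names = FALSE
)

# One row per variable, one column per statistic.
summary_long <- toy %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  group_by(variable) %>%
  summarise(
    n_missing = sum(is.na(value)),
    min = min(value, na.rm = TRUE),
    q1 = quantile(value, 0.25, na.rm = TRUE, names = FALSE),
    median = median(value, na.rm = TRUE),
    mean = mean(value, na.rm = TRUE),
    q3 = quantile(value, 0.75, na.rm = TRUE, names = FALSE),
    max = max(value, na.rm = TRUE)
  )

# Converting to a plain data frame prints every column without truncation.
print(as.data.frame(summary_long))
```

With the real data, replacing `toy` with `data %>% select(ENERGYSTARScore, `SiteEUI(kBtu/sf)`)` should give the same layout.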

library(janitor)
data <- clean_names(data)

Categorical Data Summaries

The categorical variable largest_property_use_type was summarized by counting unique values and their frequencies.

Insights Gained

Certain property use types (such as offices and residential buildings) appear far more frequently than others.
Some property categories have relatively small sample sizes, which may affect the stability of summary statistics for those groups.

Significance

This information helps determine which categories are well-represented and which may need to be grouped or excluded in later analyses.

names(data)[stringr::str_detect(names(data), "property|use|type")]

data %>%
  count(largest_property_use_type, sort = TRUE)

## # A tibble: 70 × 2
##    largest_property_use_type      n
##    <chr>                      <int>
##  1 Multifamily Housing        18416
##  2 Office                      4988
##  3 Non-Refrigerated Warehouse  1653
##  4 K-12 School                 1363
##  5 Retail Store                 827
##  6 Hotel                        804
##  7 Worship Facility             657
##  8 Other                        655
##  9 Distribution Center          473
## 10 Medical Office               453
## # ℹ 60 more rows
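The grouping of sparse categories mentioned above could be sketched with `forcats::fct_lump_min()`, which lumps every level that appears fewer than `min` times. The toy data, the threshold `min_n`, and the label "Other (small n)" are all hypothetical choices for illustration; an appropriate cutoff for the real dataset would need to be decided separately.

```r
library(dplyr)
library(forcats)

# Hypothetical toy data (made-up counts), not the real benchmarking file.
toy <- data.frame(
  largest_property_use_type = c(
    rep("Multifamily Housing", 6),
    rep("Office", 4),
    "Data Center",
    "Zoo", "Zoo"
  )
)

min_n <- 3  # assumed threshold: levels with fewer rows are lumped together

lumped <- toy %>%
  mutate(
    use_type_lumped = fct_lump_min(
      largest_property_use_type,
      min = min_n,
      other_level = "Other (small n)"
    )
  ) %>%
  count(use_type_lumped, sort = TRUE)

print(lumped)
```

Using a distinct label for the lumped level keeps it separate from the dataset's own "Other" category.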

Aggregation Analysis

To address the first question, average Site EUI was calculated by grouping buildings by their largest property use type.

Insights Gained

Average energy use intensity varies substantially across property types.
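Data centers have by far the highest average Site EUI (about 862 kBtu/sf), and energy-intensive uses such as supermarkets, laboratories, and hospitals average roughly 200 to 230 kBtu/sf; offices and multifamily housing do not appear among the ten highest averages.
Several of the top-ranked categories have small samples (n = 34 for data centers, n = 27 for residential care facilities), so their averages should be read with caution.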

Significance

This suggests that building function is a major driver of energy consumption and should be accounted for when comparing building performance or designing efficiency interventions.

data %>%
  group_by(largest_property_use_type) %>%
  summarise(
    avg_eui = mean(site_eui_k_btu_sf, na.rm = TRUE),
    n = n()
  ) %>%
  arrange(desc(avg_eui))

## # A tibble: 70 × 3
##    largest_property_use_type             avg_eui     n
##    <chr>                                   <dbl> <int>
##  1 Data Center                              862.    34
##  2 Other - Services                         535.    35
##  3 Residential Care Facility                363.    27
##  4 Supermarket/Grocery Store                229.   373
##  5 Laboratory                               227.   212
##  6 Hospital (General Medical & Surgical)    198.   109
##  7 Restaurant                               174.   123
##  8 Medical Office                           172.   453
##  9 Other/Specialty Hospital                 164.    41
## 10 Other                                    163.   655
## # ℹ 60 more rows
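Group means can be pulled around by a few extreme values, especially in categories with few rows. A hedged variant of the aggregation above reports the median alongside the mean and keeps only groups with a minimum number of non-missing values. The toy data and the cutoff `min_n` are hypothetical; column names follow the cleaned names used in this report.

```r
library(dplyr)

# Hypothetical toy data (made-up values), not the real benchmarking file.
toy <- data.frame(
  largest_property_use_type = c(
    "Office", "Office", "Office", "Office",
    "Data Center", "Data Center",
    "Hotel", "Hotel", "Hotel"
  ),
  site_eui_k_btu_sf = c(50, 60, 70, NA, 800, 900, 80, 90, 400)
)

min_n <- 3  # assumed minimum number of non-missing EUI values per group

eui_by_type <- toy %>%
  group_by(largest_property_use_type) %>%
  summarise(
    n = n(),
    n_valid = sum(!is.na(site_eui_k_btu_sf)),
    mean_eui = mean(site_eui_k_btu_sf, na.rm = TRUE),
    median_eui = median(site_eui_k_btu_sf, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(n_valid >= min_n) %>%      # drop groups too small to summarise stably
  arrange(desc(median_eui))

print(eui_by_type)
```

In the toy example the hotel mean (190) is more than double its median (90) because of a single large value, which is the kind of gap worth checking in the real groups.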

Distribution of ENERGY STAR Scores

A histogram was used to visualize the distribution of ENERGY STAR scores.

Insights

Scores are concentrated toward the upper end of the scale rather than the mid-range: the median is 80 and the first quartile is 59, so low scores are comparatively rare.
ENERGY STAR scores are missing for 9,285 rows (26.8%); the histogram drops these rows, as the warning below indicates.

Significance

This highlights both overall efficiency trends and data limitations that may affect interpretation.

ggplot(data, aes(x = energystar_score)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(
    title = "Distribution of ENERGY STAR Scores",
    x = "ENERGY STAR Score",
    y = "Count"
  )

## Warning: Removed 9285 rows containing non-finite outside the scale range
## (`stat_bin()`).

data %>%
  summarise(
    missing_energy_star = sum(is.na(energystar_score)),
    pct_missing = mean(is.na(energystar_score)) * 100
  )

## # A tibble: 1 × 2
##   missing_energy_star pct_missing
##                 <int>       <dbl>
## 1                9285        26.8
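The shape seen in the histogram can be cross-checked numerically by binning the scores and tabulating the share in each band, with missing values kept as their own category. This is only a sketch in base R: the scores in `toy_scores` are made up, and the four bands are an arbitrary, assumed choice.

```r
# Hypothetical toy scores (made-up values), including two missing entries.
toy_scores <- c(1, 15, 42, 59, 63, 78, 80, 85, 91, 97, 100, NA, NA)

# Right-closed intervals: (0,25], (25,50], (50,75], (75,100].
bins <- cut(
  toy_scores,
  breaks = c(0, 25, 50, 75, 100),
  labels = c("1-25", "26-50", "51-75", "76-100")
)

# Counts per band, with NA reported as a separate category.
score_table <- table(bins, useNA = "ifany")
print(score_table)

# Percentage of all rows (including missing) in each band.
print(round(100 * prop.table(score_table), 1))
```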

Energy Use Intensity by Property Use Type

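A boxplot was used to visualize Site EUI across property use types, with fill color distinguishing compliance status. The 12 most frequent use types were kept, the rest were lumped into "Other", and categories were ordered by median Site EUI.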

Insights

Energy use intensity differs markedly by property use type.
Some categories show wide variability and notable outliers, indicating inconsistent energy performance within the same building function.

Significance

Visualizing interactions between categorical and continuous variables reveals heterogeneity that would be missed in aggregate summaries alone.

library(tidyverse)
library(janitor)
library(forcats)
library(scales)
library(plotly)
conflicts_prefer(plotly::layout)

## [conflicted] Will prefer plotly::layout over any other package.

plot_df <- data %>%
  filter(
    !is.na(largest_property_use_type),
    !is.na(site_eui_k_btu_sf),
    !is.na(compliance_status)
  ) %>%
  mutate(
    largest_property_use_type = forcats::fct_lump_n(largest_property_use_type, n = 12),
    largest_property_use_type = forcats::fct_reorder(
      largest_property_use_type,
      site_eui_k_btu_sf,
      .fun = median,
      na.rm = TRUE
    )
  )

p <- ggplot(plot_df, aes(
  x = largest_property_use_type,
  y = site_eui_k_btu_sf,
  fill = compliance_status,
  group = interaction(largest_property_use_type, compliance_status),
  text = paste0(
    "Property Type: ", largest_property_use_type,
    "<br>Compliance: ", compliance_status,
    "<br>Site EUI: ", round(site_eui_k_btu_sf, 2)
  )
)) +
  geom_boxplot(outlier.alpha = 0.25, position = position_dodge(width = 0.8)) +
  coord_flip() +
  scale_y_continuous(labels = scales::comma) +
  scale_fill_viridis_d() +
  labs(
    title = "Site EUI by Property Use Type and Compliance Status",
    x = "Property Use Type",
    y = "Site EUI (kBtu/sf)",
    fill = "Compliance Status"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title.position = "plot",
    plot.margin = margin(10, 20, 10, 10),
    axis.text.y = element_text(size = 9)
  )

ggplotly(p, tooltip = "text") %>%
  layout(
    title = list(
      text = "Site EUI by Property Use Type and Compliance Status",
      font = list(size = 20),
      x = 0.5
    )
  )

## Warning: The following aesthetics were dropped during statistical transformation: text.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?
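The variability and outliers visible in the boxplot can also be summarised as numbers. The sketch below counts, per property type, the values lying more than 1.5 × IQR beyond the quartiles, which is the conventional rule behind boxplot whiskers. The toy data is hypothetical, and `outlier_summary` is an assumed name.

```r
library(dplyr)

# Hypothetical toy data (made-up values), not the real benchmarking file.
box_toy <- data.frame(
  largest_property_use_type = rep(c("Office", "Hotel"), each = 5),
  site_eui_k_btu_sf = c(50, 55, 60, 65, 300, 80, 85, 90, 95, 100)
)

outlier_summary <- box_toy %>%
  group_by(largest_property_use_type) %>%
  summarise(
    q1 = quantile(site_eui_k_btu_sf, 0.25, names = FALSE),
    q3 = quantile(site_eui_k_btu_sf, 0.75, names = FALSE),
    iqr = q3 - q1,
    # Values outside [q1 - 1.5 * IQR, q3 + 1.5 * IQR] are flagged as outliers.
    n_outliers = sum(
      site_eui_k_btu_sf < q1 - 1.5 * iqr |
        site_eui_k_btu_sf > q3 + 1.5 * iqr
    ),
    .groups = "drop"
  )

print(outlier_summary)
```

On real data, missing Site EUI values would need to be filtered out first, as is done for `plot_df`.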

Trends in Mean Site Energy Use Intensity by Property Use-Type

How has average energy use per square foot changed over time for different types of buildings? Grouping by both year and property use type shows whether energy efficiency is improving, worsening, or remaining stable within each category, rather than only in the overall average. The chart covers the six most common property use types, with all remaining types lumped into "Other". Differences in slope between lines suggest that some property types may be improving faster than others, while convergence or divergence over time may reflect changes in building codes, technology adoption, or operational practices.

trend_df <- data %>%
  filter(!is.na(data_year), !is.na(site_eui_k_btu_sf), !is.na(largest_property_use_type)) %>%
  mutate(largest_property_use_type = fct_lump_n(largest_property_use_type, n = 6)) %>%
  group_by(data_year, largest_property_use_type) %>%
  summarise(mean_eui = mean(site_eui_k_btu_sf, na.rm = TRUE), n = n(), .groups = "drop")

ggplot(trend_df, aes(x = data_year, y = mean_eui, color = largest_property_use_type)) +
  geom_line(linewidth = 1) +
  geom_point() +
  labs(
    title = "Trend in Mean Site EUI Over Time (Top 6 Property Types)",
    x = "Data Year",
    y = "Mean Site EUI (kBtu/sf)",
    color = "Property Type"
  ) +
  theme_minimal(base_size = 12)
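The differences in slope described for this chart can be put into numbers by fitting a simple linear trend of mean Site EUI on year for each property type. This is a rough sketch under the assumption that a straight line is an adequate first summary; `trend_toy` is hypothetical data shaped like `trend_df`, with made-up values.

```r
library(dplyr)

# Hypothetical yearly means (made-up values), shaped like trend_df.
trend_toy <- data.frame(
  data_year = rep(2015:2018, times = 2),
  largest_property_use_type = rep(c("Office", "Hotel"), each = 4),
  mean_eui = c(60, 58, 56, 54, 90, 91, 92, 93)
)

slopes <- trend_toy %>%
  group_by(largest_property_use_type) %>%
  summarise(
    n_years = n(),
    # Change in mean Site EUI (kBtu/sf) per year from a per-group linear fit.
    slope_per_year = coef(lm(mean_eui ~ data_year))[["data_year"]],
    .groups = "drop"
  ) %>%
  arrange(slope_per_year)

print(slopes)
```

A negative slope corresponds to falling energy use intensity. With only a handful of years per type, such slopes are descriptive rather than evidence of a significant trend.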

Relationship Between Site EUI and GHG Emissions Intensity

This visualization shows the relationship between Site Energy Use Intensity and GHG emissions intensity across different property use types. The upward trends indicate that buildings with higher energy use per square foot generally produce higher emissions. Differences in trend lines across property types suggest that the strength of this relationship varies by building function. This highlights the importance of considering both energy intensity and building use type when analyzing emissions patterns.

corr_df <- data %>%
  filter(!is.na(site_eui_k_btu_sf), !is.na(ghg_emissions_intensity), !is.na(largest_property_use_type)) %>%
  mutate(largest_property_use_type = fct_lump_n(largest_property_use_type, n = 6))

ggplot(corr_df, aes(x = site_eui_k_btu_sf, y = ghg_emissions_intensity, color = largest_property_use_type)) +
  geom_point(alpha = 0.25) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Relationship Between Site EUI and GHG Emissions Intensity",
    x = "Site EUI (kBtu/sf)",
    y = "GHG Emissions Intensity",
    color = "Property Type"
  ) +
  theme_minimal(base_size = 12)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
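The strength of the relationship within each property type can be summarised with a correlation coefficient. The sketch below uses Spearman's rank correlation, an assumed choice made here because it is less sensitive to the skew and outliers seen in Site EUI; `corr_toy` is hypothetical data shaped like `corr_df`.

```r
library(dplyr)

# Hypothetical toy data (made-up values), shaped like corr_df.
corr_toy <- data.frame(
  largest_property_use_type = c(rep("Office", 4), rep("Hotel", 3)),
  site_eui_k_btu_sf = c(40, 50, 60, 70, 80, 90, 100),
  ghg_emissions_intensity = c(1.0, 1.5, 1.4, 2.2, 3.0, 4.0, 5.0)
)

cor_by_type <- corr_toy %>%
  group_by(largest_property_use_type) %>%
  summarise(
    n = n(),
    # Rank-based correlation between energy intensity and emissions intensity.
    spearman_rho = cor(
      site_eui_k_btu_sf,
      ghg_emissions_intensity,
      method = "spearman"
    ),
    .groups = "drop"
  )

print(cor_by_type)
```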

Missing Data Assessment

Missing data was quantified for ENERGY STAR scores, Site EUI, and property use type.

Insights Gained

ENERGY STAR scores are missing for 26.8% of rows (9,285 of 34,699).
Site EUI is missing for only 3.67% of rows (1,275), making it a more reliable variable for certain analyses; property use type is missing for just 21 rows.

Significance

Missing data can bias conclusions and should be explicitly accounted for in future analyses through filtering, imputation, or sensitivity checks.

data %>%
  summarise(
    total_rows = n(),
    missing_use_type = sum(is.na(largest_property_use_type)),
    missing_site_eui = sum(is.na(site_eui_k_btu_sf)),
    pct_missing_site_eui = mean(is.na(site_eui_k_btu_sf)) * 100
  )

## # A tibble: 1 × 4
##   total_rows missing_use_type missing_site_eui pct_missing_site_eui
##        <int>            <int>            <int>                <dbl>
## 1      34699               21             1275                 3.67
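The third research question asks whether missingness differs by property use type. One way to check is to compute the share of missing values within each group. The sketch below uses hypothetical toy data with made-up values; column names follow the cleaned names used in this report, and `missing_by_type` is an assumed name.

```r
library(dplyr)

# Hypothetical toy data (made-up values), not the real benchmarking file.
toy <- data.frame(
  largest_property_use_type = c(
    "Office", "Office", "Office", "Office",
    "Hotel", "Hotel",
    "Warehouse", "Warehouse", "Warehouse",
    NA
  ),
  energystar_score = c(80, NA, 75, 90, NA, NA, 60, 55, NA, NA),
  site_eui_k_btu_sf = c(50, 60, NA, 70, 80, 90, 20, 25, 30, NA)
)

missing_by_type <- toy %>%
  group_by(largest_property_use_type) %>%
  summarise(
    n = n(),
    pct_missing_energy_star = mean(is.na(energystar_score)) * 100,
    pct_missing_site_eui = mean(is.na(site_eui_k_btu_sf)) * 100,
    .groups = "drop"
  ) %>%
  arrange(desc(pct_missing_energy_star))

print(missing_by_type)
```

Rows with a missing use type form their own group here, so they stay visible in the result. Percentages for small groups should be read alongside `n`.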

Conclusion and Next Steps

This data dive revealed substantial variability in building energy performance, clear differences across property use types, and important data quality considerations. Future work will examine the time trends and the relationships between efficiency metrics more formally, and fit multivariate models that account for building characteristics such as size, use, and location.

Divya Kapoor

01-26-2026
