The purpose of this data dive is to gain an initial understanding of the Building Energy Benchmarking dataset by examining summary statistics and visualizations. This exploratory analysis helps identify patterns, variability, missing data, and relationships between building characteristics and energy performance. These insights inform future analysis and guide the formulation of research questions related to building energy efficiency.
Research Questions:
Based on the column summaries, data documentation, and project goals,
the following questions were developed:
How does average energy use intensity vary across different property use types?
What is the distribution of ENERGY STAR scores across buildings, and how much data is missing for this metric?
Are certain property use types more likely to have missing energy performance data than others?
Two key numeric variables were examined:
ENERGY STAR Score: A standardized efficiency score ranging from 1 to 100, where higher values indicate better energy performance.
Site Energy Use Intensity (Site EUI, kBtu/sf): Measures total energy use per square foot, where higher values indicate greater energy consumption intensity.
Summary statistics including minimum, maximum, mean, median, and quartiles were computed for both variables.
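ENERGY STAR scores start at a minimum of 1, with a first quartile of 59, a median of 80, and a mean of 72.4, indicating substantial variation in building efficiency even though most scored buildings sit in the upper part of the scale.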
Site EUI values are right-skewed, with some buildings consuming significantly more energy per square foot than others.
The difference between quartiles suggests that energy consumption is not evenly distributed across buildings.
Understanding the distribution and spread of these variables is critical for identifying inefficient buildings and for contextualizing comparisons across building types.
library(conflicted)
conflicts_prefer(
dplyr::filter,
dplyr::lag
)
library(tidyverse)
library(lubridate)
library(skimr)
data <- read_csv("/Users/divya/Desktop/IU/Statistics R Prog/Labs/Assignments/Building_Energy_Benchmarking_Data__2015-Present.csv")
names(data)[stringr::str_detect(names(data), "STAR|star|EUI|eui")]
## [1] "ENERGYSTARScore" "SiteEUIWN(kBtu/sf)" "SiteEUI(kBtu/sf)"
## [4] "SourceEUIWN(kBtu/sf)" "SourceEUI(kBtu/sf)"
data %>%
summarise(
energy_star_min = min(`ENERGYSTARScore`, na.rm = TRUE),
energy_star_q1 = quantile(`ENERGYSTARScore`, 0.25, na.rm = TRUE),
energy_star_median = median(`ENERGYSTARScore`, na.rm = TRUE),
energy_star_mean = mean(`ENERGYSTARScore`, na.rm = TRUE),
energy_star_q3 = quantile(`ENERGYSTARScore`, 0.75, na.rm = TRUE),
energy_star_max = max(`ENERGYSTARScore`, na.rm = TRUE),
eui_min = min(`SiteEUI(kBtu/sf)`, na.rm = TRUE),
eui_q1 = quantile(`SiteEUI(kBtu/sf)`, 0.25, na.rm = TRUE),
eui_median = median(`SiteEUI(kBtu/sf)`, na.rm = TRUE),
eui_mean = mean(`SiteEUI(kBtu/sf)`, na.rm = TRUE),
eui_q3 = quantile(`SiteEUI(kBtu/sf)`, 0.75, na.rm = TRUE),
eui_max = max(`SiteEUI(kBtu/sf)`, na.rm = TRUE)
)
## # A tibble: 1 × 12
## energy_star_min energy_star_q1 energy_star_median energy_star_mean
## <dbl> <dbl> <dbl> <dbl>
## 1 1 59 80 72.4
## # ℹ 8 more variables: energy_star_q3 <dbl>, energy_star_max <dbl>,
## # eui_min <dbl>, eui_q1 <dbl>, eui_median <dbl>, eui_mean <dbl>,
## # eui_q3 <dbl>, eui_max <dbl>
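The printed tibble truncates eight of its twelve columns, so the Site EUI statistics are not visible in the output. One possible alternative, sketched below, reshapes the two variables to long format so that each variable occupies one row and every statistic fits on screen. The `toy` tibble and all of its values are invented for illustration and are not taken from the dataset; only the two column names come from the analysis.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the benchmarking data (values are made up).
toy <- tibble(
  ENERGYSTARScore = c(10, 55, 80, 95, NA),
  `SiteEUI(kBtu/sf)` = c(30, 45, 60, 400, 52)
)

summary_long <- toy %>%
  # One row per (variable, value) pair instead of one column per variable.
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  group_by(variable) %>%
  summarise(
    n_missing = sum(is.na(value)),
    min_val = min(value, na.rm = TRUE),
    q1 = quantile(value, 0.25, na.rm = TRUE),
    med = median(value, na.rm = TRUE),
    avg = mean(value, na.rm = TRUE),
    q3 = quantile(value, 0.75, na.rm = TRUE),
    max_val = max(value, na.rm = TRUE),
    .groups = "drop"
  )

# width = Inf prints every column rather than truncating.
print(summary_long, width = Inf)
```

In this layout a mean well above the median, as in the toy Site EUI row, is a quick signal of right skew.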
library(janitor)
data <- clean_names(data)
The categorical variable largest_property_use_type was summarized by counting unique values and their frequencies.
Of the 70 distinct property use types, Multifamily Housing is by far the most common (18,416 records), followed by Office (4,988) and Non-Refrigerated Warehouse (1,653).
Some property categories have relatively small sample sizes, which may affect the stability of summary statistics for those groups.
This information helps determine which categories are well-represented and which may need to be grouped or excluded in later analyses.
names(data)[stringr::str_detect(names(data), "property|use|type")]
data %>%
count(largest_property_use_type, sort = TRUE)
## # A tibble: 70 × 2
## largest_property_use_type n
## <chr> <int>
## 1 Multifamily Housing 18416
## 2 Office 4988
## 3 Non-Refrigerated Warehouse 1653
## 4 K-12 School 1363
## 5 Retail Store 827
## 6 Hotel 804
## 7 Worship Facility 657
## 8 Other 655
## 9 Distribution Center 473
## 10 Medical Office 453
## # ℹ 60 more rows
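Deciding which categories are well-represented can be made explicit with a minimum-count rule. The sketch below flags categories under a threshold and then lumps them with `forcats::fct_lump_min()`. The threshold `min_n`, the label "Other (lumped)", and the `toy` data are all hypothetical choices for illustration, not values used in this analysis.

```r
library(dplyr)
library(forcats)

# Hypothetical stand-in for the property use type column (counts are made up).
toy <- tibble(
  largest_property_use_type = c(
    rep("Multifamily Housing", 6), rep("Office", 4),
    "Data Center", "Laboratory"
  )
)

min_n <- 3  # assumed threshold; choose based on the analysis at hand

# Frequency table with each category's share and a small-sample flag.
type_counts <- toy %>%
  count(largest_property_use_type, sort = TRUE) %>%
  mutate(
    share = n / sum(n),
    small_sample = n < min_n
  )

# Collapse every category with fewer than min_n rows into one level.
lumped <- toy %>%
  mutate(
    use_type_lumped = fct_lump_min(
      largest_property_use_type,
      min = min_n,
      other_level = "Other (lumped)"
    )
  )

count(lumped, use_type_lumped)
```

A distinct label is used for the lumped level because the dataset already contains a category named Other, and reusing that name would merge the two.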
To address the first question, average Site EUI was calculated by grouping buildings by their largest property use type.
Average energy use intensity varies substantially across property types.
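Data centers have by far the highest average Site EUI (about 862 kBtu/sf), followed by Other - Services (535) and Residential Care Facility (363); supermarkets, laboratories, and hospitals also rank in the top ten. Several of the highest-ranked categories are small (27 to 35 records), so their averages are sensitive to individual extreme values.
These results suggest that building function is a major driver of energy consumption and should be accounted for when comparing building performance or designing efficiency interventions.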
data %>%
group_by(largest_property_use_type) %>%
summarise(
avg_eui = mean(site_eui_k_btu_sf, na.rm = TRUE),
n = n()
) %>%
arrange(desc(avg_eui))
## # A tibble: 70 × 3
## largest_property_use_type avg_eui n
## <chr> <dbl> <int>
## 1 Data Center 862. 34
## 2 Other - Services 535. 35
## 3 Residential Care Facility 363. 27
## 4 Supermarket/Grocery Store 229. 373
## 5 Laboratory 227. 212
## 6 Hospital (General Medical & Surgical) 198. 109
## 7 Restaurant 174. 123
## 8 Medical Office 172. 453
## 9 Other/Specialty Hospital 164. 41
## 10 Other 163. 655
## # ℹ 60 more rows
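Because Site EUI is right-skewed and some categories are small, a group mean can be pulled upward by a single extreme record. A minimal sketch of a more robust comparison reports the median next to the mean and drops groups below a minimum size. The threshold `min_n` and the `toy` values are invented for illustration.

```r
library(dplyr)

# Hypothetical stand-in data; the 3000 value mimics one extreme record.
toy <- tibble(
  largest_property_use_type = c(
    rep("Office", 4), rep("Data Center", 4), "Laboratory"
  ),
  site_eui_k_btu_sf = c(50, 60, 55, 65, 200, 220, 210, 3000, 250)
)

min_n <- 3  # assumed minimum number of valid records per group

eui_by_type <- toy %>%
  filter(!is.na(site_eui_k_btu_sf)) %>%
  group_by(largest_property_use_type) %>%
  summarise(
    n_valid = n(),                          # rows with a usable EUI value
    avg_eui = mean(site_eui_k_btu_sf),
    median_eui = median(site_eui_k_btu_sf),
    .groups = "drop"
  ) %>%
  filter(n_valid >= min_n) %>%              # drop groups too small to trust
  arrange(desc(median_eui))

eui_by_type
```

Counting only rows with a non-missing EUI (`n_valid`) keeps the sample size consistent with the values that actually enter the mean and median.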
A histogram was used to visualize the distribution of ENERGY STAR scores.
Insights
Scores are concentrated toward the upper end of the scale rather than the mid-range: the median is 80, the mean is 72.4, and three quarters of scored buildings are at 59 or above.
ENERGY STAR scores are missing for 9,285 records (26.8%), as shown by the missing data summary.
Significance
This highlights both overall efficiency trends and data limitations that may affect interpretation.
ggplot(data, aes(x = energystar_score)) +
geom_histogram(bins = 30, fill = "steelblue", color = "white") +
labs(
title = "Distribution of ENERGY STAR Scores",
x = "ENERGY STAR Score",
y = "Count"
)
## Warning: Removed 9285 rows containing non-finite outside the scale range
## (`stat_bin()`).
data %>%
summarise(
missing_energy_star = sum(is.na(energystar_score)),
pct_missing = mean(is.na(energystar_score)) * 100
)
## # A tibble: 1 × 2
## missing_energy_star pct_missing
## <int> <dbl>
## 1 9285 26.8
Grouped boxplots were used to visualize Site EUI across the 12 most common property use types (remaining types are lumped into Other), with fill color distinguishing compliance status.
Insights
Energy use intensity differs markedly by property use type.
Some categories show wide variability and notable outliers, indicating inconsistent energy performance within the same building function.
Significance
Visualizing interactions between categorical and continuous variables reveals heterogeneity that would be missed in aggregate summaries alone.
library(tidyverse)
library(janitor)
library(forcats)
library(scales)
library(plotly)
conflicts_prefer(plotly::layout)
## [conflicted] Will prefer plotly::layout over any other package.
plot_df <- data %>%
filter(
!is.na(largest_property_use_type),
!is.na(site_eui_k_btu_sf),
!is.na(compliance_status)
) %>%
mutate(
largest_property_use_type = forcats::fct_lump_n(largest_property_use_type, n = 12),
largest_property_use_type = forcats::fct_reorder(
largest_property_use_type,
site_eui_k_btu_sf,
.fun = median,
na.rm = TRUE
)
)
p <- ggplot(plot_df, aes(
x = largest_property_use_type,
y = site_eui_k_btu_sf,
fill = compliance_status,
group = interaction(largest_property_use_type, compliance_status),
text = paste0(
"Property Type: ", largest_property_use_type,
"<br>Compliance: ", compliance_status,
"<br>Site EUI: ", round(site_eui_k_btu_sf, 2)
)
)) +
geom_boxplot(outlier.alpha = 0.25, position = position_dodge(width = 0.8)) +
coord_flip() +
scale_y_continuous(labels = scales::comma) +
scale_fill_viridis_d() +
labs(
title = "Site EUI by Property Use Type and Compliance Status",
x = "Property Use Type",
y = "Site EUI (kBtu/sf)",
fill = "Compliance Status"
) +
theme_minimal(base_size = 12) +
theme(
plot.title.position = "plot",
plot.margin = margin(10, 20, 10, 10),
axis.text.y = element_text(size = 9)
)
ggplotly(p, tooltip = "text") %>%
layout(
title = list(
text = "Site EUI by Property Use Type and Compliance Status",
font = list(size = 20),
x = 0.5
)
)
## Warning: The following aesthetics were dropped during statistical transformation: text.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
How has average energy use per square foot changed over time for different types of buildings? Grouping by both year and property use type shows whether average Site EUI is falling, rising, or holding steady for the six most common property use types (remaining types are lumped into Other), rather than only showing an overall average. Differences in slope between lines indicate that some property types may be improving in energy efficiency faster than others, while convergence or divergence over time may reflect changes in building codes, technology adoption, or operational practices.
trend_df <- data %>%
filter(!is.na(data_year), !is.na(site_eui_k_btu_sf), !is.na(largest_property_use_type)) %>%
mutate(largest_property_use_type = fct_lump_n(largest_property_use_type, n = 6)) %>%
group_by(data_year, largest_property_use_type) %>%
summarise(mean_eui = mean(site_eui_k_btu_sf, na.rm = TRUE), n = n(), .groups = "drop")
ggplot(trend_df, aes(x = data_year, y = mean_eui, color = largest_property_use_type)) +
geom_line(linewidth = 1) +
geom_point() +
labs(
title = "Trend in Mean Site EUI Over Time (Top 6 Property Types)",
x = "Data Year",
y = "Mean Site EUI (kBtu/sf)",
color = "Property Type"
) +
theme_minimal(base_size = 12)
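Differences in slope between the lines can also be put into numbers. The sketch below computes a per-type least-squares slope of mean Site EUI on year, using the identity slope = cov(x, y) / var(x). The `toy_trend` tibble mirrors the structure of `trend_df`, but its values are invented for illustration.

```r
library(dplyr)

# Hypothetical yearly means for two property types (values are made up).
toy_trend <- tibble(
  data_year = rep(2015:2018, times = 2),
  largest_property_use_type = rep(c("Office", "Hotel"), each = 4),
  mean_eui = c(70, 68, 66, 64, 90, 90, 91, 90)
)

slopes <- toy_trend %>%
  group_by(largest_property_use_type) %>%
  summarise(
    n_years = n_distinct(data_year),
    # Ordinary least-squares slope of mean_eui on data_year:
    # change in mean Site EUI (kBtu/sf) per year.
    slope_per_year = cov(data_year, mean_eui) / var(data_year),
    .groups = "drop"
  ) %>%
  arrange(slope_per_year)

slopes
```

A negative slope indicates falling average Site EUI. With only a handful of years per type, such slopes are descriptive rather than evidence of a statistically reliable trend.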
This visualization shows the relationship between Site Energy Use Intensity and GHG emissions intensity across different property use types. The upward trends indicate that buildings with higher energy use per square foot generally produce higher emissions. Differences in trend lines across property types suggest that the strength of this relationship varies by building function. This highlights the importance of considering both energy intensity and building use type when analyzing emissions patterns.
corr_df <- data %>%
filter(!is.na(site_eui_k_btu_sf), !is.na(ghg_emissions_intensity), !is.na(largest_property_use_type)) %>%
mutate(largest_property_use_type = fct_lump_n(largest_property_use_type, n = 6))
ggplot(corr_df, aes(x = site_eui_k_btu_sf, y = ghg_emissions_intensity, color = largest_property_use_type)) +
geom_point(alpha = 0.25) +
geom_smooth(se = FALSE) +
labs(
title = "Relationship Between Site EUI and GHG Emissions Intensity",
x = "Site EUI (kBtu/sf)",
y = "GHG Emissions Intensity",
color = "Property Type"
) +
theme_minimal(base_size = 12)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
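The claim that the strength of the relationship varies by building function can be checked with a correlation computed per property type. A minimal sketch, assuming the same cleaned column names, uses Spearman rank correlation, which is less sensitive to the extreme Site EUI values than Pearson correlation. The `toy` values are invented for illustration.

```r
library(dplyr)

# Hypothetical stand-in data for two property types (values are made up).
toy <- tibble(
  largest_property_use_type = rep(c("Office", "Hotel"), each = 5),
  site_eui_k_btu_sf = c(40, 50, 60, 70, 80, 90, 100, 110, 120, 130),
  ghg_emissions_intensity = c(0.4, 0.5, 0.7, 0.8, 1.0, 2.0, 1.5, 2.6, 2.2, 3.0)
)

cor_by_type <- toy %>%
  filter(!is.na(site_eui_k_btu_sf), !is.na(ghg_emissions_intensity)) %>%
  group_by(largest_property_use_type) %>%
  summarise(
    n_pairs = n(),  # complete pairs entering the correlation
    spearman_rho = cor(
      site_eui_k_btu_sf,
      ghg_emissions_intensity,
      method = "spearman"
    ),
    .groups = "drop"
  ) %>%
  arrange(desc(spearman_rho))

cor_by_type
```

Values near 1 indicate that emissions intensity rises almost monotonically with Site EUI within that property type, while lower values point to a weaker or noisier relationship.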
Missing data was quantified for ENERGY STAR scores, Site EUI, and property use type.
ENERGY STAR scores are missing for 9,285 of 34,699 records (26.8%).
Site EUI is missing for only 1,275 records (3.67%) and property use type for 21, making Site EUI the more complete performance measure for most analyses.
Missing data can bias conclusions and should be explicitly accounted for in future analyses, either through filtering, imputation, or sensitivity checks.
data %>%
summarise(
total_rows = n(),
missing_use_type = sum(is.na(largest_property_use_type)),
missing_site_eui = sum(is.na(site_eui_k_btu_sf)),
pct_missing_site_eui = mean(is.na(site_eui_k_btu_sf)) * 100
)
## # A tibble: 1 × 4
## total_rows missing_use_type missing_site_eui pct_missing_site_eui
## <int> <int> <int> <dbl>
## 1 34699 21 1275 3.67
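The overall counts do not show whether certain property use types are more likely to have missing energy performance data than others. A minimal sketch of a per-type breakdown, assuming the same cleaned column names, computes the percentage missing for each metric within each group. The `toy` tibble and its values are invented for illustration.

```r
library(dplyr)

# Hypothetical stand-in data (values and missingness pattern are made up).
toy <- tibble(
  largest_property_use_type = c(rep("Office", 4), rep("Worship Facility", 4)),
  energystar_score = c(80, 75, NA, 90, NA, NA, 60, NA),
  site_eui_k_btu_sf = c(50, 55, 60, NA, 40, 45, 42, 41)
)

missing_by_type <- toy %>%
  group_by(largest_property_use_type) %>%
  summarise(
    n_rows = n(),
    # Percent of rows with NA, computed separately for each metric.
    across(
      c(energystar_score, site_eui_k_btu_sf),
      ~ mean(is.na(.x)) * 100,
      .names = "pct_missing_{.col}"
    ),
    .groups = "drop"
  ) %>%
  arrange(desc(pct_missing_energystar_score))

missing_by_type
```

Reading the percentages next to `n_rows` matters here, since a high missing rate in a category with very few records is weaker evidence than the same rate in a large category.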
This data dive revealed substantial variability in building energy performance, clear differences across property use types, and important data quality considerations. Future work will examine the time trends and the relationship between efficiency metrics more formally, and will build multivariate models that account for building characteristics such as size, use, and location.