The purpose of this data dive is to gain an initial understanding of the Building Energy Benchmarking dataset by examining summary statistics and visualizations. This exploratory analysis helps identify patterns, variability, missing data, and relationships between building characteristics and energy performance. These insights inform future analysis and guide the formulation of research questions related to building energy efficiency.
Research Questions:
Based on the column summaries, data documentation, and project goals, the following questions were developed:
How does average energy use intensity vary across different property use types?
What is the distribution of ENERGY STAR scores across buildings, and how much data is missing for this metric?
Are certain property use types more likely to have missing energy performance data than others?
Two key numeric variables were examined:
ENERGY STAR Score: A standardized efficiency score ranging from 1 to 100, where higher values indicate better energy performance.
Site Energy Use Intensity (Site EUI, kBtu/sf): Measures total energy use per square foot, where higher values indicate greater energy consumption intensity.
Summary statistics including minimum, maximum, mean, median, and quartiles were computed for both variables.
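ENERGY STAR scores show a wide range, starting at the scale minimum of 1, with a first quartile of 59, a median of 80, and a mean of 72.4. Because the mean falls below the median, the distribution is pulled down by a tail of low-scoring buildings.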
Site EUI values are right-skewed, with some buildings consuming significantly more energy per square foot than others.
The difference between quartiles suggests that energy consumption is not evenly distributed across buildings.
Understanding the distribution and spread of these variables is critical for identifying inefficient buildings and for contextualizing comparisons across building types.
library(conflicted)
conflicts_prefer(
  dplyr::filter,
  dplyr::lag
)
library(tidyverse)
library(lubridate)
library(skimr)
data <- read_csv("/Users/divya/Desktop/IU/Statistics R Prog/Labs/Assignments/Building_Energy_Benchmarking_Data__2015-Present.csv")
names(data)[stringr::str_detect(names(data), "STAR|star|EUI|eui")]
## [1] "ENERGYSTARScore" "SiteEUIWN(kBtu/sf)" "SiteEUI(kBtu/sf)"
## [4] "SourceEUIWN(kBtu/sf)" "SourceEUI(kBtu/sf)"
data %>%
  summarise(
    energy_star_min = min(`ENERGYSTARScore`, na.rm = TRUE),
    energy_star_q1 = quantile(`ENERGYSTARScore`, 0.25, na.rm = TRUE),
    energy_star_median = median(`ENERGYSTARScore`, na.rm = TRUE),
    energy_star_mean = mean(`ENERGYSTARScore`, na.rm = TRUE),
    energy_star_q3 = quantile(`ENERGYSTARScore`, 0.75, na.rm = TRUE),
    energy_star_max = max(`ENERGYSTARScore`, na.rm = TRUE),
    eui_min = min(`SiteEUI(kBtu/sf)`, na.rm = TRUE),
    eui_q1 = quantile(`SiteEUI(kBtu/sf)`, 0.25, na.rm = TRUE),
    eui_median = median(`SiteEUI(kBtu/sf)`, na.rm = TRUE),
    eui_mean = mean(`SiteEUI(kBtu/sf)`, na.rm = TRUE),
    eui_q3 = quantile(`SiteEUI(kBtu/sf)`, 0.75, na.rm = TRUE),
    eui_max = max(`SiteEUI(kBtu/sf)`, na.rm = TRUE)
  )
## # A tibble: 1 × 12
## energy_star_min energy_star_q1 energy_star_median energy_star_mean
## <dbl> <dbl> <dbl> <dbl>
## 1 1 59 80 72.4
## # ℹ 8 more variables: energy_star_q3 <dbl>, energy_star_max <dbl>,
## # eui_min <dbl>, eui_q1 <dbl>, eui_median <dbl>, eui_mean <dbl>,
## # eui_q3 <dbl>, eui_max <dbl>
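The wide one-row summary above is truncated when printed, so the Site EUI statistics are not visible. One possible alternative, sketched here on a small made-up tibble (the `toy` values are illustrative assumptions, not taken from the dataset), is to reshape to long format first so that each variable gets its own row and all statistics fit on screen:

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the benchmarking data; the values are invented for illustration.
toy <- tibble(
  ENERGYSTARScore = c(10, 55, 80, NA, 95),
  `SiteEUI(kBtu/sf)` = c(30, 45, NA, 60, 400)
)

# One row per variable, one column per statistic.
stats_long <- toy %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  group_by(variable) %>%
  summarise(
    n_missing = sum(is.na(value)),
    min = min(value, na.rm = TRUE),
    q1 = quantile(value, 0.25, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    mean = mean(value, na.rm = TRUE),
    q3 = quantile(value, 0.75, na.rm = TRUE),
    max = max(value, na.rm = TRUE)
  )

print(stats_long)
```

On the real data, the same pipeline could be applied after selecting only the two columns of interest.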
library(janitor)
data <- clean_names(data)
The categorical variable largest_property_use_type was summarized by counting unique values and their frequencies.
Multifamily Housing (18,416 rows) and Office (4,988 rows) appear far more frequently than any other property use type.
Some property categories have relatively small sample sizes, which may affect the stability of summary statistics for those groups.
This information helps determine which categories are well-represented and which may need to be grouped or excluded in later analyses.
names(data)[stringr::str_detect(names(data), "property|use|type")]
data %>%
  count(largest_property_use_type, sort = TRUE)
## # A tibble: 70 × 2
## largest_property_use_type n
## <chr> <int>
## 1 Multifamily Housing 18416
## 2 Office 4988
## 3 Non-Refrigerated Warehouse 1653
## 4 K-12 School 1363
## 5 Retail Store 827
## 6 Hotel 804
## 7 Worship Facility 657
## 8 Other 655
## 9 Distribution Center 473
## 10 Medical Office 453
## # ℹ 60 more rows
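If small categories are to be grouped in later analyses, one option is `forcats::fct_lump_min()`, which collapses every level that occurs fewer than a chosen number of times. The sketch below uses made-up counts, and the threshold `min_n` and the label "Small categories (lumped)" are hypothetical choices. A distinct label is used because the dataset already contains its own "Other" category.

```r
library(dplyr)
library(forcats)

# Toy stand-in; the category counts are invented for illustration.
toy <- tibble(
  largest_property_use_type = c(
    rep("Multifamily Housing", 6),
    rep("Office", 4),
    "Data Center", "Zoo", "Other"
  )
)

min_n <- 3  # hypothetical minimum group size

# Collapse every use type with fewer than min_n rows into a single level.
lumped <- toy %>%
  mutate(
    use_type_lumped = fct_lump_min(
      largest_property_use_type,
      min = min_n,
      other_level = "Small categories (lumped)"
    )
  ) %>%
  count(use_type_lumped, sort = TRUE)

print(lumped)
```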
To address the first question, average Site EUI was calculated by grouping buildings by their largest property use type.
Average energy use intensity varies substantially across property types.
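Data centers have by far the highest average Site EUI (about 862 kBtu/sf), followed by Other - Services (535) and Residential Care Facility (363). Supermarkets, laboratories, hospitals, and medical offices also rank in the top ten, while Office and Multifamily Housing do not. Several of the highest-ranked categories have few rows (for example, n = 34 for Data Center), so their averages may be driven by a handful of extreme values.
This suggests that building function is strongly associated with energy consumption and should be accounted for when comparing building performance or designing efficiency interventions.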
data %>%
  group_by(largest_property_use_type) %>%
  summarise(
    avg_eui = mean(site_eui_k_btu_sf, na.rm = TRUE),
    n = n()
  ) %>%
  arrange(desc(avg_eui))
## # A tibble: 70 × 3
## largest_property_use_type avg_eui n
## <chr> <dbl> <int>
## 1 Data Center 862. 34
## 2 Other - Services 535. 35
## 3 Residential Care Facility 363. 27
## 4 Supermarket/Grocery Store 229. 373
## 5 Laboratory 227. 212
## 6 Hospital (General Medical & Surgical) 198. 109
## 7 Restaurant 174. 123
## 8 Medical Office 172. 453
## 9 Other/Specialty Hospital 164. 41
## 10 Other 163. 655
## # ℹ 60 more rows
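Group means are sensitive to outliers and to small groups, and `n = n()` in the summary above counts every row in a group, including rows where Site EUI is missing. A minimal sketch of a more robust comparison, using made-up values and a hypothetical minimum group size `min_reported`, reports the median next to the mean and drops groups with too few reported values:

```r
library(dplyr)

# Toy stand-in; the values are invented for illustration.
toy <- tibble(
  largest_property_use_type = c("Office", "Office", "Office",
                                "Lab", "Lab", "Lab", "Kiosk"),
  site_eui_k_btu_sf = c(50, 60, 70, 200, 220, 900, 5000)
)

min_reported <- 3  # hypothetical threshold for a usable group

eui_by_type <- toy %>%
  group_by(largest_property_use_type) %>%
  summarise(
    n_reported = sum(!is.na(site_eui_k_btu_sf)),  # rows that actually have an EUI
    avg_eui = mean(site_eui_k_btu_sf, na.rm = TRUE),
    median_eui = median(site_eui_k_btu_sf, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(n_reported >= min_reported) %>%
  arrange(desc(median_eui))

print(eui_by_type)
```

In the toy data the single 900 value pulls the Lab mean well above its median, which is the pattern to look for in small real-world groups.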
A histogram was used to visualize the distribution of ENERGY STAR scores.
Insights
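Scores are concentrated toward the upper end of the scale rather than the mid-range: the median is 80 and the first quartile is 59, so about three quarters of scored buildings score 59 or higher, with a thinner tail of low scores.
ENERGY STAR scores are missing for 9,285 of 34,699 rows (26.8%), as shown by the histogram warning and the missing data summary.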
Significance
This highlights both overall efficiency trends and data limitations that may affect interpretation.
ggplot(data, aes(x = energystar_score)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(
    title = "Distribution of ENERGY STAR Scores",
    x = "ENERGY STAR Score",
    y = "Count"
  )
## Warning: Removed 9285 rows containing non-finite outside the scale range
## (`stat_bin()`).
data %>%
  summarise(
    missing_energy_star = sum(is.na(energystar_score)),
    pct_missing = mean(is.na(energystar_score)) * 100
  )
## # A tibble: 1 × 2
## missing_energy_star pct_missing
## <int> <dbl>
## 1 9285 26.8
A boxplot was used to visualize Site EUI across property use types, with color used to distinguish categories.
Insights
Energy use intensity differs markedly by property use type.
Some categories show wide variability and notable outliers, indicating inconsistent energy performance within the same building function.
Significance
Visualizing interactions between categorical and continuous variables reveals heterogeneity that would be missed in aggregate summaries alone.
data %>%
  filter(!is.na(largest_property_use_type), !is.na(site_eui_k_btu_sf)) %>%
  ggplot(aes(
    x = largest_property_use_type,
    y = site_eui_k_btu_sf,
    fill = largest_property_use_type
  )) +
  geom_boxplot(outlier.alpha = 0.3) +
  coord_flip() +
  labs(
    title = "Energy Use Intensity by Property Use Type (Non-missing)",
    x = "Property Use Type",
    y = "Site EUI (kBtu/sf)"
  ) +
  theme(legend.position = "none")
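With roughly 70 categories and a heavy right tail, a boxplot of every use type can be hard to read. One possible refinement, sketched on made-up data, keeps only the most frequent categories, orders them by median Site EUI, and uses a log scale. The cutoff `top_n_types` is a hypothetical choice, and zero or negative values are dropped because they cannot be shown on a log axis.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Toy stand-in; the values are invented for illustration.
toy <- tibble(
  largest_property_use_type = rep(c("Office", "Lab", "Hotel", "Zoo"),
                                  times = c(5, 5, 5, 1)),
  site_eui_k_btu_sf = c(40, 50, 60, 70, 80,
                        150, 200, 250, 300, 900,
                        70, 80, 90, 100, 110,
                        500)
)

top_n_types <- 3  # hypothetical number of categories to keep

# Most frequent use types among rows with a reported EUI.
top_types <- toy %>%
  filter(!is.na(site_eui_k_btu_sf)) %>%
  count(largest_property_use_type, sort = TRUE) %>%
  slice_head(n = top_n_types) %>%
  pull(largest_property_use_type)

plot_data <- toy %>%
  filter(
    largest_property_use_type %in% top_types,
    !is.na(site_eui_k_btu_sf),
    site_eui_k_btu_sf > 0
  ) %>%
  mutate(
    # Order categories by their median EUI so the plot reads bottom to top.
    largest_property_use_type = fct_reorder(
      largest_property_use_type, site_eui_k_btu_sf, .fun = median
    )
  )

p <- ggplot(plot_data, aes(x = site_eui_k_btu_sf, y = largest_property_use_type)) +
  geom_boxplot(outlier.alpha = 0.3) +
  scale_x_log10() +
  labs(
    title = "Site EUI by Property Use Type (most frequent types, log scale)",
    x = "Site EUI (kBtu/sf, log scale)",
    y = "Property Use Type"
  )

print(p)
```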
Missing data was quantified for ENERGY STAR scores, Site EUI, and property use type.
ENERGY STAR scores are missing for 26.8% of rows (9,285 of 34,699).
Site EUI is missing for only 1,275 rows (3.67%), and property use type for 21 rows, making Site EUI the more reliable variable for analyses that need broad coverage.
Missing data can bias conclusions and should be explicitly accounted for in future analyses, either through filtering, imputation, or sensitivity checks.
data %>%
  summarise(
    total_rows = n(),
    missing_use_type = sum(is.na(largest_property_use_type)),
    missing_site_eui = sum(is.na(site_eui_k_btu_sf)),
    pct_missing_site_eui = mean(is.na(site_eui_k_btu_sf)) * 100
  )
## # A tibble: 1 × 4
## total_rows missing_use_type missing_site_eui pct_missing_site_eui
## <int> <int> <int> <dbl>
## 1 34699 21 1275 3.67
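The third research question, whether some property use types are more likely to have missing energy performance data, is not answered by the overall counts above. A minimal sketch of how it could be approached is shown below on made-up data. The column names match the cleaned names used in this report, but the values are illustrative assumptions.

```r
library(dplyr)

# Toy stand-in; the values are invented for illustration.
toy <- tibble(
  largest_property_use_type = c(rep("Office", 4), rep("Warehouse", 4)),
  energystar_score = c(80, NA, 75, 90, NA, NA, NA, 60),
  site_eui_k_btu_sf = c(50, 55, NA, 60, 20, 25, 30, 35)
)

# Share of rows with a missing value, per use type and per metric.
missing_by_type <- toy %>%
  group_by(largest_property_use_type) %>%
  summarise(
    n = n(),
    pct_missing_energy_star = mean(is.na(energystar_score)) * 100,
    pct_missing_site_eui = mean(is.na(site_eui_k_btu_sf)) * 100,
    .groups = "drop"
  ) %>%
  arrange(desc(pct_missing_energy_star))

print(missing_by_type)
```

On the real data, percentages for very small groups would be unstable, so it may be worth reading them alongside `n`.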
This data dive revealed substantial variability in building energy performance, clear differences across property use types, and important data quality considerations. Future work will explore trends over time, correlations between efficiency metrics, and multivariate models that account for building characteristics such as size, use, and location.