Commercial real estate portfolios often struggle to bridge the gap between technical sustainability goals and financial performance. While datasets like NYC’s Local Law 84 (LL84) provide a wealth of raw utility data, they often lack direct “Total Cost” metrics, making it difficult for strategists to quantify the immediate ROI of “Green” retrofits or identify which specific assets offer the fastest capital recovery.
This project plan bridges the gap between technical engineering and real estate strategy. By using the NYC Local Law 84 (LL84) dataset, we can transform raw utility data into actionable investment insights.
In addition, this project transforms the NYC LL84 Benchmarking dataset into a strategic investment roadmap by merging chemical engineering principles with data analytics. By establishing a proxy energy cost of $0.025 per kBtu, the analysis models the financial impact of a 20% efficiency improvement across a commercial portfolio. Using a robust R-based pipeline—including tidyverse for wrangling and PerformanceAnalytics for statistical validation—this study identifies high-priority properties for intervention based on payback periods, energy intensity, and carbon reduction potential.
I will pull the data and handle the “real-world” messiness of missing values and non-standardized units.
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
The next step is to load the installed package
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The next step is to load the dataset needed for this analysis. I used the 5zyy-y8am identifier, which is the current “Master” dataset for LL84 Benchmarking
url <- "https://data.cityofnewyork.us/resource/5zyy-y8am.csv"
raw_data <- read_csv(url)
## Rows: 1000 Columns: 265
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (222): property_name, parent_property_id, parent_property_name, nyc_bor...
## dbl (20): report_year, property_id, postal_code, largest_property_use_type...
## lgl (13): eligible_for_certification_for_report_ped_y_n_, energy_star_cert...
## dttm (10): year_ending, last_modified_date_property, last_modified_date_ele...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
It is pertinent to Focus on NYC Properties with valid square footage. Having noticed inconsistencies in the raw NYC Open Data schema, where numerical values were stored as character strings. I implemented a cleaning pipeline in R to ensure mathematical integrity before calculating the baseline energy intensity.
cleaned_data <- raw_data %>%
filter(!is.na(`energy_star_score`) & `energy_star_score` != "not available") %>%
mutate(across(c(`energy_star_score`, `property_gfa_self_reported`,
`source_eui_kbtu_ft`, `net_emissions_metric_tons`), as.numeric)) %>%
filter(`property_gfa_self_reported` > 0)
## Warning: There were 3 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `across(...)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.
Since the dataset lacks a direct “Total Cost” column, I applied a chemical engineering approach to estimate energy expenditure based on typical NYC commercial utility rates (blending electricity and natural gas). This analysis is based on the blended rate of $0.025 per kBtu.
processed_data <- cleaned_data %>%
mutate(
# Secondary Variable 1: Annual Energy Cost Proxy
proxy_energy_cost = `source_eui_kbtu_ft` * `property_gfa_self_reported` * 0.025,
# Secondary Variable 2: Efficiency Metric
energy_intensity = `source_eui_kbtu_ft`,
# Building Age calculation
building_age = report_year - as.numeric(`year_built`)
) %>%
filter(proxy_energy_cost > 0)
head(processed_data %>% select(`property_name`, building_age, proxy_energy_cost))
## # A tibble: 6 × 3
## property_name building_age proxy_energy_cost
## <chr> <dbl> <dbl>
## 1 58-01 Grand Avenue 92 16818.
## 2 1870 Pelham Parkway South 62 153132.
## 3 Central Building 81 120018.
## 4 215 East 99th Street 124 287923.
## 5 23-25 31 Street 111 263168.
## 6 1680 Ocean Ave 87 185706
It is pertinent to check if a better Energy Star score actually correlates with lower costs, proving the “Green” hypothesis.
ggplot(processed_data, aes(x = `energy_star_score`, y = proxy_energy_cost)) +
geom_point(alpha = 0.4, color = "darkgreen") +
geom_smooth(method = "lm", color = "red") +
labs(title = "Energy Star Score vs. Estimated Annual Energy Cost",
x = "Energy Star Score (Higher is Better)",
y = "Proxy Annual Cost ($)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 115 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 115 rows containing missing values or values outside the scale range
## (`geom_point()`).
Assumption: I assume a standard retrofit cost (e.g., HVAC/Insulation mix) of $2.50 per sq. ft.
# Define Retrofit Assumptions
cost_per_sqft <- 2.50
efficiency_gain <- 0.20 # 20% reduction in eui
# Calculate Payback and Value Lift
strategic_data <- processed_data %>%
mutate(
# Initial Investment (CapEx)
retrofit_cost = `property_gfa_self_reported` * cost_per_sqft,
# Annual Savings
annual_savings = proxy_energy_cost * efficiency_gain,
# Payback Period in Years
payback_years = retrofit_cost / annual_savings,
# ROI over 5 Years
five_year_ROI_percent = ((annual_savings * 5) - retrofit_cost) / retrofit_cost * 100
) %>%
# Filter out outliers or unrealistic payback periods (e.g., > 30 years)
filter(payback_years > 0 & payback_years < 25)
# View top 5 candidates for immediate retrofit
head(strategic_data %>% arrange(payback_years) %>%
select(`property_name`, payback_years, five_year_ROI_percent))
## # A tibble: 6 × 3
## property_name payback_years five_year_ROI_percent
## <chr> <dbl> <dbl>
## 1 Storage Garage 0.158 3071.
## 2 2537 Broadway 0.413 1111.
## 3 Kai Development Corp 0.479 945.
## 4 215 Varick Avenue 0.634 689.
## 5 Coda Hotel 0.646 674.
## 6 629 - Fairfield Inn LGA/Flushing 0.693 622.
While the LL84 data focuses on energy, I use the “Primary Property Type” and “Year Built” to identify footprint trends and the information is presented using heatmap.
library(ggplot2)
library(tidyr)
cor_data <- strategic_data %>%
select(building_age, `net_emissions_metric_tons`,
`source_eui_kbtu_ft`, proxy_energy_cost) %>%
cor(use = "complete.obs")
# Convert matrix to a long format for ggplot
cor_long <- as.data.frame(as.table(cor_data))
ggplot(cor_long, aes(Var1, Var2, fill = Freq)) +
geom_tile() +
geom_text(aes(label = round(Freq, 2)), color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0) +
labs(title = "Correlation Heatmap", x = "", y = "", fill = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
To visualize the “Act” phase, there is need to show how the initial investment is recovered
# Summary for a representative property
portfolio_summary <- strategic_data %>%
summarise(
initial_investment = sum(retrofit_cost) * -1,
Year_1 = sum(annual_savings),
Year_2 = sum(annual_savings),
Year_3 = sum(annual_savings),
Year_4 = sum(annual_savings),
Year_5 = sum(annual_savings)
) %>%
tidyr::pivot_longer(cols = everything(), names_to = "Period", values_to = "Amount") %>%
mutate(Cumulative = cumsum(Amount))
# Waterfall Visualization
ggplot(portfolio_summary, aes(x = Period, y = Amount, fill = Amount > 0)) +
geom_col() +
geom_step(aes(y = Cumulative, group = 1), color = "black", linetype = "dotted") +
scale_fill_manual(values = c("red", "steelblue"), labels = c("Investment", "Savings")) +
labs(title = "Portfolio Payback Waterfall",
subtitle = "Cumulative Offset of Initial CapEx by Energy Savings",
y = "Net Cash Flow ($)") +
theme_light()
# Install if needed: install.packages("PerformanceAnalytics")
install.packages("PerformanceAnalytics")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
library(dplyr)
# 1. Prepare the subset (ensure it's numeric and has no NAs for the chart)
cor_subset <- strategic_data %>%
select(building_age,
net_emissions_metric_tons,
source_eui_kbtu_ft,
proxy_energy_cost) %>%
# Convert to numeric to avoid 'non-numeric' errors
mutate(across(everything(), as.numeric)) %>%
na.omit()
# 2. Generate the PerformanceAnalytics Chart
# We use histogram = TRUE to see the data distribution
chart.Correlation(cor_subset,
histogram = TRUE,
pch = 19,
method = "pearson",
main = "Engineering Correlation: Energy & Carbon Metrics")
The Diagonal: These histograms shows that the data is heavily right-skewed (a few massive buildings account for most of the cost).
The Bottom-Left (Scatter Plots): Since the dots are scattered like a cloud, the correlation is weak, meaning “Building Age” might not be the only thing driving costs.
The Top-Right (Correlation & Significance): * Value: 1.0 is a perfect relationship.
The presence of three stars ***, shows that the p-value is < 0.001, meaning that my findings are statistically “real” and not just a coincidence in the data. Hence, “Proxy Cost” calculation is a reliable substitute for actual bills. Also, this proves that reducing EUI will reliably slash our operating costs.
Integrating PerformanceAnalytics is a great way to move from simple “data cleaning” to “statistical validation.” In an engineering context,the stars shows that my findings are statistically significant and gives my ROI model much more credibility.
Because chart.Correlation uses Base R graphics, it doesn’t naturally “talk” to patchwork (which uses ggplot2). To solve this and keep my Executive Summary in one piece, I will use the gridGraphics and grid libraries to “capture” the base plot and turn it into an object patchwork can handle.
# Install required libraries if you haven't yet
# install.packages(c("PerformanceAnalytics", "gridGraphics", "grid", "patchwork"))
install.packages("gridGraphics")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
install.packages("patchwork")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
install.packages("grid")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
## Warning: package 'grid' is a base package, and should not be updated
install.packages("PerformanceAnalytics")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
library(PerformanceAnalytics)
library(gridGraphics)
## Loading required package: grid
library(grid)
library(patchwork)
library(ggplot2)
library(dplyr)
# --- 1. PREPARE THE DATA ---
# PerformanceAnalytics needs a clean numeric matrix
cor_data <- strategic_data %>%
select(building_age, net_emissions_metric_tons,
source_eui_kbtu_ft, proxy_energy_cost) %>%
mutate(across(everything(), as.numeric)) %>%
na.omit()
# --- 2. CAPTURE PERFORMANCEANALYTICS AS A GGPLOT-COMPATIBLE OBJECT ---
# We wrap the base plot in a function to "grab" it
p1_base <- function() {
chart.Correlation(cor_data, histogram = TRUE, pch = 19,
main = "Statistical Significance of Energy Metrics")
}
# Convert base plot to a 'grob' (Graphical Object)
library(grid)
p1_captured <- grid::grid.grabExpr(p1_base())
# --- 3. CREATE THE STRATEGIST PLOT (ROI MAP) ---
p2 <- ggplot(strategic_data, aes(x = property_gfa_self_reported, y = annual_savings)) +
geom_point(aes(size = five_year_ROI_percent, color = building_age), alpha = 0.5) +
scale_x_log10() +
scale_color_viridis_c(option = "magma") +
labs(title = "Retrofit Priority Map",
subtitle = "Sizing by ROI | Coloring by Building Age",
x = "Total Sq Ft (Log)", y = "Annual Savings ($)") +
theme_minimal()
# --- 4. ASSEMBLE THE EXECUTIVE DASHBOARD ---
# We use 'wrap_elements' to allow the captured base plot to sit next to ggplot
dashboard <- wrap_elements(p1_captured) / p2 +
plot_annotation(
title = "Portfolio Energy Strategy & Statistical Validation",
subtitle = "NYC LL84 Data: Identifying High-ROI Green Interventions",
theme = theme(plot.title = element_text(size = 18, face = "bold"))
)
print(dashboard)
Based on the visual evidence and statistical modeling executed in R, the following facts are established:
• Statistical Reliability of Proxy Metrics: The Correlation Heatmap and Engineering Correlation Matrix confirm a near-perfect relationship (0.99) between Source EUI (Energy Use Intensity) and estimated annual costs. With a p-value of < 0.001 (indicated by the *** significance stars), the proxy cost model is a statistically “real” and reliable substitute for actual utility bills.
• The Efficiency Frontier: The Scatter Plot of Energy Star Scores vs. Costs reveals a clear downward trend; higher efficiency scores directly correlate with lower operating expenditures. This proves the “Green Hypothesis” that increasing a building’s Energy Star score inherently boosts Net Operating Income (NOI).
• Identification of High-Yield Assets: The Optimization Model identified specific “low-hanging fruit” where retrofits are most viable. For example, properties such as the Storage Garage and 2537 Broadway exhibit exceptionally aggressive payback periods of 0.158 and 0.413 years, respectively.
• Portfolio Scalability: The Payback Waterfall Chart demonstrates that an initial capital expenditure (modeled at $2.50/sq ft) can be fully offset by cumulative energy savings within a short multi-year horizon.
• Data Distribution Realities: Histogram analysis on the diagonal of the correlation matrix shows the data is heavily right-skewed, indicating that a small percentage of massive, energy-intensive buildings account for the vast majority of portfolio costs and carbon footprints.