Problem statement

Commercial real estate portfolios often struggle to bridge the gap between technical sustainability goals and financial performance. While datasets like NYC’s Local Law 84 (LL84) provide a wealth of raw utility data, they often lack direct “Total Cost” metrics, making it difficult for strategists to quantify the immediate ROI of “Green” retrofits or identify which specific assets offer the fastest capital recovery.

Introduction

This project plan bridges the gap between technical engineering and real estate strategy. By using the NYC Local Law 84 (LL84) dataset, we can transform raw utility data into actionable investment insights.

In addition, this project transforms the NYC LL84 Benchmarking dataset into a strategic investment roadmap by merging chemical engineering principles with data analytics. By establishing a proxy energy cost of $0.025 per kBtu, the analysis models the financial impact of a 20% efficiency improvement across a commercial portfolio. Using a robust R-based pipeline—including tidyverse for wrangling and PerformanceAnalytics for statistical validation—this study identifies high-priority properties for intervention based on payback periods, energy intensity, and carbon reduction potential.

Data Acquisition & Wrangling (Process)

I will pull the data and handle the “real-world” messiness of missing values and non-standardized units.

install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)

The next step is to load the installed package

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

The next step is to load the dataset needed for this analysis. I used the 5zyy-y8am identifier, which is the current “Master” dataset for LL84 Benchmarking

url <- "https://data.cityofnewyork.us/resource/5zyy-y8am.csv"
raw_data <- read_csv(url)
## Rows: 1000 Columns: 265
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (222): property_name, parent_property_id, parent_property_name, nyc_bor...
## dbl   (20): report_year, property_id, postal_code, largest_property_use_type...
## lgl   (13): eligible_for_certification_for_report_ped_y_n_, energy_star_cert...
## dttm  (10): year_ending, last_modified_date_property, last_modified_date_ele...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Cleaning:

It is pertinent to Focus on NYC Properties with valid square footage. Having noticed inconsistencies in the raw NYC Open Data schema, where numerical values were stored as character strings. I implemented a cleaning pipeline in R to ensure mathematical integrity before calculating the baseline energy intensity.

cleaned_data <- raw_data %>%
  filter(!is.na(`energy_star_score`) & `energy_star_score` != "not available") %>%
  mutate(across(c(`energy_star_score`, `property_gfa_self_reported`, 
                  `source_eui_kbtu_ft`, `net_emissions_metric_tons`), as.numeric)) %>%
  filter(`property_gfa_self_reported` > 0)
## Warning: There were 3 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `across(...)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.

Analyze (The Proxy Energy Cost)

Since the dataset lacks a direct “Total Cost” column, I applied a chemical engineering approach to estimate energy expenditure based on typical NYC commercial utility rates (blending electricity and natural gas). This analysis is based on the blended rate of $0.025 per kBtu.

processed_data <- cleaned_data %>%
  mutate(
    # Secondary Variable 1: Annual Energy Cost Proxy
    proxy_energy_cost = `source_eui_kbtu_ft` * `property_gfa_self_reported` * 0.025,
    
    # Secondary Variable 2: Efficiency Metric
    energy_intensity = `source_eui_kbtu_ft`,
    
    # Building Age calculation
    building_age = report_year - as.numeric(`year_built`)
  ) %>%
  filter(proxy_energy_cost > 0)

Displaying a snippet of the engineered table

head(processed_data %>% select(`property_name`, building_age, proxy_energy_cost))
## # A tibble: 6 × 3
##   property_name             building_age proxy_energy_cost
##   <chr>                            <dbl>             <dbl>
## 1 58-01 Grand Avenue                  92            16818.
## 2 1870 Pelham Parkway South           62           153132.
## 3 Central Building                    81           120018.
## 4 215 East 99th Street               124           287923.
## 5 23-25 31 Street                    111           263168.
## 6 1680 Ocean Ave                      87           185706

Exploratory Data Analysis (Visual Proof)

It is pertinent to check if a better Energy Star score actually correlates with lower costs, proving the “Green” hypothesis.

ggplot(processed_data, aes(x = `energy_star_score`, y = proxy_energy_cost)) +
  geom_point(alpha = 0.4, color = "darkgreen") +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Energy Star Score vs. Estimated Annual Energy Cost",
       x = "Energy Star Score (Higher is Better)",
       y = "Proxy Annual Cost ($)") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 115 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 115 rows containing missing values or values outside the scale range
## (`geom_point()`).

Share & Act (The Optimization Model)

To act, I simulated a 20% improvement in insulation/HVAC efficiency (reducing EUI by 20%) and calculated the 5-year savings using the “What If” Strategy.

intervention_analysis <- processed_data %>%
  mutate(
    current_5yr_cost = proxy_energy_cost * 5,
    post_retrofit_5yr_cost = (proxy_energy_cost * 0.80) * 5,
    five_year_savings = current_5yr_cost - post_retrofit_5yr_cost,
    # Green Premium: Estimating a 3% increase in asset value due to efficiency
    green_premium = proxy_energy_cost * 0.03 * 10 # Simplified valuation multiplier
  )

# Summary for Executive Scorecard
summary_stats <- intervention_analysis %>%
  summarise(
    avg_5yr_savings = mean(five_year_savings, na.rm = TRUE),
    total_portfolio_ghg_reduction = sum(`net_emissions_metric_tons`, na.rm = TRUE) * 0.20
  )

print(summary_stats)
## # A tibble: 1 × 2
##   avg_5yr_savings total_portfolio_ghg_reduction
##             <dbl>                         <dbl>
## 1         272169.                        87137.

To accurately model the Payback Period, I incorporated the upfront capital expenditure (CapEx) of the retrofit. In real estate engineering, this is often calculated as a cost per square foot (ft2).

The Optimization & Payback Model

Assumption: I assume a standard retrofit cost (e.g., HVAC/Insulation mix) of $2.50 per sq. ft.

# Define Retrofit Assumptions
cost_per_sqft <- 2.50
efficiency_gain <- 0.20 # 20% reduction in eui

# Calculate Payback and Value Lift
strategic_data <- processed_data %>%
  mutate(
    # Initial Investment (CapEx)
    retrofit_cost = `property_gfa_self_reported` * cost_per_sqft,
    
    # Annual Savings
    annual_savings = proxy_energy_cost * efficiency_gain,
    
    # Payback Period in Years
    payback_years = retrofit_cost / annual_savings,
    
    # ROI over 5 Years
    five_year_ROI_percent = ((annual_savings * 5) - retrofit_cost) / retrofit_cost * 100
  ) %>%
  # Filter out outliers or unrealistic payback periods (e.g., > 30 years)
  filter(payback_years > 0 & payback_years < 25)

# View top 5 candidates for immediate retrofit
head(strategic_data %>% arrange(payback_years) %>% 
       select(`property_name`, payback_years, five_year_ROI_percent))
## # A tibble: 6 × 3
##   property_name                    payback_years five_year_ROI_percent
##   <chr>                                    <dbl>                 <dbl>
## 1 Storage Garage                           0.158                 3071.
## 2 2537 Broadway                            0.413                 1111.
## 3 Kai Development Corp                     0.479                  945.
## 4 215 Varick Avenue                        0.634                  689.
## 5 Coda Hotel                               0.646                  674.
## 6 629 - Fairfield Inn LGA/Flushing         0.693                  622.

Detailed Visualizations (The Analyst & Strategist Side)

Correlation Matrix: Age vs. Carbon vs. Material

While the LL84 data focuses on energy, I use the “Primary Property Type” and “Year Built” to identify footprint trends and the information is presented using heatmap.

library(ggplot2)
library(tidyr)

cor_data <- strategic_data %>%
  select(building_age, `net_emissions_metric_tons`, 
         `source_eui_kbtu_ft`, proxy_energy_cost) %>%
  cor(use = "complete.obs")

# Convert matrix to a long format for ggplot
cor_long <- as.data.frame(as.table(cor_data))

ggplot(cor_long, aes(Var1, Var2, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = round(Freq, 2)), color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0) +
  labs(title = "Correlation Heatmap", x = "", y = "", fill = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The “Green Premium” Scatter Plot

This plot demonstrates how higher Energy Star scores (better efficiency) lead to lower operating costs, which inherently increases the Net Operating Income (NOI) and property value.

ggplot(strategic_data, aes(x = `energy_star_score`, y = proxy_energy_cost)) +
  geom_point(aes(size = `property_gfa_self_reported`, color = building_age), alpha = 0.5) +
  geom_smooth(method = "lm", color = "darkgreen", linetype = "dashed") +
  scale_color_viridis_c(option = "plasma") +
  labs(title = "The Efficiency Frontier",
       subtitle = "Relationship between Energy Star Score and Annual Operating Costs",
       x = "Energy Star Score",
       y = "Annual Energy Cost ($)",
       size = "Building Size (sqft)",
       color = "Age of Building") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 108 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 108 rows containing missing values or values outside the scale range
## (`geom_point()`).

Investment Waterfall (Payback Over 5 Years)

To visualize the “Act” phase, there is need to show how the initial investment is recovered

# Summary for a representative property
portfolio_summary <- strategic_data %>%
  summarise(
    initial_investment = sum(retrofit_cost) * -1,
    Year_1 = sum(annual_savings),
    Year_2 = sum(annual_savings),
    Year_3 = sum(annual_savings),
    Year_4 = sum(annual_savings),
    Year_5 = sum(annual_savings)
  ) %>%
  tidyr::pivot_longer(cols = everything(), names_to = "Period", values_to = "Amount") %>%
  mutate(Cumulative = cumsum(Amount))

# Waterfall Visualization
ggplot(portfolio_summary, aes(x = Period, y = Amount, fill = Amount > 0)) +
  geom_col() +
  geom_step(aes(y = Cumulative, group = 1), color = "black", linetype = "dotted") +
  scale_fill_manual(values = c("red", "steelblue"), labels = c("Investment", "Savings")) +
  labs(title = "Portfolio Payback Waterfall",
       subtitle = "Cumulative Offset of Initial CapEx by Energy Savings",
       y = "Net Cash Flow ($)") +
  theme_light()

# Install if needed: install.packages("PerformanceAnalytics")
install.packages("PerformanceAnalytics")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
library(dplyr)

# 1. Prepare the subset (ensure it's numeric and has no NAs for the chart)
cor_subset <- strategic_data %>%
  select(building_age, 
         net_emissions_metric_tons, 
         source_eui_kbtu_ft, 
         proxy_energy_cost) %>%
  # Convert to numeric to avoid 'non-numeric' errors
  mutate(across(everything(), as.numeric)) %>%
  na.omit()

# 2. Generate the PerformanceAnalytics Chart
# We use histogram = TRUE to see the data distribution
chart.Correlation(cor_subset, 
                  histogram = TRUE, 
                  pch = 19, 
                  method = "pearson",
                  main = "Engineering Correlation: Energy & Carbon Metrics")

Engineer’s View

The Diagonal: These histograms shows that the data is heavily right-skewed (a few massive buildings account for most of the cost).

The Bottom-Left (Scatter Plots): Since the dots are scattered like a cloud, the correlation is weak, meaning “Building Age” might not be the only thing driving costs.

The Top-Right (Correlation & Significance): * Value: 1.0 is a perfect relationship.

The presence of three stars ***, shows that the p-value is < 0.001, meaning that my findings are statistically “real” and not just a coincidence in the data. Hence, “Proxy Cost” calculation is a reliable substitute for actual bills. Also, this proves that reducing EUI will reliably slash our operating costs.

The Complete Executive Summary (Engineering Edition)

Integrating PerformanceAnalytics is a great way to move from simple “data cleaning” to “statistical validation.” In an engineering context,the stars shows that my findings are statistically significant and gives my ROI model much more credibility.

Because chart.Correlation uses Base R graphics, it doesn’t naturally “talk” to patchwork (which uses ggplot2). To solve this and keep my Executive Summary in one piece, I will use the gridGraphics and grid libraries to “capture” the base plot and turn it into an object patchwork can handle.

# Install required libraries if you haven't yet
# install.packages(c("PerformanceAnalytics", "gridGraphics", "grid", "patchwork"))

install.packages("gridGraphics")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
install.packages("patchwork")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
install.packages("grid")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
## Warning: package 'grid' is a base package, and should not be updated
install.packages("PerformanceAnalytics")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
library(PerformanceAnalytics)
library(gridGraphics)
## Loading required package: grid
library(grid)
library(patchwork)
library(ggplot2)
library(dplyr)

# --- 1. PREPARE THE DATA ---
# PerformanceAnalytics needs a clean numeric matrix
cor_data <- strategic_data %>%
  select(building_age, net_emissions_metric_tons, 
         source_eui_kbtu_ft, proxy_energy_cost) %>%
  mutate(across(everything(), as.numeric)) %>%
  na.omit()

# --- 2. CAPTURE PERFORMANCEANALYTICS AS A GGPLOT-COMPATIBLE OBJECT ---
# We wrap the base plot in a function to "grab" it
p1_base <- function() {
  chart.Correlation(cor_data, histogram = TRUE, pch = 19, 
                    main = "Statistical Significance of Energy Metrics")
}

# Convert base plot to a 'grob' (Graphical Object)
library(grid)
p1_captured <- grid::grid.grabExpr(p1_base())

# --- 3. CREATE THE STRATEGIST PLOT (ROI MAP) ---
p2 <- ggplot(strategic_data, aes(x = property_gfa_self_reported, y = annual_savings)) +
  geom_point(aes(size = five_year_ROI_percent, color = building_age), alpha = 0.5) +
  scale_x_log10() +
  scale_color_viridis_c(option = "magma") +
  labs(title = "Retrofit Priority Map",
       subtitle = "Sizing by ROI | Coloring by Building Age",
       x = "Total Sq Ft (Log)", y = "Annual Savings ($)") +
  theme_minimal()

# --- 4. ASSEMBLE THE EXECUTIVE DASHBOARD ---
# We use 'wrap_elements' to allow the captured base plot to sit next to ggplot
dashboard <- wrap_elements(p1_captured) / p2 + 
  plot_annotation(
    title = "Portfolio Energy Strategy & Statistical Validation",
    subtitle = "NYC LL84 Data: Identifying High-ROI Green Interventions",
    theme = theme(plot.title = element_text(size = 18, face = "bold"))
  )

print(dashboard)

Conclusion

Based on the visual evidence and statistical modeling executed in R, the following facts are established:

• Statistical Reliability of Proxy Metrics: The Correlation Heatmap and Engineering Correlation Matrix confirm a near-perfect relationship (0.99) between Source EUI (Energy Use Intensity) and estimated annual costs. With a p-value of < 0.001 (indicated by the *** significance stars), the proxy cost model is a statistically “real” and reliable substitute for actual utility bills.

• The Efficiency Frontier: The Scatter Plot of Energy Star Scores vs. Costs reveals a clear downward trend; higher efficiency scores directly correlate with lower operating expenditures. This proves the “Green Hypothesis” that increasing a building’s Energy Star score inherently boosts Net Operating Income (NOI).

• Identification of High-Yield Assets: The Optimization Model identified specific “low-hanging fruit” where retrofits are most viable. For example, properties such as the Storage Garage and 2537 Broadway exhibit exceptionally aggressive payback periods of 0.158 and 0.413 years, respectively.

• Portfolio Scalability: The Payback Waterfall Chart demonstrates that an initial capital expenditure (modeled at $2.50/sq ft) can be fully offset by cumulative energy savings within a short multi-year horizon.

• Data Distribution Realities: Histogram analysis on the diagonal of the correlation matrix shows the data is heavily right-skewed, indicating that a small percentage of massive, energy-intensive buildings account for the vast majority of portfolio costs and carbon footprints.