1 Introduction and Objectives

This document presents the Exploratory Data Analysis (EDA) for the project Data-Driven Insights into NYC Property Market: Price Prediction and Transaction Classification. The dataset has already been cleaned and feature-engineered in a previous stage. The goal of this section is to understand the structure and distribution of the cleaned data, examine relationships between variables, and surface insights that guide the two downstream modelling tasks:

  • Regression - predicting sale_price.
  • Classification - predicting high_value (whether a transaction is in the top 25% by price).
library(tidyverse)   # dplyr, ggplot2, readr, etc.
library(scales)      # nicer axis labels (dollar, comma)
library(corrplot)    # correlation heatmap
# Adjust the path to wherever your cleaned CSV sits.
df <- read_csv("nyc_property_sales_cleaned.csv", show_col_types = FALSE)

# Ensure categorical fields are treated as factors for plotting.
cat_cols <- c("borough_name", "building_class_category", "tax_class_at_present",
              "high_value_label")
df <- df %>% mutate(across(any_of(cat_cols), as.factor))

cat("Dataset dimension:", nrow(df), "rows x", ncol(df), "columns\n")
## Dataset dimension: 27025 rows x 27 columns

2 Dataset Overview

# Structure and a compact summary of the key variables.
glimpse(df)
## Rows: 27,025
## Columns: 27
## $ borough                        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ neighborhood                   <chr> "ALPHABET CITY", "ALPHABET CITY", "ALPH…
## $ building_class_category        <fct> 07 RENTALS - WALKUP APARTMENTS, 07 RENT…
## $ tax_class_at_present           <fct> 2A, 2B, 2A, 2B, 2, 2B, 2, 2A, 2A, 2A, 4…
## $ block                          <dbl> 392, 402, 404, 406, 387, 400, 376, 391,…
## $ lot                            <dbl> 6, 21, 55, 32, 153, 21, 14, 19, 4, 5, 3…
## $ building_class_at_present      <chr> "C2", "C4", "C2", "C4", "D9", "D1", "C6…
## $ address                        <chr> "153 AVENUE B", "154 EAST 7TH STREET", …
## $ apartment_number               <chr> "Unknown", "Unknown", "Unknown", "Unkno…
## $ zip_code                       <dbl> 10009, 10009, 10009, 10009, 10009, 1000…
## $ residential_units              <dbl> 5, 10, 6, 8, 24, 10, 24, 3, 4, 5, 0, 1,…
## $ commercial_units               <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, …
## $ total_units                    <dbl> 5, 10, 6, 8, 24, 10, 24, 4, 5, 6, 1, 1,…
## $ land_square_feet               <dbl> 1633, 2272, 2369, 1750, 4489, 3717, 413…
## $ gross_square_feet              <dbl> 6440, 6794, 4615, 4226, 18523, 12350, 1…
## $ year_built                     <dbl> 1900, 1913, 1900, 1920, 1920, 2009, 192…
## $ tax_class_at_time_of_sale      <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 1, 1, …
## $ building_class_at_time_of_sale <chr> "C2", "C4", "C2", "C4", "D9", "D1", "C6…
## $ sale_price                     <dbl> 6625000, 3936272, 8000000, 3192840, 162…
## $ sale_date                      <date> 2017-07-19, 2016-09-23, 2016-11-17, 20…
## $ sale_year                      <dbl> 2017, 2016, 2016, 2016, 2016, 2016, 201…
## $ sale_month                     <dbl> 7, 9, 11, 9, 11, 10, 6, 11, 1, 4, 9, 12…
## $ borough_name                   <fct> Manhattan, Manhattan, Manhattan, Manhat…
## $ property_age                   <dbl> 117, 103, 116, 96, 96, 7, 89, 106, 117,…
## $ price_per_sqft                 <dbl> 1028.72671, 579.37474, 1733.47779, 755.…
## $ high_value                     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ high_value_label               <fct> High Value, High Value, High Value, Hig…
summary(df %>% select(sale_price, gross_square_feet, land_square_feet,
                       property_age, price_per_sqft, total_units))
##    sale_price       gross_square_feet land_square_feet  property_age   
##  Min.   :   90000   Min.   :  720     Min.   :   200   Min.   :  0.00  
##  1st Qu.:  446160   1st Qu.: 1360     1st Qu.:  2000   1st Qu.: 57.00  
##  Median :  635000   Median : 1848     Median :  2500   Median : 86.00  
##  Mean   :  963664   Mean   : 2346     Mean   :  3139   Mean   : 75.88  
##  3rd Qu.:  950000   3rd Qu.: 2580     3rd Qu.:  3840   3rd Qu.: 97.00  
##  Max.   :16232000   Max.   :21750     Max.   :102862   Max.   :217.00  
##  price_per_sqft     total_units      
##  Min.   :  11.89   Min.   :   0.000  
##  1st Qu.: 251.03   1st Qu.:   1.000  
##  Median : 352.68   Median :   2.000  
##  Mean   : 409.63   Mean   :   2.161  
##  3rd Qu.: 493.01   3rd Qu.:   2.000  
##  Max.   :4981.88   Max.   :2261.000
# Confirm the cleaned data has (almost) no missing values.
missing_tbl <- df %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
  filter(n_missing > 0) %>%
  arrange(desc(n_missing))

missing_tbl
## # A tibble: 1 × 2
##   variable n_missing
##   <chr>        <int>
## 1 zip_code         8

The cleaning stage removed virtually all missing values; only a handful remain in zip_code, which is not used as a modelling feature.

3 Target Variable 1: Sale Price (Regression Target)

ggplot(df, aes(x = sale_price)) +
  geom_histogram(bins = 60, fill = "#2c7fb8", colour = "white") +
  scale_x_continuous(labels = dollar_format(scale = 1e-6, suffix = "M")) +
  labs(title = "Distribution of Sale Price (raw scale)",
       x = "Sale Price", y = "Count") +
  theme_minimal()

The raw sale_price distribution is strongly right-skewed (skewness is about 5.7): most properties sell below $1M while a long tail reaches above $16M. A skewed target like this is hard for linear models to fit directly, which motivates a log transformation.

ggplot(df, aes(x = sale_price)) +
  geom_histogram(bins = 60, fill = "#2c7fb8", colour = "white") +
  scale_x_log10(labels = dollar_format()) +
  labs(title = "Distribution of Sale Price (log10 scale)",
       x = "Sale Price (log scale)", y = "Count") +
  theme_minimal()

On the log scale the distribution becomes far more symmetric and bell-shaped. Insight for modelling: using log(sale_price) as the regression target is likely to improve linear-model performance.

4 Target Variable 2: High-Value Transactions (Classification Target)

df %>%
  count(high_value_label) %>%
  mutate(pct = round(100 * n / sum(n), 1)) %>%
  ggplot(aes(x = high_value_label, y = n, fill = high_value_label)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = paste0(comma(n), "\n(", pct, "%)")), vjust = -0.2) +
  scale_fill_manual(values = c("High Value" = "#d95f0e", "Regular Value" = "#7fcdbb")) +
  labs(title = "Class Balance of high_value_label",
       x = NULL, y = "Count") +
  theme_minimal() + theme(legend.position = "none")

The classes are imbalanced: roughly 25% “High Value” and 75% “Regular Value”. The high-value label is defined using the 75th percentile of sale price, so it represents the top quarter of transactions in the dataset. Insight for modelling: accuracy alone will be misleading; the classification stage should also report precision, recall, F1, and ROC-AUC.

5 Numeric Feature Distributions

num_vars <- c("gross_square_feet", "land_square_feet",
              "property_age", "price_per_sqft")

df %>%
  select(all_of(num_vars)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 50, fill = "#41b6c4", colour = "white") +
  facet_wrap(~ variable, scales = "free", ncol = 2) +
  labs(title = "Distributions of Key Numeric Features", x = NULL, y = "Count") +
  theme_minimal()

gross_square_feet and land_square_feet are right-skewed, while property_age spans a wide range (many older buildings in NYC). price_per_sqft is a derived feature; note it is calculated from sale_price and must be excluded as a predictor in both models to avoid data leakage.

6 Geographic Patterns: Price by Borough

df %>%
  mutate(borough_name = fct_reorder(borough_name, sale_price, .fun = median)) %>%
  ggplot(aes(x = borough_name, y = sale_price, fill = borough_name)) +
  geom_boxplot(outlier.alpha = 0.15) +
  scale_y_log10(labels = dollar_format()) +
  labs(title = "Sale Price by Borough (log scale)",
       x = "Borough", y = "Sale Price (log scale)") +
  theme_minimal() + theme(legend.position = "none")

df %>% count(borough_name, sort = TRUE)
## # A tibble: 5 × 2
##   borough_name      n
##   <fct>         <int>
## 1 Queens        10453
## 2 Brooklyn       7988
## 3 Staten Island  4744
## 4 Bronx          3200
## 5 Manhattan       640

Manhattan has the highest median prices but very few transactions (is about 640), whereas Queens and Brooklyn dominate the sample volume. Insight: borough is a strong driver of price and should be an important categorical predictor.

7 Price vs. Property Size

ggplot(df, aes(x = gross_square_feet, y = sale_price)) +
  geom_point(alpha = 0.12, colour = "#2c7fb8") +
  geom_smooth(method = "lm", colour = "#d95f0e", se = FALSE) +
  scale_x_log10(labels = comma) +
  scale_y_log10(labels = dollar_format()) +
  labs(title = "Sale Price vs. Gross Square Feet (log-log scale)",
       x = "Gross Square Feet (log scale)", y = "Sale Price (log scale)") +
  theme_minimal()

There is a clear positive relationship between floor area and price. Gross square feet is the strongest single numeric predictor, with a correlation of about 0.64 on the raw scale. The log-log view shows a roughly linear trend, again supporting log-transformed modelling.

8 Price by Building Class Category

top_classes <- df %>%
  count(building_class_category, sort = TRUE) %>%
  slice_head(n = 10) %>%
  pull(building_class_category)

df %>%
  filter(building_class_category %in% top_classes) %>%
  mutate(building_class_category =
           fct_reorder(building_class_category, sale_price, .fun = median)) %>%
  ggplot(aes(x = sale_price, y = building_class_category)) +
  geom_boxplot(fill = "#7fcdbb", outlier.alpha = 0.1) +
  scale_x_log10(labels = dollar_format()) +
  labs(title = "Sale Price by Building Class (Top 10 by frequency, log scale)",
       x = "Sale Price (log scale)", y = NULL) +
  theme_minimal()

Different building categories occupy clearly different price bands, confirming this variable carries useful signal for both tasks.

9 Property Age vs. Price

ggplot(df, aes(x = property_age, y = sale_price)) +
  geom_point(alpha = 0.1, colour = "#41b6c4") +
  geom_smooth(method = "loess", colour = "#d95f0e", se = FALSE) +
  scale_x_continuous(limits = c(0, 150)) +
  scale_y_log10(labels = dollar_format()) +
  labs(title = "Sale Price vs. Property Age",
       x = "Property Age (years)", y = "Sale Price (log scale)") +
  theme_minimal()

The relationship is weak and non-linear (correlation is about 0.15): the loess curve shows a shallow U-shape, where both newer buildings and much older (often renovated or landmark) buildings command higher prices than mid-aged ones. Note that very old properties (beyond ~120 years) are rare, so the upward turn at the far right is driven by few records and should not be over-interpreted. Insight: property_age contributes modest, non-linear signal. Tree-based models may capture it better than linear models.

10 Correlation Among Numeric Features

num_for_corr <- df %>%
  select(sale_price, gross_square_feet, land_square_feet, property_age,
         residential_units, commercial_units, total_units,
         price_per_sqft, year_built) %>%
  drop_na()

corr_mat <- cor(num_for_corr)

corrplot(corr_mat, method = "color", type = "upper",
         tl.col = "black", tl.srt = 45, addCoef.col = "black",
         number.cex = 0.7, diag = FALSE,
         title = "Correlation Matrix of Numeric Features", mar = c(0,0,1,0))

gross_square_feet is the strongest correlate of sale_price, with a correlation of about 0.64 on the raw scale. price_per_sqft is also closely related to price because it is derived from sale_price. This is a reminder to exclude both price_per_sqft and high_value from the predictor set to avoid data leakage.

The heatmap also surfaces two redundancy issues that matter for modelling:

  • Unit-related variables are highly correlated because total_units is derived from residential_units and commercial_units. To reduce redundancy and avoid multicollinearity, only one unit-related variable should be kept for modelling.
  • property_age and year_built are perfectly negatively correlated because property_age = sale_year - year_built. They represent the same underlying information, so only one should be kept. In this project, property_age is preferred because it is more interpretable.

Removing these redundant predictors will make the regression model more stable and easier to interpret. gross_square_feet is the strongest correlate of sale_price, with a correlation of about 0.64 on the raw scale. price_per_sqft is also closely related to price because it is derived from sale_price. This is a reminder to exclude both price_per_sqft and high_value from the predictor set to avoid data leakage.

The heatmap also surfaces two redundancy issues that matter for modelling:

  • Unit-related variables are highly correlated because total_units is derived from residential_units and commercial_units. To reduce redundancy and avoid multicollinearity, only one unit-related variable should be kept for modelling.
  • property_age and year_built are perfectly negatively correlated because property_age = sale_year - year_built. They represent the same underlying information, so only one should be kept. In this project, property_age is preferred because it is more interpretable.

Removing these redundant predictors will make the regression model more stable and easier to interpret.

11 Temporal Pattern: Sales Over Time

df %>%
  count(sale_year, sale_month) %>%
  mutate(period = as.Date(sprintf("%d-%02d-01", sale_year, sale_month))) %>%
  ggplot(aes(x = period, y = n)) +
  geom_line(colour = "#2c7fb8", linewidth = 1) +
  geom_point(colour = "#2c7fb8") +
  labs(title = "Monthly Transaction Volume (2016-2017)",
       x = NULL, y = "Number of Sales") +
  theme_minimal()

The data covers 2016-2017. Transaction volume fluctuates month to month but shows no extreme seasonality that would dominate modelling.

12 Summary of EDA Findings

13 Summary of EDA Findings

The exploratory analysis surfaces several insights that directly inform the modelling stage:

  1. Sale price is heavily right-skewed; therefore, log(sale_price) should be considered as the regression target.
  2. The classification target is imbalanced, with approximately 25% High Value and 75% Regular Value transactions. Therefore, F1-score, recall, ROC-AUC and PR-AUC should be used together with accuracy.
  3. Gross square feet is the strongest numeric predictor of sale price, while borough and building class are important categorical drivers.
  4. price_per_sqft and high_value must be excluded as predictors because both are derived from sale_price and may cause data leakage.
  5. High-cardinality fields such as address and neighborhood should be dropped or reduced before modelling.
  6. Unit-related variables are highly correlated because total_units is derived from residential_units and commercial_units. To reduce redundancy, only one unit-related predictor should be kept.
  7. property_age and year_built represent the same underlying information, so only one of them should be included in the final model.
  8. Property age has a weak and non-linear relationship with sale price, suggesting that tree-based models may capture this pattern better than purely linear models.

These findings provide the basis for feature selection, target transformation, leakage prevention, and model evaluation in the regression and classification stages.