This document presents the Exploratory Data Analysis (EDA) for the project Data-Driven Insights into NYC Property Market: Price Prediction and Transaction Classification. The dataset has already been cleaned and feature-engineered in a previous stage. The goal of this section is to understand the structure and distribution of the cleaned data, examine relationships between variables, and surface insights that guide the two downstream modelling tasks:
sale_price.high_value
(whether a transaction is in the top 25% by price).library(tidyverse) # dplyr, ggplot2, readr, etc.
library(scales) # nicer axis labels (dollar, comma)
library(corrplot) # correlation heatmap
# Adjust the path to wherever your cleaned CSV sits.
df <- read_csv("nyc_property_sales_cleaned.csv", show_col_types = FALSE)
# Ensure categorical fields are treated as factors for plotting.
cat_cols <- c("borough_name", "building_class_category", "tax_class_at_present",
"high_value_label")
df <- df %>% mutate(across(any_of(cat_cols), as.factor))
cat("Dataset dimension:", nrow(df), "rows x", ncol(df), "columns\n")
## Dataset dimension: 27025 rows x 27 columns
# Structure and a compact summary of the key variables.
glimpse(df)
## Rows: 27,025
## Columns: 27
## $ borough <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ neighborhood <chr> "ALPHABET CITY", "ALPHABET CITY", "ALPH…
## $ building_class_category <fct> 07 RENTALS - WALKUP APARTMENTS, 07 RENT…
## $ tax_class_at_present <fct> 2A, 2B, 2A, 2B, 2, 2B, 2, 2A, 2A, 2A, 4…
## $ block <dbl> 392, 402, 404, 406, 387, 400, 376, 391,…
## $ lot <dbl> 6, 21, 55, 32, 153, 21, 14, 19, 4, 5, 3…
## $ building_class_at_present <chr> "C2", "C4", "C2", "C4", "D9", "D1", "C6…
## $ address <chr> "153 AVENUE B", "154 EAST 7TH STREET", …
## $ apartment_number <chr> "Unknown", "Unknown", "Unknown", "Unkno…
## $ zip_code <dbl> 10009, 10009, 10009, 10009, 10009, 1000…
## $ residential_units <dbl> 5, 10, 6, 8, 24, 10, 24, 3, 4, 5, 0, 1,…
## $ commercial_units <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, …
## $ total_units <dbl> 5, 10, 6, 8, 24, 10, 24, 4, 5, 6, 1, 1,…
## $ land_square_feet <dbl> 1633, 2272, 2369, 1750, 4489, 3717, 413…
## $ gross_square_feet <dbl> 6440, 6794, 4615, 4226, 18523, 12350, 1…
## $ year_built <dbl> 1900, 1913, 1900, 1920, 1920, 2009, 192…
## $ tax_class_at_time_of_sale <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 1, 1, …
## $ building_class_at_time_of_sale <chr> "C2", "C4", "C2", "C4", "D9", "D1", "C6…
## $ sale_price <dbl> 6625000, 3936272, 8000000, 3192840, 162…
## $ sale_date <date> 2017-07-19, 2016-09-23, 2016-11-17, 20…
## $ sale_year <dbl> 2017, 2016, 2016, 2016, 2016, 2016, 201…
## $ sale_month <dbl> 7, 9, 11, 9, 11, 10, 6, 11, 1, 4, 9, 12…
## $ borough_name <fct> Manhattan, Manhattan, Manhattan, Manhat…
## $ property_age <dbl> 117, 103, 116, 96, 96, 7, 89, 106, 117,…
## $ price_per_sqft <dbl> 1028.72671, 579.37474, 1733.47779, 755.…
## $ high_value <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ high_value_label <fct> High Value, High Value, High Value, Hig…
summary(df %>% select(sale_price, gross_square_feet, land_square_feet,
property_age, price_per_sqft, total_units))
## sale_price gross_square_feet land_square_feet property_age
## Min. : 90000 Min. : 720 Min. : 200 Min. : 0.00
## 1st Qu.: 446160 1st Qu.: 1360 1st Qu.: 2000 1st Qu.: 57.00
## Median : 635000 Median : 1848 Median : 2500 Median : 86.00
## Mean : 963664 Mean : 2346 Mean : 3139 Mean : 75.88
## 3rd Qu.: 950000 3rd Qu.: 2580 3rd Qu.: 3840 3rd Qu.: 97.00
## Max. :16232000 Max. :21750 Max. :102862 Max. :217.00
## price_per_sqft total_units
## Min. : 11.89 Min. : 0.000
## 1st Qu.: 251.03 1st Qu.: 1.000
## Median : 352.68 Median : 2.000
## Mean : 409.63 Mean : 2.161
## 3rd Qu.: 493.01 3rd Qu.: 2.000
## Max. :4981.88 Max. :2261.000
# Confirm the cleaned data has (almost) no missing values.
missing_tbl <- df %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
filter(n_missing > 0) %>%
arrange(desc(n_missing))
missing_tbl
## # A tibble: 1 × 2
## variable n_missing
## <chr> <int>
## 1 zip_code 8
The cleaning stage removed virtually all missing values; only a
handful remain in zip_code, which is not used as a
modelling feature.
ggplot(df, aes(x = sale_price)) +
geom_histogram(bins = 60, fill = "#2c7fb8", colour = "white") +
scale_x_continuous(labels = dollar_format(scale = 1e-6, suffix = "M")) +
labs(title = "Distribution of Sale Price (raw scale)",
x = "Sale Price", y = "Count") +
theme_minimal()
The raw sale_price distribution is strongly
right-skewed (skewness is about 5.7): most properties sell
below $1M while a long tail reaches above $16M. A skewed target like
this is hard for linear models to fit directly, which motivates a
log transformation.
ggplot(df, aes(x = sale_price)) +
geom_histogram(bins = 60, fill = "#2c7fb8", colour = "white") +
scale_x_log10(labels = dollar_format()) +
labs(title = "Distribution of Sale Price (log10 scale)",
x = "Sale Price (log scale)", y = "Count") +
theme_minimal()
On the log scale the distribution becomes far more symmetric and
bell-shaped. Insight for modelling: using
log(sale_price) as the regression target is likely to
improve linear-model performance.
df %>%
count(high_value_label) %>%
mutate(pct = round(100 * n / sum(n), 1)) %>%
ggplot(aes(x = high_value_label, y = n, fill = high_value_label)) +
geom_col(width = 0.6) +
geom_text(aes(label = paste0(comma(n), "\n(", pct, "%)")), vjust = -0.2) +
scale_fill_manual(values = c("High Value" = "#d95f0e", "Regular Value" = "#7fcdbb")) +
labs(title = "Class Balance of high_value_label",
x = NULL, y = "Count") +
theme_minimal() + theme(legend.position = "none")
The classes are imbalanced: roughly 25% “High Value” and 75% “Regular Value”. The high-value label is defined using the 75th percentile of sale price, so it represents the top quarter of transactions in the dataset. Insight for modelling: accuracy alone will be misleading; the classification stage should also report precision, recall, F1, and ROC-AUC.
num_vars <- c("gross_square_feet", "land_square_feet",
"property_age", "price_per_sqft")
df %>%
select(all_of(num_vars)) %>%
pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 50, fill = "#41b6c4", colour = "white") +
facet_wrap(~ variable, scales = "free", ncol = 2) +
labs(title = "Distributions of Key Numeric Features", x = NULL, y = "Count") +
theme_minimal()
gross_square_feet and land_square_feet are
right-skewed, while property_age spans a wide range (many
older buildings in NYC). price_per_sqft is a derived
feature; note it is calculated from sale_price and must be
excluded as a predictor in both models to avoid data
leakage.
df %>%
mutate(borough_name = fct_reorder(borough_name, sale_price, .fun = median)) %>%
ggplot(aes(x = borough_name, y = sale_price, fill = borough_name)) +
geom_boxplot(outlier.alpha = 0.15) +
scale_y_log10(labels = dollar_format()) +
labs(title = "Sale Price by Borough (log scale)",
x = "Borough", y = "Sale Price (log scale)") +
theme_minimal() + theme(legend.position = "none")
df %>% count(borough_name, sort = TRUE)
## # A tibble: 5 × 2
## borough_name n
## <fct> <int>
## 1 Queens 10453
## 2 Brooklyn 7988
## 3 Staten Island 4744
## 4 Bronx 3200
## 5 Manhattan 640
Manhattan has the highest median prices but very few transactions (is about 640), whereas Queens and Brooklyn dominate the sample volume. Insight: borough is a strong driver of price and should be an important categorical predictor.
ggplot(df, aes(x = gross_square_feet, y = sale_price)) +
geom_point(alpha = 0.12, colour = "#2c7fb8") +
geom_smooth(method = "lm", colour = "#d95f0e", se = FALSE) +
scale_x_log10(labels = comma) +
scale_y_log10(labels = dollar_format()) +
labs(title = "Sale Price vs. Gross Square Feet (log-log scale)",
x = "Gross Square Feet (log scale)", y = "Sale Price (log scale)") +
theme_minimal()
There is a clear positive relationship between floor area and price. Gross square feet is the strongest single numeric predictor, with a correlation of about 0.64 on the raw scale. The log-log view shows a roughly linear trend, again supporting log-transformed modelling.
top_classes <- df %>%
count(building_class_category, sort = TRUE) %>%
slice_head(n = 10) %>%
pull(building_class_category)
df %>%
filter(building_class_category %in% top_classes) %>%
mutate(building_class_category =
fct_reorder(building_class_category, sale_price, .fun = median)) %>%
ggplot(aes(x = sale_price, y = building_class_category)) +
geom_boxplot(fill = "#7fcdbb", outlier.alpha = 0.1) +
scale_x_log10(labels = dollar_format()) +
labs(title = "Sale Price by Building Class (Top 10 by frequency, log scale)",
x = "Sale Price (log scale)", y = NULL) +
theme_minimal()
Different building categories occupy clearly different price bands, confirming this variable carries useful signal for both tasks.
ggplot(df, aes(x = property_age, y = sale_price)) +
geom_point(alpha = 0.1, colour = "#41b6c4") +
geom_smooth(method = "loess", colour = "#d95f0e", se = FALSE) +
scale_x_continuous(limits = c(0, 150)) +
scale_y_log10(labels = dollar_format()) +
labs(title = "Sale Price vs. Property Age",
x = "Property Age (years)", y = "Sale Price (log scale)") +
theme_minimal()
The relationship is weak and non-linear (correlation is about 0.15):
the loess curve shows a shallow U-shape, where both newer buildings and
much older (often renovated or landmark) buildings command higher prices
than mid-aged ones. Note that very old properties (beyond ~120 years)
are rare, so the upward turn at the far right is driven by few records
and should not be over-interpreted. Insight:
property_age contributes modest, non-linear signal.
Tree-based models may capture it better than linear models.
num_for_corr <- df %>%
select(sale_price, gross_square_feet, land_square_feet, property_age,
residential_units, commercial_units, total_units,
price_per_sqft, year_built) %>%
drop_na()
corr_mat <- cor(num_for_corr)
corrplot(corr_mat, method = "color", type = "upper",
tl.col = "black", tl.srt = 45, addCoef.col = "black",
number.cex = 0.7, diag = FALSE,
title = "Correlation Matrix of Numeric Features", mar = c(0,0,1,0))
gross_square_feet is the strongest correlate of
sale_price, with a correlation of about 0.64 on the raw
scale. price_per_sqft is also closely related to price
because it is derived from sale_price. This is a reminder
to exclude both price_per_sqft and high_value
from the predictor set to avoid data leakage.
The heatmap also surfaces two redundancy issues that matter for modelling:
total_units is derived from residential_units
and commercial_units. To reduce redundancy and avoid
multicollinearity, only one unit-related variable should be kept for
modelling.property_age and year_built are perfectly
negatively correlated because
property_age = sale_year - year_built. They represent the
same underlying information, so only one should be kept. In this
project, property_age is preferred because it is more
interpretable.Removing these redundant predictors will make the regression model
more stable and easier to interpret. gross_square_feet is
the strongest correlate of sale_price, with a correlation
of about 0.64 on the raw scale. price_per_sqft is also
closely related to price because it is derived from
sale_price. This is a reminder to exclude both
price_per_sqft and high_value from the
predictor set to avoid data leakage.
The heatmap also surfaces two redundancy issues that matter for modelling:
total_units is derived from residential_units
and commercial_units. To reduce redundancy and avoid
multicollinearity, only one unit-related variable should be kept for
modelling.property_age and year_built are perfectly
negatively correlated because
property_age = sale_year - year_built. They represent the
same underlying information, so only one should be kept. In this
project, property_age is preferred because it is more
interpretable.Removing these redundant predictors will make the regression model more stable and easier to interpret.
df %>%
count(sale_year, sale_month) %>%
mutate(period = as.Date(sprintf("%d-%02d-01", sale_year, sale_month))) %>%
ggplot(aes(x = period, y = n)) +
geom_line(colour = "#2c7fb8", linewidth = 1) +
geom_point(colour = "#2c7fb8") +
labs(title = "Monthly Transaction Volume (2016-2017)",
x = NULL, y = "Number of Sales") +
theme_minimal()
The data covers 2016-2017. Transaction volume fluctuates month to month but shows no extreme seasonality that would dominate modelling.
The exploratory analysis surfaces several insights that directly inform the modelling stage:
log(sale_price) should be considered as the regression
target.price_per_sqft and high_value must
be excluded as predictors because both are derived from
sale_price and may cause data leakage.address and neighborhood should be dropped or
reduced before modelling.total_units is derived from
residential_units and commercial_units. To
reduce redundancy, only one unit-related predictor should be kept.property_age and year_built
represent the same underlying information, so only one of them
should be included in the final model.These findings provide the basis for feature selection, target transformation, leakage prevention, and model evaluation in the regression and classification stages.