New York City Property Market

Overview

This project seeks to research NYC property sales patterns and identify which features may affect sales prices. Such features may include neighborhood, building type, land/gross square footage, tax class and year built. Our goal is to uncover the price distribution in the citywide and trends that could help predict future property values.

Two data sources will be used: a csv file of NYC rolling sales (originally NYC Open Data) and the City of New York API endpoint (Department of Finance). The workflow will include importing data, tidying and cleaning the data, followed by summary statistics, visualizations, analysis and conclusions. One new feature we plan to use is plotly. This R package creates interactive plots that users can zoom, hover over and filter.

Libraries

library(tidyverse)
library(knitr)
library(ggplot2)
library(dplyr)

Import Data

As mentioned before, the main datasets we will be using are NYC rolling sales CSV and City of New York API.

nyc_rolling_sales <- read.csv("https://raw.githubusercontent.com/vincent-usny/607-final-/refs/heads/main/NYC_Rolling_Sales.csv")

nyc_api <- jsonlite::fromJSON("https://data.cityofnewyork.us/resource/5ebm-myj7.json?$limit=10000")

Data Cleaning

Preliminary data cleaning is needed for both data sets. To begin, tidying and cleaning will be performed on each data set separately. Afterwards, further cleaning will be done to make each data set appear more like each other.

NYC Rolling Sales

Begin by creating a new column to store address information. This will be used later in the data visualization stage and ggmap incorporation.

nyc_rolling_sales <- nyc_rolling_sales %>%
  mutate(address = as.character(paste(ADDRESS, ZIP.CODE, "NY")))

Secondly, remove columns that contain little to no information (i.e. are plagued with NA’s). Then, we will also remove columns that contain information irrelevant to our goals. Note that some of the columns state information at present and during time of sale, we will drop the former.

nyc_rolling_sales <- nyc_rolling_sales %>%
  select(-EASE.MENT, -APARTMENT.NUMBER, -BUILDING.CLASS.AT.PRESENT, -SALE.DATE, -RESIDENTIAL.UNITS, -COMMERCIAL.UNITS, -LOT, -BLOCK, -TAX.CLASS.AT.PRESENT, -ZIP.CODE, -ADDRESS) 

Next, information in the ‘Borough’ column will be changed to state the actual name of the borough an not its BBL designation. For reference:
1 - Manhattan
2 - Bronx
3 - Brooklyn
4 - Queens
5 - Staten Island.

nyc_rolling_sales <- nyc_rolling_sales %>%
  mutate(
    BOROUGH = case_when(
      BOROUGH == 1 ~ "Manhattan",
      BOROUGH == 2 ~ "Bronx",
      BOROUGH == 3 ~ "Brooklyn",
      BOROUGH == 4 ~ "Queens",
      BOROUGH == 5 ~ "Staten Island"
    )
  )

The last thing to perform on nyc_rolling_sales will be removing any rows that contain any missing values in any of the columns. The motivation for this is to drastically cut down on analysis frustrations that may come up later.

nyc_rolling_sales <- nyc_rolling_sales %>%
  filter(if_all(everything(), ~ !is.na(.) & . != ""))

NYC API

For the most part, the nyc_api data set is in good order, but there is still some minor cleaning needed. For starters, we will remove the year column. The missing values will also be handled at this step.

nyc_api <- nyc_api %>%
  select(-year)

nyc_api <- nyc_api %>%
  filter(if_all(everything(), ~ !is.na(.) & . != ""))

Next, modify the borough names.

nyc_api <- nyc_api %>%
  mutate(
    borough = case_when(
      borough == 1 ~ "Manhattan",
      borough == 2 ~ "Bronx",
      borough == 3 ~ "Brooklyn",
      borough == 4 ~ "Queens",
      borough == 5 ~ "Staten Island"
    )
  )

The last thing to perform on nyc_api will be adjusting the information in column ‘type_of_home’ to match those that appear in nyc_rolling_sales.

nyc_api <- nyc_api %>%
  mutate(type_of_home = case_when(
    grepl("^01[- ]*ONE FAMILY", type_of_home) ~ "01 ONE FAMILY DWELLINGS",
    grepl("^02[- ]*TWO FAMILY", type_of_home) ~ "02 TWO FAMILY DWELLINGS",
    grepl("^03[- ]*THREE FAMILY", type_of_home) ~ "03 THREE FAMILY DWELLINGS"
  ))

The final data cleaning will be done on both dataframes and will be simply renaming the columns to follow one coherent convention.

colnames(nyc_rolling_sales) <- c("borough", "neighborhood", "building category", "total units", "land square feet", "gross square feet", "year built", "tax class", "building class", "sale price", "address")

colnames(nyc_api) <- c("borough", "neighborhood", "building category", "number of sales", "lowest sale price", "average sale price", "median sale price", "highest sale price")

Data Visualization

The data will be visualized in the traditional ways, using plots and graphs, but also using a new function: ggmap. Due to the geographical nature of our data, it is beneficial to be able to display said data on a map. That being said, AI was used during the implementation of ggmap and assisted in teaching us how to use it and some minor implementation cues.

# See scratch code below

###——————————————————Pricilla—————————————– Take the leading #$ from sale price. I filter out records where the relevant field value is zero. This should be done cautiously to avoid misrepresenting the data. But if a propertys sale price is Zero then its not for sale.

# Clean numeric fields by removing non-numeric characters
nyc_rolling_sales_clean <- nyc_rolling_sales %>%
  mutate(
    `gross square feet` = as.numeric(gsub("[^0-9]", "", `gross square feet`)),
    `sale price` = as.numeric(gsub("[^0-9]", "", `sale price`)),
    `gross square feet` = as.numeric(gsub("[^0-9]", "", `gross square feet`)),
    `land square feet` = as.numeric(gsub("[^0-9]", "", `land square feet`)),
    `year built` = as.numeric(gsub("[^0-9]", "", `year built`))
  )

# 1. Property with the highest gross square feet
highest_gsf <- nyc_rolling_sales_clean %>%
  arrange(desc(`gross square feet`)) %>%
  slice(1)

# 2. Property with the lowest non-zero sale price
lowest_sale <- nyc_rolling_sales_clean %>%
  filter(`sale price` > 0) %>%
  arrange(`sale price`) %>%
  slice(1)

Which property has the lowest sale price that is not Zero?

lowest_sale_building_per_borough <- nyc_rolling_sales_clean %>%
  filter(`sale price` > 0) %>%              # remove missing / zero values
  group_by(borough) %>%
  slice_min(order_by = `sale price`, n = 1, with_ties = FALSE) %>%
  ungroup()

kable(
  lowest_sale_building_per_borough,
  caption = "Property with Lowest sale price greater than 0 by Borough"
)
Property with Lowest sale price greater than 0 by Borough
borough neighborhood building category total units land square feet gross square feet year built tax class building class sale price address
Bronx WILLIAMSBRIDGE 01 ONE FAMILY DWELLINGS 1 1751 1044 1935 1 A5 1 1050 EAST 229 STREET 10466 NY
Brooklyn CYPRESS HILLS 03 THREE FAMILY DWELLINGS 3 2761 2400 1931 1 C0 1 24 HIGHLAND PLACE 11208 NY
Manhattan FLATIRON 21 OFFICE BUILDINGS 3 2472 9104 1915 4 O5 1 8 WEST 28 STREET 10001 NY
Queens FLUSHING-NORTH 29 COMMERCIAL GARAGES 1 5226 4360 1940 4 G2 1 126-12 NORTHERN BOULEVARD 11368 NY
Staten Island ELTINGVILLE 02 TWO FAMILY DWELLINGS 2 4000 2556 1993 1 B2 1 43 MEMPHIS AVENUE 10312 NY

What is the largest building by gross square feet that is for sale by Borough?

largest_building_per_borough <- nyc_rolling_sales_clean %>%
  filter(`gross square feet` > 0) %>%              # remove missing / zero values
  group_by(borough) %>%
  slice_max(order_by = `gross square feet`, n = 1, with_ties = FALSE) %>%
  ungroup()

kable(
  largest_building_per_borough,
  caption = "Property with Highest Gross Square Feet by Borough"
)
Property with Highest Gross Square Feet by Borough
borough neighborhood building category total units land square feet gross square feet year built tax class building class sale price address
Bronx HUNTS POINT 30 WAREHOUSES 6 2245711 674889 1973 4 E9 605000 355 FOOD CENTER DRIVE, E11 10474 NY
Brooklyn BAY RIDGE 40 SELECTED GOVERNMENTAL FACILITIES 2 7446955 1545708 1931 4 Y4 2000000 4123 FORT HAMILTON 11209 NY
Manhattan CLINTON 08 RENTALS - ELEVATOR APARTMENTS 1,033 85996 1192139 2015 2 D7 0 606 WEST 57TH STREET 10019 NY
Queens MIDDLE VILLAGE 22 STORE BUILDINGS 90 522928 1004200 1922 4 K6 72000000 80-00 COOPER AVENUE 11385 NY
Staten Island ROSSVILLE-CHARLESTON 22 STORE BUILDINGS 2 538824 155282 2005 4 K8 0 2750 VETERANS ROAD WEST 10309 NY

SUMMARY STATISTICS

That will be Useful for understanding the distribution of key numeric variables. All buildings of $0 sale price are excluded.

borough_summary <- (
  nyc_rolling_sales_clean %>%
    filter(`sale price` > 0) %>%  # only include positive sale prices
    group_by(borough) %>%
    summarise(
      min_price = min(`sale price`, na.rm = TRUE),
      max_price = max(`sale price`, na.rm = TRUE),
      median_price = median(`sale price`, na.rm = TRUE),
      avg_sale_price = mean(`sale price`, na.rm = TRUE),  # NEW
      avg_gsf = mean(`gross square feet`, na.rm = TRUE),
      avg_lsf = mean(`land square feet`, na.rm = TRUE),
      avg_year_built = mean(`year built`, na.rm = TRUE),
      .groups = "drop"
    )
)
kable(borough_summary, caption = "Borough Summary Statistics")
Borough Summary Statistics
borough min_price max_price median_price avg_sale_price avg_gsf avg_lsf avg_year_built
Bronx 1 60000000 800000 1558133.3 6625.949 6470.218 1941.369
Brooklyn 1 235400000 1230000 2167320.7 6462.376 4159.629 1930.621
Manhattan 1 1080000000 5300000 17245150.1 45711.216 5278.593 1922.314
Queens 1 83558226 892250 1172771.4 2683.707 3948.714 1941.864
Staten Island 1 37605875 730000 802300.5 1891.890 4421.920 1968.474

Question

Which NYC borough has the most expensive listing based on maximum sale price, and which borough has the oldest buildings on average?

Answer

Manhattan has the most expensive listing, with a maximum sale price of $1,080,000,000 and it also has the oldest buildings on average, with an average year built of 1922.

Maximum Sale Price by Borough

ggplot(borough_summary, aes(x = borough, y = max_price, fill = borough)) +
  geom_col() +
  geom_text(
    aes(label = round(avg_year_built, 0)),
    vjust = 1.5,          # moves text inside the bar
    color = "white",      # readable against fill color
    size = 3
  ) +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Maximum Sale Price by Borough",
    x = "Borough",
    y = "Max Sale Price ($)"
  ) +
  theme_minimal() +
  theme(
    text = element_text(size = 14),
    legend.position = "none"
  )

How do sale prices vary?

Distribution of Sale Prices by Borough

ggplot(
  nyc_rolling_sales_clean %>% filter(`sale price` > 0),
  aes(x = borough, y = `sale price`, fill = borough)
) +
  geom_boxplot() +
  scale_fill_manual(values = c(
    "Bronx" = "#1f78b4",
    "Brooklyn" = "#33a02c",
    "Manhattan" = "#e31a1c",
    "Queens" = "#ff7f00",
    "Staten Island" = "#6a3d9a"
  )) +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Distribution of Sale Prices by Borough",
    x = "Borough",
    y = "Sale Price ($)"
  ) +
  theme_minimal() +
  theme(
    text = element_text(size = 14),
    legend.position = "none"
  )

Demonstrating that Distribution of Sale Prices by Borough Log Scale better visualises the data.

ggplot(
  nyc_rolling_sales_clean %>% filter(`sale price` > 0),
  aes(x = borough, y = `sale price`, fill = borough)
) +
  geom_boxplot() +
  scale_y_log10(labels = scales::comma) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Distribution of Sale Prices by Borough (Log Scale)",
    x = "Borough",
    y = "Sale Price ($)"
  ) +
  theme_minimal() +
  theme(
    text = element_text(size = 14),
    legend.position = "none"
  )

Most sales occur under 2 million, which is tiny compared to the highest sale nearly 1 Billion.

As a result, the boxes and medians get compressed into a thin line near zero.

So visually, you only see small black dots and very short boxplots near the bottom.

Manhattan has the highest concentration of extremely expensive sales seen by the tallest dots (outliers).

These represent Manhattan`s luxury buildings and commercial properties selling for hundreds of millions.

Brooklyn has a few outliers close to $300M.

Queens and Bronx outliers are lower (tens of millions).

Staten Island is the borough with lowest priced housing.

RANGE OF COST FOR ONE FAMILY HOME IN ALL NYC BOROUGHS.

Because the ORIGINAL dataset includes Commercial towers, Mixed use buildings, Large multifamily buildings, Luxury skyscrapers and Outlier sales in the hundreds of millions these push the y axis so high that the typical homes become invisible. So I will filter to only look at the price range of 1 family homes in all boroughs.

# Filter for one-family homes with valid sale prices
nyc_residential <- nyc_rolling_sales_clean %>%
  filter(
    `sale price` > 0,
    `building category` == "01 ONE FAMILY DWELLINGS"
  )

# Compute IQR (typical range) per borough
one_family_iqr <- nyc_residential %>%
  group_by(borough) %>%
  summarise(
    Q1 = quantile(`sale price`, 0.25, na.rm = TRUE),
    Median = median(`sale price`, na.rm = TRUE),
    Q3 = quantile(`sale price`, 0.75, na.rm = TRUE),
    IQR = IQR(`sale price`, na.rm = TRUE),
    Typical_Range = paste0(
      scales::comma(round(Q1)), " -- ", scales::comma(round(Q3))
    ),
    .groups = "drop"
  )

kable(one_family_iqr, caption = "IQR of Sale Prices for One-Family Homes")
IQR of Sale Prices for One-Family Homes
borough Q1 Median Q3 IQR Typical_Range
Bronx 500000 647500 760000 260000 500,000 – 760,000
Brooklyn 655000 895000 1475000 820000 655,000 – 1,475,000
Manhattan 4900000 7450000 11312500 6412500 4,900,000 – 11,312,500
Queens 620000 800000 993000 373000 620,000 – 993,000
Staten Island 560000 690000 825000 265000 560,000 – 825,000

Q1 (First Quartile) which is 25 percent of homes sold for a price or less, marks the lower end of the typical market. In the Bronx 25 percent of one family homes sold for less than or equal to 500k.

Q3 (Third Quartile) stands for 75 percent of homes sold for this price or less. It marks the upper end of typical home prices. In Queens 75 percent of homes sold for less than or equal to 993k.

The IQR (Interquartile Range) that is calculated by Q3 minus Q1 measures the spread of the middle 50 percent of home prices. A larger IQR shows more variability in typical home prices.

Manhattan`s IQR at 6,412,500 is extremely wide, meaning significant price variation.

1, 2, 3 Family Dwellings counts

building_counts_0123 <- nyc_rolling_sales_clean %>%
  filter(
    `building category` %in% c(
      "01 ONE FAMILY DWELLINGS",
      "02 TWO FAMILY DWELLINGS",
      "03 THREE FAMILY DWELLINGS"
    )
  ) %>%
  group_by(borough, `building category`) %>%
  summarise(count = n(), .groups = "drop")

kable(building_counts_0123, caption = "1, 2, 3 Family Dwellings counts")
1, 2, 3 Family Dwellings counts
borough building category count
Bronx 01 ONE FAMILY DWELLINGS 1212
Bronx 02 TWO FAMILY DWELLINGS 1608
Bronx 03 THREE FAMILY DWELLINGS 665
Brooklyn 01 ONE FAMILY DWELLINGS 3279
Brooklyn 02 TWO FAMILY DWELLINGS 5620
Brooklyn 03 THREE FAMILY DWELLINGS 2025
Manhattan 01 ONE FAMILY DWELLINGS 155
Manhattan 02 TWO FAMILY DWELLINGS 141
Manhattan 03 THREE FAMILY DWELLINGS 73
Queens 01 ONE FAMILY DWELLINGS 8313
Queens 02 TWO FAMILY DWELLINGS 5332
Queens 03 THREE FAMILY DWELLINGS 1320
Staten Island 01 ONE FAMILY DWELLINGS 3913
Staten Island 02 TWO FAMILY DWELLINGS 1703
Staten Island 03 THREE FAMILY DWELLINGS 49

Which Borough has more 1 family vs 2 family homes?

Queens has the most one family dwelling homes and Brooklyn has the most two family homes for sale.

ggplot(building_counts_0123, aes(x = borough, y = count, fill = `building category`)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_x_discrete(expand = expansion(mult = 0)) +
  labs(
    title = "Building Category Counts (01–03) by Borough",
    x = "Borough",
    y = "Count",
    fill = "Building Category"
  ) +
  theme_minimal(base_size = 14)

Sale Price Distribution for One Family Homes by Borough

ggplot(
  nyc_residential,
  aes(x = borough, y = `sale price`, fill = borough)
) +
  geom_boxplot() +
  scale_y_log10(labels = scales::comma) +
  labs(
    title = "Sale Price Distribution for One Family Homes by Borough",
    x = "Borough",
    y = "Sale Price ($)"
  ) +
  theme_minimal()

Manhattan consistently shows the highest median sale prices, with the box (representing the interquartile range or middle 50% of sales) positioned significantly higher than all other boroughs. the average price for single family homes in Manhattan can be many millions of dollars.

The Bronx generally has the lowest median sale prices for one family homes, making it the most affordable borough according to the data presented, where a significant number of sales have been recorded at or below $500,000. Brooklyn, Queens, and Staten Island exhibit similar sale price ranges in the middle of the distribution, with their boxes overlapping considerably on the log-scale Y-axis.

AVERAGE SALE PRICE BY BOROUGH

Staten Island has the most affordable housing options based on Average sale price.

suppressPackageStartupMessages(library(scales))
ggplot(borough_summary, aes(x = reorder(borough, avg_sale_price), y = avg_sale_price, fill = borough)) +
  geom_col(show.legend = FALSE, width = 0.7) +
  geom_text(aes(label = scales::dollar(avg_sale_price, scale = 1, suffix = "")),
            vjust = -0.5, size = 4, fontface = "bold") +
  scale_y_continuous(labels = scales::dollar_format(scale = 1, prefix = "$")) +
  expand_limits(y = max(borough_summary$avg_sale_price) * 1.1) +  # extend y-axis by 10%
  labs(
    title = "Average Sale Price by Borough in NYC",
    x = "Borough",
    y = "Average Sale Price ($)",
    caption = "Data source: NYC Rolling Sales"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    axis.text.x = element_text(face = "bold", color = "#2C3E50"),
    axis.title = element_text(face = "bold")
  ) +
  scale_fill_brewer(palette = "Set2")

Relation ship between price and square foot Manhattan and the Bronx

In Manhattan an average of square feet of 45711 costs 17 million while in the Bronx around 6625 square feet cost 1.6 million.

ggplot(borough_summary, aes(x = avg_gsf, y = avg_sale_price)) +
  geom_point(aes(color = borough), size = 5, show.legend = FALSE) +
  geom_text(aes(label = borough), vjust = -1, fontface = "bold") +  # label points
  geom_smooth(method = "lm", se = FALSE, color = "darkblue", linetype = "dashed") +
  scale_y_continuous(labels = scales::dollar_format(scale = 1, prefix = "$")) +
  labs(
    title = "Relationship Between Average Gross Square Feet and Sale Price by Borough",
    x = "Average Gross Square Feet",
    y = "Average Sale Price ($)"
  ) +
  theme_minimal(base_size = 14)
## `geom_smooth()` using formula = 'y ~ x'

Taking a closer look at the Relationship between price and square feet in Manhattan.

#Filter for Manhattan
manhattan_summary <- borough_summary %>%
  filter(borough == "Manhattan")

# Scatter plot of avg_gsf vs avg_sale_price for Manhattan
ggplot(manhattan_summary, aes(x = avg_gsf, y = avg_sale_price)) +
  geom_point(size = 5, color = "darkred") +
  geom_smooth(method = "lm", se = FALSE, color = "darkblue", linetype = "dashed") +
  scale_y_continuous(labels = scales::dollar_format(scale = 1, prefix = "$")) +
  labs(
    title = "Manhattan Average Gross Square Feet and Sale Price in Manhattan",
    x = "Average Gross Square Feet",
    y = "Average Sale Price ($)"
  ) +
  theme_minimal(base_size = 14)
## `geom_smooth()` using formula = 'y ~ x'

Taking a closer look at the Relation ship between price and square feet in the Bronx

# Filter for Bronx
bronx_summary <- borough_summary %>%
  filter(borough == "Bronx")

ggplot(bronx_summary, aes(x = avg_gsf, y = avg_sale_price)) +
  geom_point(size = 5, color = "darkred") +
  geom_smooth(
    method = "lm",
    se = FALSE,
    color = "darkblue",
    linetype = "dashed",
    na.rm = TRUE
  ) +
  scale_y_continuous(labels = scales::dollar_format(scale = 1, prefix = "$")) +
  labs(
    title = "Bronx Average Gross Square Feet and Sale Price in Bronx",
    x = "Average Gross Square Feet",
    y = "Average Sale Price ($)"
  ) +
  theme_minimal(base_size = 14)
## `geom_smooth()` using formula = 'y ~ x'

———————————————Pricilla———————————————–

References

Tableau Visualisation https://public.tableau.com/app/profile/pricilla.nakyazze/viz/FInalWorkbook_17654919165070/Dashboard1?publish=yes “NYC Citywide Rolling Calendar Sales: NYC Open Data.” NYC Citywide Rolling Calendar Sales | NYC Open Data, Department of Finance, 18 Nov. 2025, data.cityofnewyork.us/dataset/NYC-Citywide-Rolling-Calendar-Sales/usep-8jbt/about_data.