New York City Property Market Part II

Overview

This project seeks to research NYC property sales patterns and identify which features may affect sales prices. Such features may include neighborhood, building type, land/gross square footage, tax class and year built. Our goal is to uncover the price distribution in the citywide and trends that could help predict future property values.

Two data sources will be used: a csv file of NYC rolling sales (citation needed) and the City of New York API endpoint (Department of Finance). The workflow will include importing data, tidying and cleaning the data, followed by summary statistics, visualizations, analysis and conclusions. One new feature we plan to use is plotly. This R package created interactive plots which offered more functions than regular plots.

Libraries

library(tidyverse)
library(plotly)

Import Data

As mentioned before, the main datasets we will be using are NYC rolling sales CSV and City of New York API.

nyc_rolling_sales <- read.csv("https://raw.githubusercontent.com/vincent-usny/607-final-/refs/heads/main/NYC_Rolling_Sales.csv")

nyc_api <- jsonlite::fromJSON("https://data.cityofnewyork.us/resource/5ebm-myj7.json?$limit=10000")

Data Cleaning

Preliminary data cleaning is needed for both data sets. To begin, tidying and cleaning will be performed on each data set separately. Afterwards, further cleaning will be done to make each data set appear more like each other.

NYC Rolling Sales

Begin by creating a new column to store address information.

nyc_rolling_sales <- nyc_rolling_sales %>%
  mutate(address = as.character(paste(ADDRESS, ZIP.CODE, "NY")))

Secondly, remove columns that contain little to no information (i.e. are plagued with NA’s). Then, we will also remove columns that contain information irrelevant to our goals. Note that some of the columns state information at present and during time of sale, we will drop the former.

nyc_rolling_sales <- nyc_rolling_sales %>%
  select(-EASE.MENT, -APARTMENT.NUMBER, -BUILDING.CLASS.AT.PRESENT, -RESIDENTIAL.UNITS, -COMMERCIAL.UNITS, -LOT, -BLOCK, -TAX.CLASS.AT.PRESENT, -ZIP.CODE, -ADDRESS)

Next, information in the ‘Borough’ column will be changed to state the actual name of the borough an not its BBL designation. For reference:
1 - Manhattan
2 - Bronx
3 - Brooklyn
4 - Queens
5 - Staten Island.

nyc_rolling_sales <- nyc_rolling_sales %>%
  mutate(
    BOROUGH = case_when(
      BOROUGH == 1 ~ "Manhattan",
      BOROUGH == 2 ~ "Bronx",
      BOROUGH == 3 ~ "Brooklyn",
      BOROUGH == 4 ~ "Queens",
      BOROUGH == 5 ~ "Staten Island"
    )
  )

The last thing to perform on nyc_rolling_sales will be removing any rows that contain any missing values in any of the columns. The motivation for this is to drastically cut down on analysis frustrations that may come up later.

nyc_rolling_sales <- nyc_rolling_sales %>%
  filter(if_all(everything(), ~ !is.na(.) & . != ""))

NYC API

For the most part, the nyc_api data set is in good order, but there is still some minor cleaning needed. For starters, we will remove the year column. The missing values will also be handled at this step.

nyc_api <- nyc_api %>%
  select(-year)

nyc_api <- nyc_api %>%
  filter(if_all(everything(), ~ !is.na(.) & . != ""))

Next, modify the borough names.

nyc_api <- nyc_api %>%
  mutate(
    borough = case_when(
      borough == 1 ~ "Manhattan",
      borough == 2 ~ "Bronx",
      borough == 3 ~ "Brooklyn",
      borough == 4 ~ "Queens",
      borough == 5 ~ "Staten Island"
    )
  )

The last thing to perform on nyc_api will be adjusting the information in column ‘type_of_home’ to match those that appear in nyc_rolling_sales.

nyc_api <- nyc_api %>%
  mutate(type_of_home = case_when(
    grepl("^01[- ]*ONE FAMILY", type_of_home) ~ "01 ONE FAMILY DWELLINGS",
    grepl("^02[- ]*TWO FAMILY", type_of_home) ~ "02 TWO FAMILY DWELLINGS",
    grepl("^03[- ]*THREE FAMILY", type_of_home) ~ "03 THREE FAMILY DWELLINGS"
  ))

The final data cleaning will be done on both dataframes and will be simply renaming the columns to follow one coherent convention.

colnames(nyc_rolling_sales) <- c("borough", "neighborhood", "building_type", "total_units", "land_square_feet", "gross_square_feet", "year_built", "tax_class", "building_class", "sale_price", "sale_date","address")

colnames(nyc_api) <- c("borough", "neighborhood", "building_type", "number_of_sales", "lowest_sale_price", "average_sale_price", "median_sale_price", "highest_sale_price")

# convert to numeric values
nyc_rolling_sales <- nyc_rolling_sales %>%
  mutate(
    sale_price = as.numeric(gsub("[^0-9]", "", sale_price)),
    gross_square_feet = as.numeric(gsub("[^0-9]", "", gross_square_feet))
  )

———————————————Haoming———————————————–

This part focused on 1-, 2-, 3-family dwellings across boroughs.

Data Visualization

Because of the extreme outliers, we chose to narrow the price range and gross square footage. We also limited the data to 1-,2-,3-family homes.

#filter out outliers
filtered_sales <- nyc_rolling_sales %>%
  filter(
    sale_price >= 10000,
    sale_price <= 5000000,
    gross_square_feet > 0,
    gross_square_feet <= 10000,
    building_type %in% c(
      "01 ONE FAMILY DWELLINGS",
      "02 TWO FAMILY DWELLINGS",
      "03 THREE FAMILY DWELLINGS"
    )
  )

Boxplot: Median Price and IQR by Borough

ggplot(filtered_sales, aes(x = reorder(borough, sale_price, FUN = median), y = sale_price)) +
  geom_boxplot(outlier.alpha = 0.2) +
  stat_summary(
    fun = median,
    geom = "text",
    aes(label = scales::comma(..y..)),
    vjust = -0.5,
    size = 3
  ) +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Sale Price Distribution by Borough",
    x = "Borough",
    y = "Sale Price"
  )

The median sale prices highlighted how property values vary across boroughs. Staten Island and the Bronx had the lowest medians, around $737,000 and $765,000, respectively, followed by Queens at $890,000 and Brooklyn at $1,165,000. Manhattan stands out with a much higher median of $2,100,000.

Distribution of Sale Price by Borough

ggplot(filtered_sales, aes(x = sale_price/1000000)) +
  geom_histogram(bins = 40) +
  scale_x_continuous(labels = scales::comma) +
  facet_wrap(~ borough, scales = "free_y") +
  labs(
    title = "Sale Price Distribution by Borough",
    x = "Sale Price in Million",
    y = "Count"
  )

In the histogram, Brooklyn has a larger, more right-skewed distribution, while the Bronx, Queens, and Staten Island are smaller and peak near $1 million. Manhattan’s prices are more scattered.

Scatter Plot: Sale Price vs Gross Square Feet

ggplot(filtered_sales, aes(x = gross_square_feet, y = sale_price/1000000)) +
  geom_point(alpha = 0.4, size = 1) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.7) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma) +
  facet_wrap(~ borough, scales = "free") +
  labs(
    title = "Sale Price vs Gross Square Feet by Borough",
    x = "Gross Square Feet",
    y = "Sale Price in Million"
  )

## `geom_smooth()` using formula = 'y ~ x'

Brooklyn shows a steep slope with the most points, while the Bronx, Queens, and Staten Island have clear positive trends, with Queens having more transactions. Manhattan shows no clear pattern.

From the above plots, we can see that Manhattan’s scatter in the dwelling plot due to its diverse property mix and wide price range. Unlike other boroughs dominated by one- to three-family homes, Manhattan includes high-end condos, multi-unit residential buildings, and commercial properties, which vary greatly in both size and sale price, creating a less consistent pattern.

Trend: Counts of Montly Sales by Boroughs

A new feature I used in this project was plotly. Plotly could create interactive plots, hovering over to see the values and zooming in or out.

filtered_sales$sale_date <- as.Date(filtered_sales$sale_date, format = "%m/%d/%Y")

monthly_counts <- filtered_sales %>%
  mutate(year_month = format(sale_date, "%Y-%m")) %>%
  group_by(borough, year_month) %>%        # add borough here
  summarise(n_sales = n(), .groups = "drop") %>%
  arrange(year_month, borough)

plot_ly(
  data = monthly_counts,
  x = ~as.Date(paste0(year_month, "-01")),
  y = ~n_sales,
  color = ~borough,                       
  type = 'scatter',
  mode = 'lines+markers',
  text = ~paste(
    "Borough:", borough,
    "<br>Month:", year_month,
    "<br>Sales:", n_sales
  ),
  hoverinfo = 'text'
) %>%
  layout(
    title = "Monthly Number of Family Dwellings Sales by Borough",
    xaxis = list(title = ""),
    yaxis = list(title = "Number of Sales")
  )

Although Brooklyn had higher total and median sale prices than Queens, Queens recorded the most family-dwelling sales over the past year. The boroughs also showed a similar monthly pattern, with sales declining after July 2025 and hitting a low point in September 2025. Manhattan was the exception, since its trend did not follow the same drop.

DATA 607 Final Project

James Chun, Pricilla Nakyazze, Haoming Chen