This project seeks to research NYC property sales patterns and identify which features may affect sales prices. Such features may include neighborhood, building type, land/gross square footage, tax class and year built. Our goal is to uncover the price distribution in the citywide and trends that could help predict future property values.
Two data sources will be used: a csv file of NYC rolling sales (originally NYC Open Data) and the City of New York API endpoint (Department of Finance). The workflow will include importing data, tidying and cleaning the data, followed by summary statistics, visualizations, analysis and conclusions. One new feature we plan to use is plotly. This R package creates interactive plots that users can zoom, hover over and filter.
library(tidyverse)
library(knitr)
library(ggplot2)
library(dplyr)
As mentioned before, the main datasets we will be using are NYC rolling sales CSV and City of New York API.
nyc_rolling_sales <- read.csv("https://raw.githubusercontent.com/vincent-usny/607-final-/refs/heads/main/NYC_Rolling_Sales.csv")
nyc_api <- jsonlite::fromJSON("https://data.cityofnewyork.us/resource/5ebm-myj7.json?$limit=10000")
Preliminary data cleaning is needed for both data sets. To begin, tidying and cleaning will be performed on each data set separately. Afterwards, further cleaning will be done to make each data set appear more like each other.
Begin by creating a new column to store address information. This will be used later in the data visualization stage and ggmap incorporation.
nyc_rolling_sales <- nyc_rolling_sales %>%
mutate(address = as.character(paste(ADDRESS, ZIP.CODE, "NY")))
Secondly, remove columns that contain little to no information (i.e. are plagued with NA’s). Then, we will also remove columns that contain information irrelevant to our goals. Note that some of the columns state information at present and during time of sale, we will drop the former.
nyc_rolling_sales <- nyc_rolling_sales %>%
select(-EASE.MENT, -APARTMENT.NUMBER, -BUILDING.CLASS.AT.PRESENT, -SALE.DATE, -RESIDENTIAL.UNITS, -COMMERCIAL.UNITS, -LOT, -BLOCK, -TAX.CLASS.AT.PRESENT, -ZIP.CODE, -ADDRESS)
Next, information in the ‘Borough’ column will be changed to state
the actual name of the borough an not its BBL designation. For
reference:
1 - Manhattan
2 - Bronx
3 - Brooklyn
4 -
Queens
5 - Staten Island.
nyc_rolling_sales <- nyc_rolling_sales %>%
mutate(
BOROUGH = case_when(
BOROUGH == 1 ~ "Manhattan",
BOROUGH == 2 ~ "Bronx",
BOROUGH == 3 ~ "Brooklyn",
BOROUGH == 4 ~ "Queens",
BOROUGH == 5 ~ "Staten Island"
)
)
The last thing to perform on nyc_rolling_sales will be removing any rows that contain any missing values in any of the columns. The motivation for this is to drastically cut down on analysis frustrations that may come up later.
nyc_rolling_sales <- nyc_rolling_sales %>%
filter(if_all(everything(), ~ !is.na(.) & . != ""))
For the most part, the nyc_api data set is in good order, but there is still some minor cleaning needed. For starters, we will remove the year column. The missing values will also be handled at this step.
nyc_api <- nyc_api %>%
select(-year)
nyc_api <- nyc_api %>%
filter(if_all(everything(), ~ !is.na(.) & . != ""))
Next, modify the borough names.
nyc_api <- nyc_api %>%
mutate(
borough = case_when(
borough == 1 ~ "Manhattan",
borough == 2 ~ "Bronx",
borough == 3 ~ "Brooklyn",
borough == 4 ~ "Queens",
borough == 5 ~ "Staten Island"
)
)
The last thing to perform on nyc_api will be adjusting the information in column ‘type_of_home’ to match those that appear in nyc_rolling_sales.
nyc_api <- nyc_api %>%
mutate(type_of_home = case_when(
grepl("^01[- ]*ONE FAMILY", type_of_home) ~ "01 ONE FAMILY DWELLINGS",
grepl("^02[- ]*TWO FAMILY", type_of_home) ~ "02 TWO FAMILY DWELLINGS",
grepl("^03[- ]*THREE FAMILY", type_of_home) ~ "03 THREE FAMILY DWELLINGS"
))
The final data cleaning will be done on both dataframes and will be simply renaming the columns to follow one coherent convention.
colnames(nyc_rolling_sales) <- c("borough", "neighborhood", "building category", "total units", "land square feet", "gross square feet", "year built", "tax class", "building class", "sale price", "address")
colnames(nyc_api) <- c("borough", "neighborhood", "building category", "number of sales", "lowest sale price", "average sale price", "median sale price", "highest sale price")
The data will be visualized in the traditional ways, using plots and graphs, but also using a new function: ggmap. Due to the geographical nature of our data, it is beneficial to be able to display said data on a map. That being said, AI was used during the implementation of ggmap and assisted in teaching us how to use it and some minor implementation cues.
# See scratch code below
###——————————————————Pricilla—————————————– Take the leading #$ from
sale price. I filter out records where the relevant field value is zero.
This should be done cautiously to avoid misrepresenting the data. But if
a propertys sale price is Zero then its not for sale.
# Clean numeric fields by removing non-numeric characters
nyc_rolling_sales_clean <- nyc_rolling_sales %>%
mutate(
`gross square feet` = as.numeric(gsub("[^0-9]", "", `gross square feet`)),
`sale price` = as.numeric(gsub("[^0-9]", "", `sale price`)),
`gross square feet` = as.numeric(gsub("[^0-9]", "", `gross square feet`)),
`land square feet` = as.numeric(gsub("[^0-9]", "", `land square feet`)),
`year built` = as.numeric(gsub("[^0-9]", "", `year built`))
)
# 1. Property with the highest gross square feet
highest_gsf <- nyc_rolling_sales_clean %>%
arrange(desc(`gross square feet`)) %>%
slice(1)
# 2. Property with the lowest non-zero sale price
lowest_sale <- nyc_rolling_sales_clean %>%
filter(`sale price` > 0) %>%
arrange(`sale price`) %>%
slice(1)
Which property has the lowest sale price that is not Zero?
lowest_sale_building_per_borough <- nyc_rolling_sales_clean %>%
filter(`sale price` > 0) %>% # remove missing / zero values
group_by(borough) %>%
slice_min(order_by = `sale price`, n = 1, with_ties = FALSE) %>%
ungroup()
kable(
lowest_sale_building_per_borough,
caption = "Property with Lowest sale price greater than 0 by Borough"
)
| borough | neighborhood | building category | total units | land square feet | gross square feet | year built | tax class | building class | sale price | address |
|---|---|---|---|---|---|---|---|---|---|---|
| Bronx | WILLIAMSBRIDGE | 01 ONE FAMILY DWELLINGS | 1 | 1751 | 1044 | 1935 | 1 | A5 | 1 | 1050 EAST 229 STREET 10466 NY |
| Brooklyn | CYPRESS HILLS | 03 THREE FAMILY DWELLINGS | 3 | 2761 | 2400 | 1931 | 1 | C0 | 1 | 24 HIGHLAND PLACE 11208 NY |
| Manhattan | FLATIRON | 21 OFFICE BUILDINGS | 3 | 2472 | 9104 | 1915 | 4 | O5 | 1 | 8 WEST 28 STREET 10001 NY |
| Queens | FLUSHING-NORTH | 29 COMMERCIAL GARAGES | 1 | 5226 | 4360 | 1940 | 4 | G2 | 1 | 126-12 NORTHERN BOULEVARD 11368 NY |
| Staten Island | ELTINGVILLE | 02 TWO FAMILY DWELLINGS | 2 | 4000 | 2556 | 1993 | 1 | B2 | 1 | 43 MEMPHIS AVENUE 10312 NY |
What is the largest building by gross square feet that is for sale by Borough?
largest_building_per_borough <- nyc_rolling_sales_clean %>%
filter(`gross square feet` > 0) %>% # remove missing / zero values
group_by(borough) %>%
slice_max(order_by = `gross square feet`, n = 1, with_ties = FALSE) %>%
ungroup()
kable(
largest_building_per_borough,
caption = "Property with Highest Gross Square Feet by Borough"
)
| borough | neighborhood | building category | total units | land square feet | gross square feet | year built | tax class | building class | sale price | address |
|---|---|---|---|---|---|---|---|---|---|---|
| Bronx | HUNTS POINT | 30 WAREHOUSES | 6 | 2245711 | 674889 | 1973 | 4 | E9 | 605000 | 355 FOOD CENTER DRIVE, E11 10474 NY |
| Brooklyn | BAY RIDGE | 40 SELECTED GOVERNMENTAL FACILITIES | 2 | 7446955 | 1545708 | 1931 | 4 | Y4 | 2000000 | 4123 FORT HAMILTON 11209 NY |
| Manhattan | CLINTON | 08 RENTALS - ELEVATOR APARTMENTS | 1,033 | 85996 | 1192139 | 2015 | 2 | D7 | 0 | 606 WEST 57TH STREET 10019 NY |
| Queens | MIDDLE VILLAGE | 22 STORE BUILDINGS | 90 | 522928 | 1004200 | 1922 | 4 | K6 | 72000000 | 80-00 COOPER AVENUE 11385 NY |
| Staten Island | ROSSVILLE-CHARLESTON | 22 STORE BUILDINGS | 2 | 538824 | 155282 | 2005 | 4 | K8 | 0 | 2750 VETERANS ROAD WEST 10309 NY |
SUMMARY STATISTICS
That will be Useful for understanding the distribution of key numeric variables. All buildings of $0 sale price are excluded.
borough_summary <- (
nyc_rolling_sales_clean %>%
filter(`sale price` > 0) %>% # only include positive sale prices
group_by(borough) %>%
summarise(
min_price = min(`sale price`, na.rm = TRUE),
max_price = max(`sale price`, na.rm = TRUE),
median_price = median(`sale price`, na.rm = TRUE),
avg_sale_price = mean(`sale price`, na.rm = TRUE), # NEW
avg_gsf = mean(`gross square feet`, na.rm = TRUE),
avg_lsf = mean(`land square feet`, na.rm = TRUE),
avg_year_built = mean(`year built`, na.rm = TRUE),
.groups = "drop"
)
)
kable(borough_summary, caption = "Borough Summary Statistics")
| borough | min_price | max_price | median_price | avg_sale_price | avg_gsf | avg_lsf | avg_year_built |
|---|---|---|---|---|---|---|---|
| Bronx | 1 | 60000000 | 800000 | 1558133.3 | 6625.949 | 6470.218 | 1941.369 |
| Brooklyn | 1 | 235400000 | 1230000 | 2167320.7 | 6462.376 | 4159.629 | 1930.621 |
| Manhattan | 1 | 1080000000 | 5300000 | 17245150.1 | 45711.216 | 5278.593 | 1922.314 |
| Queens | 1 | 83558226 | 892250 | 1172771.4 | 2683.707 | 3948.714 | 1941.864 |
| Staten Island | 1 | 37605875 | 730000 | 802300.5 | 1891.890 | 4421.920 | 1968.474 |
Question
Which NYC borough has the most expensive listing based on maximum sale price, and which borough has the oldest buildings on average?
Answer
Manhattan has the most expensive listing, with a maximum sale price of $1,080,000,000 and it also has the oldest buildings on average, with an average year built of 1922.
Maximum Sale Price by Borough
ggplot(borough_summary, aes(x = borough, y = max_price, fill = borough)) +
geom_col() +
geom_text(
aes(label = round(avg_year_built, 0)),
vjust = 1.5, # moves text inside the bar
color = "white", # readable against fill color
size = 3
) +
scale_y_continuous(labels = scales::comma) +
labs(
title = "Maximum Sale Price by Borough",
x = "Borough",
y = "Max Sale Price ($)"
) +
theme_minimal() +
theme(
text = element_text(size = 14),
legend.position = "none"
)
How do sale prices vary?
Distribution of Sale Prices by Borough
ggplot(
nyc_rolling_sales_clean %>% filter(`sale price` > 0),
aes(x = borough, y = `sale price`, fill = borough)
) +
geom_boxplot() +
scale_fill_manual(values = c(
"Bronx" = "#1f78b4",
"Brooklyn" = "#33a02c",
"Manhattan" = "#e31a1c",
"Queens" = "#ff7f00",
"Staten Island" = "#6a3d9a"
)) +
scale_y_continuous(labels = scales::comma) +
labs(
title = "Distribution of Sale Prices by Borough",
x = "Borough",
y = "Sale Price ($)"
) +
theme_minimal() +
theme(
text = element_text(size = 14),
legend.position = "none"
)
Demonstrating that Distribution of Sale Prices by Borough Log Scale better visualises the data.
ggplot(
nyc_rolling_sales_clean %>% filter(`sale price` > 0),
aes(x = borough, y = `sale price`, fill = borough)
) +
geom_boxplot() +
scale_y_log10(labels = scales::comma) +
scale_fill_brewer(palette = "Set2") +
labs(
title = "Distribution of Sale Prices by Borough (Log Scale)",
x = "Borough",
y = "Sale Price ($)"
) +
theme_minimal() +
theme(
text = element_text(size = 14),
legend.position = "none"
)
Most sales occur under 2 million, which is tiny compared to the highest sale nearly 1 Billion.
As a result, the boxes and medians get compressed into a thin line near zero.
So visually, you only see small black dots and very short boxplots near the bottom.
Manhattan has the highest concentration of extremely expensive sales seen by the tallest dots (outliers).
These represent Manhattan`s luxury buildings and commercial properties selling for hundreds of millions.
Brooklyn has a few outliers close to $300M.
Queens and Bronx outliers are lower (tens of millions).
Staten Island is the borough with lowest priced housing.
RANGE OF COST FOR ONE FAMILY HOME IN ALL NYC BOROUGHS.
Because the ORIGINAL dataset includes Commercial towers, Mixed use buildings, Large multifamily buildings, Luxury skyscrapers and Outlier sales in the hundreds of millions these push the y axis so high that the typical homes become invisible. So I will filter to only look at the price range of 1 family homes in all boroughs.
# Filter for one-family homes with valid sale prices
nyc_residential <- nyc_rolling_sales_clean %>%
filter(
`sale price` > 0,
`building category` == "01 ONE FAMILY DWELLINGS"
)
# Compute IQR (typical range) per borough
one_family_iqr <- nyc_residential %>%
group_by(borough) %>%
summarise(
Q1 = quantile(`sale price`, 0.25, na.rm = TRUE),
Median = median(`sale price`, na.rm = TRUE),
Q3 = quantile(`sale price`, 0.75, na.rm = TRUE),
IQR = IQR(`sale price`, na.rm = TRUE),
Typical_Range = paste0(
scales::comma(round(Q1)), " -- ", scales::comma(round(Q3))
),
.groups = "drop"
)
kable(one_family_iqr, caption = "IQR of Sale Prices for One-Family Homes")
| borough | Q1 | Median | Q3 | IQR | Typical_Range |
|---|---|---|---|---|---|
| Bronx | 500000 | 647500 | 760000 | 260000 | 500,000 – 760,000 |
| Brooklyn | 655000 | 895000 | 1475000 | 820000 | 655,000 – 1,475,000 |
| Manhattan | 4900000 | 7450000 | 11312500 | 6412500 | 4,900,000 – 11,312,500 |
| Queens | 620000 | 800000 | 993000 | 373000 | 620,000 – 993,000 |
| Staten Island | 560000 | 690000 | 825000 | 265000 | 560,000 – 825,000 |
Q1 (First Quartile) which is 25 percent of homes sold for a price or less, marks the lower end of the typical market. In the Bronx 25 percent of one family homes sold for less than or equal to 500k.
Q3 (Third Quartile) stands for 75 percent of homes sold for this price or less. It marks the upper end of typical home prices. In Queens 75 percent of homes sold for less than or equal to 993k.
The IQR (Interquartile Range) that is calculated by Q3 minus Q1 measures the spread of the middle 50 percent of home prices. A larger IQR shows more variability in typical home prices.
Manhattan`s IQR at 6,412,500 is extremely wide, meaning significant price variation.
1, 2, 3 Family Dwellings counts
building_counts_0123 <- nyc_rolling_sales_clean %>%
filter(
`building category` %in% c(
"01 ONE FAMILY DWELLINGS",
"02 TWO FAMILY DWELLINGS",
"03 THREE FAMILY DWELLINGS"
)
) %>%
group_by(borough, `building category`) %>%
summarise(count = n(), .groups = "drop")
kable(building_counts_0123, caption = "1, 2, 3 Family Dwellings counts")
| borough | building category | count |
|---|---|---|
| Bronx | 01 ONE FAMILY DWELLINGS | 1212 |
| Bronx | 02 TWO FAMILY DWELLINGS | 1608 |
| Bronx | 03 THREE FAMILY DWELLINGS | 665 |
| Brooklyn | 01 ONE FAMILY DWELLINGS | 3279 |
| Brooklyn | 02 TWO FAMILY DWELLINGS | 5620 |
| Brooklyn | 03 THREE FAMILY DWELLINGS | 2025 |
| Manhattan | 01 ONE FAMILY DWELLINGS | 155 |
| Manhattan | 02 TWO FAMILY DWELLINGS | 141 |
| Manhattan | 03 THREE FAMILY DWELLINGS | 73 |
| Queens | 01 ONE FAMILY DWELLINGS | 8313 |
| Queens | 02 TWO FAMILY DWELLINGS | 5332 |
| Queens | 03 THREE FAMILY DWELLINGS | 1320 |
| Staten Island | 01 ONE FAMILY DWELLINGS | 3913 |
| Staten Island | 02 TWO FAMILY DWELLINGS | 1703 |
| Staten Island | 03 THREE FAMILY DWELLINGS | 49 |
Which Borough has more 1 family vs 2 family homes?
Queens has the most one family dwelling homes and Brooklyn has the most two family homes for sale.
ggplot(building_counts_0123, aes(x = borough, y = count, fill = `building category`)) +
geom_bar(stat = "identity", position = "dodge") +
scale_x_discrete(expand = expansion(mult = 0)) +
labs(
title = "Building Category Counts (01–03) by Borough",
x = "Borough",
y = "Count",
fill = "Building Category"
) +
theme_minimal(base_size = 14)
Sale Price Distribution for One Family Homes by Borough
ggplot(
nyc_residential,
aes(x = borough, y = `sale price`, fill = borough)
) +
geom_boxplot() +
scale_y_log10(labels = scales::comma) +
labs(
title = "Sale Price Distribution for One Family Homes by Borough",
x = "Borough",
y = "Sale Price ($)"
) +
theme_minimal()
Manhattan consistently shows the highest median sale prices, with the box (representing the interquartile range or middle 50% of sales) positioned significantly higher than all other boroughs. the average price for single family homes in Manhattan can be many millions of dollars.
The Bronx generally has the lowest median sale prices for one family homes, making it the most affordable borough according to the data presented, where a significant number of sales have been recorded at or below $500,000. Brooklyn, Queens, and Staten Island exhibit similar sale price ranges in the middle of the distribution, with their boxes overlapping considerably on the log-scale Y-axis.
AVERAGE SALE PRICE BY BOROUGH
Staten Island has the most affordable housing options based on Average sale price.
suppressPackageStartupMessages(library(scales))
ggplot(borough_summary, aes(x = reorder(borough, avg_sale_price), y = avg_sale_price, fill = borough)) +
geom_col(show.legend = FALSE, width = 0.7) +
geom_text(aes(label = scales::dollar(avg_sale_price, scale = 1, suffix = "")),
vjust = -0.5, size = 4, fontface = "bold") +
scale_y_continuous(labels = scales::dollar_format(scale = 1, prefix = "$")) +
expand_limits(y = max(borough_summary$avg_sale_price) * 1.1) + # extend y-axis by 10%
labs(
title = "Average Sale Price by Borough in NYC",
x = "Borough",
y = "Average Sale Price ($)",
caption = "Data source: NYC Rolling Sales"
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face = "bold", hjust = 0.5),
axis.text.x = element_text(face = "bold", color = "#2C3E50"),
axis.title = element_text(face = "bold")
) +
scale_fill_brewer(palette = "Set2")
Relation ship between price and square foot Manhattan and the Bronx
In Manhattan an average of square feet of 45711 costs 17 million while in the Bronx around 6625 square feet cost 1.6 million.
ggplot(borough_summary, aes(x = avg_gsf, y = avg_sale_price)) +
geom_point(aes(color = borough), size = 5, show.legend = FALSE) +
geom_text(aes(label = borough), vjust = -1, fontface = "bold") + # label points
geom_smooth(method = "lm", se = FALSE, color = "darkblue", linetype = "dashed") +
scale_y_continuous(labels = scales::dollar_format(scale = 1, prefix = "$")) +
labs(
title = "Relationship Between Average Gross Square Feet and Sale Price by Borough",
x = "Average Gross Square Feet",
y = "Average Sale Price ($)"
) +
theme_minimal(base_size = 14)
## `geom_smooth()` using formula = 'y ~ x'
Taking a closer look at the Relationship between price and square feet in Manhattan.
#Filter for Manhattan
manhattan_summary <- borough_summary %>%
filter(borough == "Manhattan")
# Scatter plot of avg_gsf vs avg_sale_price for Manhattan
ggplot(manhattan_summary, aes(x = avg_gsf, y = avg_sale_price)) +
geom_point(size = 5, color = "darkred") +
geom_smooth(method = "lm", se = FALSE, color = "darkblue", linetype = "dashed") +
scale_y_continuous(labels = scales::dollar_format(scale = 1, prefix = "$")) +
labs(
title = "Manhattan Average Gross Square Feet and Sale Price in Manhattan",
x = "Average Gross Square Feet",
y = "Average Sale Price ($)"
) +
theme_minimal(base_size = 14)
## `geom_smooth()` using formula = 'y ~ x'
Taking a closer look at the Relation ship between price and square feet in the Bronx
# Filter for Bronx
bronx_summary <- borough_summary %>%
filter(borough == "Bronx")
ggplot(bronx_summary, aes(x = avg_gsf, y = avg_sale_price)) +
geom_point(size = 5, color = "darkred") +
geom_smooth(
method = "lm",
se = FALSE,
color = "darkblue",
linetype = "dashed",
na.rm = TRUE
) +
scale_y_continuous(labels = scales::dollar_format(scale = 1, prefix = "$")) +
labs(
title = "Bronx Average Gross Square Feet and Sale Price in Bronx",
x = "Average Gross Square Feet",
y = "Average Sale Price ($)"
) +
theme_minimal(base_size = 14)
## `geom_smooth()` using formula = 'y ~ x'
Tableau Visualisation https://public.tableau.com/app/profile/pricilla.nakyazze/viz/FInalWorkbook_17654919165070/Dashboard1?publish=yes “NYC Citywide Rolling Calendar Sales: NYC Open Data.” NYC Citywide Rolling Calendar Sales | NYC Open Data, Department of Finance, 18 Nov. 2025, data.cityofnewyork.us/dataset/NYC-Citywide-Rolling-Calendar-Sales/usep-8jbt/about_data.