New York City Real Estate Price

Introduction

People relocating for work, moving to another city from another, or simply planning a vacation can benefit from knowing details about real estate prices in different cities and areas. Understanding the cost of living and having an idea about the real estate market can help individuals and families make or break a decision to move. As a city with a vast amount of job opportunities for different industries that can satisfy one’s needs and pursue one’s aspirations and one of the most popular destinations for tourism globally, New York City attracts thousands and millions of people to reside and visit each year. However, New York City is also knowns as the most expensive city to live in in the United States. According to a biannual report by the Economist Intelligence Unit in 2021, The Big Apple ranks 6th as the most expensive city in the world (https://www.eiu.com/n/tel-aviv-overtakes-paris-to-become-worlds-most-expensive-city/). Based on my dataset, the median cost of homes in the five boroughs of New York City is $815,000, compared with the national median of $346,900 in 2021 (https://www.cnn.com/2022/01/20/homes/us-nar-home-sales-december-and-2021/index.html).

This project aims to create a clear visualization of the price of properties sold between April 2021 to February 2022 in New York City by analyzing geospatial data and making various maps reflecting property prices in different parts of New York City. Five datasets from the New York City Department of Finance have been used for this project (https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page). Each includes New York City sales data for a borough, including Manhattan, Bronx, Brooklyn, Queens, and Staten Island.

Before Importing

Before importing the datasets into R, I first deleted some columns in excel that are not related to the interest of my project, including tax class at present, block, lot, easement, building class at present, address, apartment number, residential units, commercial units, total units, land square, feet, tax class at the time of sale, building class at the time of sale, and sale date.

Importing and Cleaning Datasets

I imported the updated five datasets into R in the following code and did further steps to clean the data.

Manhattan

library(dplyr)
library(readxl)
manhattan <- read_excel("Map Project/manhattan.xlsx", 
                        skip = 1)
View(manhattan)

count(manhattan)

## # A tibble: 1 × 1
##       n
##   <int>
## 1 23615

sum(is.na(manhattan$`GROSS SQUARE FEET`))

## [1] 21718

sum(is.na(manhattan$`YEAR BUILT`))

## [1] 2437

manhattan %>%
  filter(`SALE PRICE` >=1000) %>%
  subset(select = -c(`GROSS SQUARE FEET`,`YEAR BUILT`)) -> manhattan2
View(manhattan2)

After counting the amount of data for Manhattan in general, the amount of data in the GROSS SQUARE FEET column, and the YEAR BUILT column in dataset manhattan, I found that these two columns contain a large number of missing data which will significantly impact the results if I analyze data from these two columns. Therefore, I decided to exclude GROSS SQUARE FEET and YEAR BUILT columns in my dataset, using the select function.

Additionally, I have also realized that some properties were sold at an unreasonable low price that could not correctly reflect the real estate market in New York City. I conjecture that they might be sold between family members and friends or bestowed on someone. It is necessary to exclude these properties from my data analysis. Hens, I used the filter function to filter out any properties sold under $1,000 for dataset manhattan. The new dataset for Manhattan containing properties sold equal to or over $1,000 without column GROSS SQUARE FEET and column YEAR BUILT is called manhattan2.

Same issues have been observed in the other four datasets, representing properties sold in the Bronx, Brooklyn, Queens, and Staten Island. Many missing data are noticed for columns GROSS SQUARE FEET and YEAR BUILT and properties sold at significantly low prices.

Bronx

library(readxl)
bronx <- read_excel("Map Project/bronx.xlsx", 
                    skip = 1)
View(bronx)

count(bronx)

## # A tibble: 1 × 1
##       n
##   <int>
## 1  7490

sum(is.na(bronx$`GROSS SQUARE FEET`))

## [1] 1929

sum(is.na(bronx$`YEAR BUILT`))

## [1] 631

bronx %>%
  filter(`SALE PRICE` >=1000) %>%
  subset(select = -c(`GROSS SQUARE FEET`,`YEAR BUILT`)) -> bronx2
View(bronx2)

With 7,490 amounts of data in dataset Bronx, there are 1,929 missing data for GROSS SQUARE FEET and 631 missing data for YEAR BUILT.

Brooklyn

library(readxl)
brooklyn <- read_excel("Map Project/brooklyn.xlsx", 
                       skip = 1)
View(brooklyn)

count(brooklyn)

## # A tibble: 1 × 1
##       n
##   <int>
## 1 28398

sum(is.na(brooklyn$`GROSS SQUARE FEET`))

## [1] 11762

sum(is.na(brooklyn$`YEAR BUILT`))

## [1] 2314

brooklyn %>%
  filter(`SALE PRICE` >=1000) %>%
  subset(select = -c(`GROSS SQUARE FEET`,`YEAR BUILT`)) -> brooklyn2
View(brooklyn2)

Dataset brooklyn contains 28,398 amounts of data, with missing data of 11,762 for GROSS SQUARE FEET and 2,314 for YEAR BUILT.

Queens

library(readxl)
queens <- read_excel("Map Project/queens.xlsx", 
                     skip = 1)
View(queens)

count(queens)

## # A tibble: 1 × 1
##       n
##   <int>
## 1 29219

sum(is.na(queens$`GROSS SQUARE FEET`))

## [1] 9900

sum(is.na(queens$`YEAR BUILT`))

## [1] 1210

queens %>%
  filter(`SALE PRICE` >=1000) %>%
  subset(select = -c(`GROSS SQUARE FEET`,`YEAR BUILT`)) -> queens2
View(queens2)

For Queens, there are 29,219 data in the dataset; however, having 9,900 missing data for GROSS SQUARE FEET and 1,210 missing data for YEAR BUILT.

Staten Island

library(readxl)
staten <- read_excel("Map Project/statenisland.xlsx", 
                     skip = 1)
View(staten)

count(staten)

## # A tibble: 1 × 1
##       n
##   <int>
## 1  8977

sum(is.na(staten$`GROSS SQUARE FEET`))

## [1] 882

sum(is.na(staten$`YEAR BUILT`))

## [1] 484

staten %>%
  filter(`SALE PRICE` >=1000) %>%
  subset(select = -c(`GROSS SQUARE FEET`,`YEAR BUILT`)) -> staten2
View(staten2)

In Staten Island, 882 values are missing for GROSS SQUARE FEET, and 484 data are missing for YEAR BUILT, while including 8,877 data in the dataset staten.

Due to the same reasons as data for Manhattan, I used the select function to exclude columns of GROSS SQUARE FEET and YEAR BUILT and only comprised properties with sale price greater or equal to $1,000 in the other four datasets using the filter function. Four new datasets bronx2, brooklyn2, queens2, and statens2 are created.

Merging Datasets

In the code below, I used full_join to merge two datasets each time by columns of BOROUGH, NEIGHBORHOOD, BUILDING CLASS CATEGORY, ZIP CODE, and SALE PRICE.

data1 <- full_join(manhattan2,bronx2,
                  by=c("BOROUGH","NEIGHBORHOOD","BUILDING CLASS CATEGORY","ZIP CODE","SALE PRICE"))
View(data1)

data2 <- full_join(brooklyn2,queens2,
                   by=c("BOROUGH","NEIGHBORHOOD","BUILDING CLASS CATEGORY","ZIP CODE","SALE PRICE"))
View(data2)

data3 <- full_join(data1,data2,
                   by=c("BOROUGH","NEIGHBORHOOD","BUILDING CLASS CATEGORY","ZIP CODE","SALE PRICE"))
View(data3)

data <- full_join(data3,staten2,
                   by=c("BOROUGH","NEIGHBORHOOD","BUILDING CLASS CATEGORY","ZIP CODE","SALE PRICE"))
View(data)

data$BOROUGH <- factor(data$BOROUGH,
                    levels = c(1,2,3,4,5),
                    labels = c("Manhattan", "Bronx", "Brooklyn","Queens","Staten Island"))
View(data)

Dataset data1 includes data from manhattan2 and bronx2, and dataset data1 contains data from brooklyn2, queens2. Dataset data3 is a combination of data1 and data2. My merged dataset data embracing all information for five different boroughs in New York City is created after merging dataset data3 with dataset staten2. Finally, I assigned the names of the five boroughs in New York City to the matching number in the BOROUGH column: 1 for Manhattan, 2 for Bronx, 3 for Brooklyn, 4 for Queens, and 5 for Staten Island.

Exporting Data

I loaded library writexl in R and used write_xlsx function to export the merged dataset data as an elsx file.

library(writexl)
write_xlsx(data, "C:\\Users\\winnieyang\\Desktop\\MEA\\Map Project\\data.xlsx")

Most Expensive Neighborhood in New York City

data %>%
  group_by(NEIGHBORHOOD,BOROUGH) %>%
  mutate(`MEDIAN SALE PRICE`= median(`SALE PRICE`)) -> median
View(median)

median %>%
  select(`NEIGHBORHOOD`, `BOROUGH`, `MEDIAN SALE PRICE`) %>%
  group_by(`MEDIAN SALE PRICE`) %>%
  arrange(desc(`MEDIAN SALE PRICE`)) -> median2
View(median2)

median2[!duplicated(median2$`MEDIAN SALE PRICE`),] %>%
  head(20) -> median3
View(median3)

median3$`MEDIAN SALE PRICE` <- round(median3$`MEDIAN SALE PRICE`,digits = 0)
knitr::kable(median3)

NEIGHBORHOOD	BOROUGH	MEDIAN SALE PRICE
JAVITS CENTER	Manhattan	4341500
COBBLE HILL	Brooklyn	3700000
LITTLE ITALY	Manhattan	3372500
HARLEM-WEST	Manhattan	3000000
CIVIC CENTER	Manhattan	2905000
TRIBECA	Manhattan	2750000
ROSSVILLE-PORT MOBIL	Staten Island	2350000
SOHO	Manhattan	2314741
FRESH KILLS	Staten Island	2300000
FASHION	Manhattan	2180000
FLATIRON	Manhattan	1832500
CITY ISLAND-PELHAM STRIP	Bronx	1800000
UPPER EAST SIDE (96-110)	Manhattan	1762500
HUNTS POINT	Bronx	1650000
GREENWICH VILLAGE-WEST	Manhattan	1640000
BOERUM HILL	Brooklyn	1635000
CHELSEA	Manhattan	1630000
DOWNTOWN-FULTON FERRY	Brooklyn	1605000
BROOKLYN HEIGHTS	Brooklyn	1520000
PARK SLOPE	Brooklyn	1500000

The table above shows the top 20 most expensive neighborhoods in New York City based on the median sale price of properties between April 2021 to February 2022. More than half of the top 20 most expensive neighborhoods are located in Manhattan, with several in Brooklyn. Based on this table, there is not a single top 20 most expensive neighborhood located in Queens.

Less Expensive Neighborhood in New York City

median %>%
  select(`NEIGHBORHOOD`, `BOROUGH`, `MEDIAN SALE PRICE`) %>%
  group_by(`MEDIAN SALE PRICE`) %>%
  arrange(`MEDIAN SALE PRICE`) -> median4
View(median4)

median4[!duplicated(median4$`MEDIAN SALE PRICE`),] %>%
  head(20) -> median5
View(median5)

median5$`MEDIAN SALE PRICE` <- round(median5$`MEDIAN SALE PRICE`,digits = 0)
knitr::kable(median5)

NEIGHBORHOOD	BOROUGH	MEDIAN SALE PRICE
PARKCHESTER	Bronx	200000
BEDFORD PARK/NORWOOD	Bronx	310000
VAN CORTLANDT PARK	Bronx	315000
BRIARWOOD	Queens	329000
OAKLAND GARDENS	Queens	330000
GLEN OAKS	Queens	345000
KEW GARDENS	Queens	350000
RIVERDALE	Bronx	363000
ARROCHAR-SHORE ACRES	Staten Island	368500
BROAD CHANNEL	Queens	392200
KINGSBRIDGE/JEROME PARK	Bronx	395000
REGO PARK	Queens	405000
TOMPKINSVILLE	Staten Island	425000
HOWARD BEACH	Queens	427000
HILLCREST	Queens	430000
FLUSHING MEADOW PARK	Queens	430500
PORT IVORY	Staten Island	435000
FOREST HILLS	Queens	445000
CONEY ISLAND	Brooklyn	450000
INWOOD	Manhattan	455000

According to the median sale price of properties between April 2021 to February 2022, I also created a table showing the top 20 less expensive neighborhoods in New York City. Half the neighborhood in this table is located in Queens, five of them in the Bronx. Only one neighborhood in Brooklyn and one neighborhood in Manhattan contain in this table, and they take the last two places.

Note: Area with a zip code of 10013 is excluded on the map Real State Price in Brooklyn. Zip code 10013 belongs to Manhattan, but one row of data is inputted incorrectly in Brooklyn in the original dataset. Therefore, I decided to add a filter on the zip code to delete the data, including the zip code 10013.

Conclusion

It is a fact that New York City is the most expensive city to live in in the U.S.; however, the range of sale prices of properties within the city is widely distributed. With Manhattan ranking first for the median sale price of 1250k, being the most expensive borough in The Big Apple, there are cheaper and more affordable options in the city: Staten Island and the Bronx, each having a median sale price of 610k and 605k.

Within each borough, the range of properties price in different areas and neighborhoods can be broad. For example, the median sale price of properties based on Manhattan’s zip codes ranges from 420k to 44,960k. The area with a zip code of 11224 has the lowest median sale price of 485k, while 11231 is the most expensive zip code, which is 1,515k in Brooklyn.

Future studies can focus on investigating factors affecting the sale price for an area or a neighborhood in New York City by looking at some of the most expensive and less expensive areas or neighborhoods and the facilities and infrastructures around them. Some of the possible aspects include public transportations (bus routes, subways, and train stations), major highways and roads, popular tourist destinations, central business districts, and the crime rate in each borough, area, and neighborhood.

Other Codes

The Median Cost of Properties

median(data$`SALE PRICE`)

## [1] 815000

I used median function to calculate the median cost of $815,000 for New York City properties and use it into the introduction for this project.

Zip Code

manhattan_zipcode <- unique(manhattan2$`ZIP CODE`)
sort(manhattan_zipcode, decreasing = FALSE)

bronx_zipcode <- unique(bronx2$`ZIP CODE`)
sort(bronx_zipcode, decreasing = FALSE)

brooklyn_zipcode <- unique(brooklyn2$`ZIP CODE`)
sort(brooklyn_zipcode, decreasing = FALSE)

queens_zipcode <- unique(queens2$`ZIP CODE`)
sort(queens_zipcode, decreasing = FALSE)

staten_zipcode <- unique(staten2$`ZIP CODE`)
sort(staten_zipcode, decreasing = FALSE)

Using the unique function allowed me to remove duplicated elements and printed zip codes for five different boroughs. I also applied the sort function to sort them in ascending order. Having all zip codes for boroughs outputted in R benefited me in creating groups for zip codes in Tableau and added different colors based on boroughs when mapping.