People relocating for work, moving to another city from another, or simply planning a vacation can benefit from knowing details about real estate prices in different cities and areas. Understanding the cost of living and having an idea about the real estate market can help individuals and families make or break a decision to move. As a city with a vast amount of job opportunities for different industries that can satisfy one’s needs and pursue one’s aspirations and one of the most popular destinations for tourism globally, New York City attracts thousands and millions of people to reside and visit each year. However, New York City is also knowns as the most expensive city to live in in the United States. According to a biannual report by the Economist Intelligence Unit in 2021, The Big Apple ranks 6th as the most expensive city in the world (https://www.eiu.com/n/tel-aviv-overtakes-paris-to-become-worlds-most-expensive-city/). Based on my dataset, the median cost of homes in the five boroughs of New York City is $815,000, compared with the national median of $346,900 in 2021 (https://www.cnn.com/2022/01/20/homes/us-nar-home-sales-december-and-2021/index.html).
This project aims to create a clear visualization of the price of properties sold between April 2021 to February 2022 in New York City by analyzing geospatial data and making various maps reflecting property prices in different parts of New York City. Five datasets from the New York City Department of Finance have been used for this project (https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page). Each includes New York City sales data for a borough, including Manhattan, Bronx, Brooklyn, Queens, and Staten Island.
Before importing the datasets into R, I first deleted some columns in excel that are not related to the interest of my project, including tax class at present, block, lot, easement, building class at present, address, apartment number, residential units, commercial units, total units, land square, feet, tax class at the time of sale, building class at the time of sale, and sale date.
I imported the updated five datasets into R in the following code and did further steps to clean the data.
library(dplyr)
library(readxl)
manhattan <- read_excel("Map Project/manhattan.xlsx",
skip = 1)
View(manhattan)
count(manhattan)
## # A tibble: 1 × 1
## n
## <int>
## 1 23615
sum(is.na(manhattan$`GROSS SQUARE FEET`))
## [1] 21718
sum(is.na(manhattan$`YEAR BUILT`))
## [1] 2437
manhattan %>%
filter(`SALE PRICE` >=1000) %>%
subset(select = -c(`GROSS SQUARE FEET`,`YEAR BUILT`)) -> manhattan2
View(manhattan2)
After counting the amount of data for Manhattan in general, the amount of data in the GROSS SQUARE FEET column, and the YEAR BUILT column in dataset manhattan, I found that these two columns contain a large number of missing data which will significantly impact the results if I analyze data from these two columns. Therefore, I decided to exclude GROSS SQUARE FEET and YEAR BUILT columns in my dataset, using the select function.
Additionally, I have also realized that some properties were sold at an unreasonable low price that could not correctly reflect the real estate market in New York City. I conjecture that they might be sold between family members and friends or bestowed on someone. It is necessary to exclude these properties from my data analysis. Hens, I used the filter function to filter out any properties sold under $1,000 for dataset manhattan. The new dataset for Manhattan containing properties sold equal to or over $1,000 without column GROSS SQUARE FEET and column YEAR BUILT is called manhattan2.
Same issues have been observed in the other four datasets, representing properties sold in the Bronx, Brooklyn, Queens, and Staten Island. Many missing data are noticed for columns GROSS SQUARE FEET and YEAR BUILT and properties sold at significantly low prices.
library(readxl)
bronx <- read_excel("Map Project/bronx.xlsx",
skip = 1)
View(bronx)
count(bronx)
## # A tibble: 1 × 1
## n
## <int>
## 1 7490
sum(is.na(bronx$`GROSS SQUARE FEET`))
## [1] 1929
sum(is.na(bronx$`YEAR BUILT`))
## [1] 631
bronx %>%
filter(`SALE PRICE` >=1000) %>%
subset(select = -c(`GROSS SQUARE FEET`,`YEAR BUILT`)) -> bronx2
View(bronx2)
With 7,490 amounts of data in dataset Bronx, there are 1,929 missing data for GROSS SQUARE FEET and 631 missing data for YEAR BUILT.
library(readxl)
brooklyn <- read_excel("Map Project/brooklyn.xlsx",
skip = 1)
View(brooklyn)
count(brooklyn)
## # A tibble: 1 × 1
## n
## <int>
## 1 28398
sum(is.na(brooklyn$`GROSS SQUARE FEET`))
## [1] 11762
sum(is.na(brooklyn$`YEAR BUILT`))
## [1] 2314
brooklyn %>%
filter(`SALE PRICE` >=1000) %>%
subset(select = -c(`GROSS SQUARE FEET`,`YEAR BUILT`)) -> brooklyn2
View(brooklyn2)
Dataset brooklyn contains 28,398 amounts of data, with missing data of 11,762 for GROSS SQUARE FEET and 2,314 for YEAR BUILT.
library(readxl)
queens <- read_excel("Map Project/queens.xlsx",
skip = 1)
View(queens)
count(queens)
## # A tibble: 1 × 1
## n
## <int>
## 1 29219
sum(is.na(queens$`GROSS SQUARE FEET`))
## [1] 9900
sum(is.na(queens$`YEAR BUILT`))
## [1] 1210
queens %>%
filter(`SALE PRICE` >=1000) %>%
subset(select = -c(`GROSS SQUARE FEET`,`YEAR BUILT`)) -> queens2
View(queens2)
For Queens, there are 29,219 data in the dataset; however, having 9,900 missing data for GROSS SQUARE FEET and 1,210 missing data for YEAR BUILT.
library(readxl)
staten <- read_excel("Map Project/statenisland.xlsx",
skip = 1)
View(staten)
count(staten)
## # A tibble: 1 × 1
## n
## <int>
## 1 8977
sum(is.na(staten$`GROSS SQUARE FEET`))
## [1] 882
sum(is.na(staten$`YEAR BUILT`))
## [1] 484
staten %>%
filter(`SALE PRICE` >=1000) %>%
subset(select = -c(`GROSS SQUARE FEET`,`YEAR BUILT`)) -> staten2
View(staten2)
In Staten Island, 882 values are missing for GROSS SQUARE FEET, and 484 data are missing for YEAR BUILT, while including 8,877 data in the dataset staten.
Due to the same reasons as data for Manhattan, I used the select function to exclude columns of GROSS SQUARE FEET and YEAR BUILT and only comprised properties with sale price greater or equal to $1,000 in the other four datasets using the filter function. Four new datasets bronx2, brooklyn2, queens2, and statens2 are created.
In the code below, I used full_join to merge two datasets each time by columns of BOROUGH, NEIGHBORHOOD, BUILDING CLASS CATEGORY, ZIP CODE, and SALE PRICE.
data1 <- full_join(manhattan2,bronx2,
by=c("BOROUGH","NEIGHBORHOOD","BUILDING CLASS CATEGORY","ZIP CODE","SALE PRICE"))
View(data1)
data2 <- full_join(brooklyn2,queens2,
by=c("BOROUGH","NEIGHBORHOOD","BUILDING CLASS CATEGORY","ZIP CODE","SALE PRICE"))
View(data2)
data3 <- full_join(data1,data2,
by=c("BOROUGH","NEIGHBORHOOD","BUILDING CLASS CATEGORY","ZIP CODE","SALE PRICE"))
View(data3)
data <- full_join(data3,staten2,
by=c("BOROUGH","NEIGHBORHOOD","BUILDING CLASS CATEGORY","ZIP CODE","SALE PRICE"))
View(data)
data$BOROUGH <- factor(data$BOROUGH,
levels = c(1,2,3,4,5),
labels = c("Manhattan", "Bronx", "Brooklyn","Queens","Staten Island"))
View(data)
Dataset data1 includes data from manhattan2 and bronx2, and dataset data1 contains data from brooklyn2, queens2. Dataset data3 is a combination of data1 and data2. My merged dataset data embracing all information for five different boroughs in New York City is created after merging dataset data3 with dataset staten2. Finally, I assigned the names of the five boroughs in New York City to the matching number in the BOROUGH column: 1 for Manhattan, 2 for Bronx, 3 for Brooklyn, 4 for Queens, and 5 for Staten Island.
I loaded library writexl in R and used write_xlsx function to export the merged dataset data as an elsx file.
library(writexl)
write_xlsx(data, "C:\\Users\\winnieyang\\Desktop\\MEA\\Map Project\\data.xlsx")
data %>%
group_by(NEIGHBORHOOD,BOROUGH) %>%
mutate(`MEDIAN SALE PRICE`= median(`SALE PRICE`)) -> median
View(median)
median %>%
select(`NEIGHBORHOOD`, `BOROUGH`, `MEDIAN SALE PRICE`) %>%
group_by(`MEDIAN SALE PRICE`) %>%
arrange(desc(`MEDIAN SALE PRICE`)) -> median2
View(median2)
median2[!duplicated(median2$`MEDIAN SALE PRICE`),] %>%
head(20) -> median3
View(median3)
median3$`MEDIAN SALE PRICE` <- round(median3$`MEDIAN SALE PRICE`,digits = 0)
knitr::kable(median3)
| NEIGHBORHOOD | BOROUGH | MEDIAN SALE PRICE |
|---|---|---|
| JAVITS CENTER | Manhattan | 4341500 |
| COBBLE HILL | Brooklyn | 3700000 |
| LITTLE ITALY | Manhattan | 3372500 |
| HARLEM-WEST | Manhattan | 3000000 |
| CIVIC CENTER | Manhattan | 2905000 |
| TRIBECA | Manhattan | 2750000 |
| ROSSVILLE-PORT MOBIL | Staten Island | 2350000 |
| SOHO | Manhattan | 2314741 |
| FRESH KILLS | Staten Island | 2300000 |
| FASHION | Manhattan | 2180000 |
| FLATIRON | Manhattan | 1832500 |
| CITY ISLAND-PELHAM STRIP | Bronx | 1800000 |
| UPPER EAST SIDE (96-110) | Manhattan | 1762500 |
| HUNTS POINT | Bronx | 1650000 |
| GREENWICH VILLAGE-WEST | Manhattan | 1640000 |
| BOERUM HILL | Brooklyn | 1635000 |
| CHELSEA | Manhattan | 1630000 |
| DOWNTOWN-FULTON FERRY | Brooklyn | 1605000 |
| BROOKLYN HEIGHTS | Brooklyn | 1520000 |
| PARK SLOPE | Brooklyn | 1500000 |
The table above shows the top 20 most expensive neighborhoods in New York City based on the median sale price of properties between April 2021 to February 2022. More than half of the top 20 most expensive neighborhoods are located in Manhattan, with several in Brooklyn. Based on this table, there is not a single top 20 most expensive neighborhood located in Queens.
median %>%
select(`NEIGHBORHOOD`, `BOROUGH`, `MEDIAN SALE PRICE`) %>%
group_by(`MEDIAN SALE PRICE`) %>%
arrange(`MEDIAN SALE PRICE`) -> median4
View(median4)
median4[!duplicated(median4$`MEDIAN SALE PRICE`),] %>%
head(20) -> median5
View(median5)
median5$`MEDIAN SALE PRICE` <- round(median5$`MEDIAN SALE PRICE`,digits = 0)
knitr::kable(median5)
| NEIGHBORHOOD | BOROUGH | MEDIAN SALE PRICE |
|---|---|---|
| PARKCHESTER | Bronx | 200000 |
| BEDFORD PARK/NORWOOD | Bronx | 310000 |
| VAN CORTLANDT PARK | Bronx | 315000 |
| BRIARWOOD | Queens | 329000 |
| OAKLAND GARDENS | Queens | 330000 |
| GLEN OAKS | Queens | 345000 |
| KEW GARDENS | Queens | 350000 |
| RIVERDALE | Bronx | 363000 |
| ARROCHAR-SHORE ACRES | Staten Island | 368500 |
| BROAD CHANNEL | Queens | 392200 |
| KINGSBRIDGE/JEROME PARK | Bronx | 395000 |
| REGO PARK | Queens | 405000 |
| TOMPKINSVILLE | Staten Island | 425000 |
| HOWARD BEACH | Queens | 427000 |
| HILLCREST | Queens | 430000 |
| FLUSHING MEADOW PARK | Queens | 430500 |
| PORT IVORY | Staten Island | 435000 |
| FOREST HILLS | Queens | 445000 |
| CONEY ISLAND | Brooklyn | 450000 |
| INWOOD | Manhattan | 455000 |
According to the median sale price of properties between April 2021 to February 2022, I also created a table showing the top 20 less expensive neighborhoods in New York City. Half the neighborhood in this table is located in Queens, five of them in the Bronx. Only one neighborhood in Brooklyn and one neighborhood in Manhattan contain in this table, and they take the last two places.
Note: Area with a zip code of 10013 is excluded on the map Real State Price in Brooklyn. Zip code 10013 belongs to Manhattan, but one row of data is inputted incorrectly in Brooklyn in the original dataset. Therefore, I decided to add a filter on the zip code to delete the data, including the zip code 10013.
It is a fact that New York City is the most expensive city to live in in the U.S.; however, the range of sale prices of properties within the city is widely distributed. With Manhattan ranking first for the median sale price of 1250k, being the most expensive borough in The Big Apple, there are cheaper and more affordable options in the city: Staten Island and the Bronx, each having a median sale price of 610k and 605k.
Within each borough, the range of properties price in different areas and neighborhoods can be broad. For example, the median sale price of properties based on Manhattan’s zip codes ranges from 420k to 44,960k. The area with a zip code of 11224 has the lowest median sale price of 485k, while 11231 is the most expensive zip code, which is 1,515k in Brooklyn.
Future studies can focus on investigating factors affecting the sale price for an area or a neighborhood in New York City by looking at some of the most expensive and less expensive areas or neighborhoods and the facilities and infrastructures around them. Some of the possible aspects include public transportations (bus routes, subways, and train stations), major highways and roads, popular tourist destinations, central business districts, and the crime rate in each borough, area, and neighborhood.
median(data$`SALE PRICE`)
## [1] 815000
I used median function to calculate the median cost of $815,000 for New York City properties and use it into the introduction for this project.
manhattan_zipcode <- unique(manhattan2$`ZIP CODE`)
sort(manhattan_zipcode, decreasing = FALSE)
bronx_zipcode <- unique(bronx2$`ZIP CODE`)
sort(bronx_zipcode, decreasing = FALSE)
brooklyn_zipcode <- unique(brooklyn2$`ZIP CODE`)
sort(brooklyn_zipcode, decreasing = FALSE)
queens_zipcode <- unique(queens2$`ZIP CODE`)
sort(queens_zipcode, decreasing = FALSE)
staten_zipcode <- unique(staten2$`ZIP CODE`)
sort(staten_zipcode, decreasing = FALSE)
Using the unique function allowed me to remove duplicated elements and printed zip codes for five different boroughs. I also applied the sort function to sort them in ascending order. Having all zip codes for boroughs outputted in R benefited me in creating groups for zip codes in Tableau and added different colors based on boroughs when mapping.