library(readr)
library(tidyverse)
library(readxl)
We should be able to subset for properties that have sold in the last 12 months to train and test our model
https://www.nyc.gov/site/finance/property/property-rolling-sales-data.page
“The Department of Finance’s rolling sales files list tax class 1, 2, and 4 properties that have sold in the last 12-month period in New York City. These files include the neighborhood, building type, square footage, and other data.”
Here we load the sales data that we previously downloaded from the website above.
The excel file read with zero parsing issues.
df <- read_excel("~/Documents/D698/rollingsales_manhattan.xlsx", skip = 4)
#No more columns with parsing issues
# code to review parsing issues if any
#problems <- problems(df)
#problems %>%
# distinct(col) %>%
# pull(col)
Here we filter the data to only look at properties in Manhattan.
This also has a more complex tax class structure than what we’ve seen
before, so we’re going to filter for tax classes that include
2
2A
2B
or 2C
.
# Subset the data
df2 <- df %>%
filter(`BOROUGH` == "1",
`TAX CLASS AT PRESENT` %in% c("2", "2A", "2B", "2C"))
# Check for non standard tax class designations
df %>%
distinct(`TAX CLASS AT PRESENT`) %>%
pull(`TAX CLASS AT PRESENT`)
## [1] "1" "2B" "2" "2A" "2C" "4" "1C" "1A"
Here we save the data for later
#No more writing to csv
#write.csv(df2, file="PVADex.csv")
saveRDS(df2, file="SALES.rds")
Here are the fields for reference
t(head(df, n=1))
## [,1]
## BOROUGH "1"
## NEIGHBORHOOD "ALPHABET CITY"
## BUILDING CLASS CATEGORY "01 ONE FAMILY DWELLINGS"
## TAX CLASS AT PRESENT "1"
## BLOCK "376"
## LOT "43"
## EASEMENT NA
## BUILDING CLASS AT PRESENT "S1"
## ADDRESS "743 EAST 6 STREET"
## APARTMENT NUMBER NA
## ZIP CODE "10009"
## RESIDENTIAL UNITS "1"
## COMMERCIAL UNITS "1"
## TOTAL UNITS "2"
## LAND SQUARE FEET "2090"
## GROSS SQUARE FEET "3680"
## YEAR BUILT "1940"
## TAX CLASS AT TIME OF SALE "1"
## BUILDING CLASS AT TIME OF SALE "S1"
## SALE PRICE "0"
## SALE DATE "2025-01-23"