library(tidyverse)
In this coding chunk we merge the three datasets.
We’ll add any records that match from PLUTO to PVAD into our new main database which will start from PVAD.
Then separately we’ll add number of units from PVAD to the Sales data so we can calculate the average price per square foot of the building, which we can add as an additional feature to our new main database.
# Load the three data sets
pvad <- readRDS("~/Documents/D698/pvad2.rds")
pluto <- readRDS("~/Documents/D698/pluto2.rds")
sales <- readRDS("~/Documents/D698/sales2.rds")
First we need to make sure that there is a unique
BLOCK-LOT
combination for every record. In
pvad
s case there are 8,871 with duplicate records.
pvad %>%
count(BLOCK, LOT) %>%
filter(n > 1)
## # A tibble: 8,871 × 3
## BLOCK LOT n
## <dbl> <dbl> <int>
## 1 638 19 2
## 2 1001 59 2
## 3 1001 7501 2
## 4 1003 1003 2
## 5 1003 1004 2
## 6 1003 1005 2
## 7 1003 1006 2
## 8 1003 1007 2
## 9 1003 1008 2
## 10 1003 1009 2
## # ℹ 8,861 more rows
Upon inspection of the first duplicate pair, FINMKTTOT
is zero for one of them.
pvad %>%
filter(BLOCK==638,LOT==19)
## # A tibble: 2 × 31
## PARID BORO BLOCK LOT EASEMENT YEAR FINMKTTOT FINTAXCLASS BLDG_CLASS
## <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <dbl> <dbl> <chr>
## 1 1006380019 1 638 19 NA 2025 0 NA C6
## 2 1006380019 1 638 19 NA 2025 12241000 2 C6
## # ℹ 22 more variables: ZONING <chr>, ZIP_CODE <dbl>, LOT_FRT <dbl>,
## # LOT_DEP <dbl>, LOT_IRREG <chr>, BLD_FRT <dbl>, BLD_DEP <dbl>,
## # BLD_EXT <chr>, BLD_STORY <dbl>, CORNER <chr>, LAND_AREA <dbl>,
## # NUM_BLDGS <dbl>, YRBUILT <dbl>, YRALT1 <dbl>, YRALT2 <dbl>,
## # COOP_APTS <dbl>, UNITS <dbl>, CONDO_Number <dbl>, GROSS_SQFT <dbl>,
## # OFFICE_AREA_GROSS <dbl>, RESIDENTIAL_AREA_GROSS <dbl>, FINTAXFLAG <chr>
So we’ll remove all records with FINMKTTOT
equal to
zero.
pvad2 <- pvad %>%
filter(FINMKTTOT != 0)
And then recompare… And success! There are no duplicates now.
pvad2 %>%
count(BLOCK, LOT) %>%
filter(n > 1)
## # A tibble: 0 × 3
## # ℹ 3 variables: BLOCK <dbl>, LOT <dbl>, n <int>
Again, we’ll use unique combinations of block
and
lot
for each record… and success - there are no duplicate
records by block
and lot
pluto %>%
count(block, lot) %>%
filter(n > 1)
## # A tibble: 0 × 3
## # ℹ 3 variables: block <dbl>, lot <dbl>, n <int>
Here we use a left join to keep all the records in pvad2
and bring in matching records from pluto
. Before we do that
we uppercase block
and lot
in
pluto
so the column names that we’re joining by match.
Note we were missing 24% of zipcodes for the PLUTO data so by leftjoining to PVAD we’re taking advantage of the non-missingness in the PVAD data to screen the PLUTO data for the records we want.
# Make sure PLUTO column names are uppercase to match PVAD
pluto2 <- pluto %>%
rename(BLOCK = block, LOT = lot)
# Merge using left_join
df <- pvad2 %>%
left_join(pluto2, by = c("BLOCK", "LOT"))
Here we calculate average sale price from the Sales data before we merge it into our new main database.
Originally we were going to take average sale price, multiply it by the number of units in the building and divide by the building’s square footage, however all we are looking for is a proxy for shareholder financial sophistication or access to legal/accounting/tax resources and average sale price might be a clearer signal than if we were to multiply by number of units and divide by building’s square footage. There could be inconsistencies in how those fields are stored or examples of buildings with large squarefootage attributed to commercial space so fewer units.
First identify if there areany sale prices of zero or NA… and there are 186 of them.
sales %>%
mutate(is_zero_or_na = is.na(`SALE PRICE`) | `SALE PRICE` == 0) %>%
count(is_zero_or_na)
## # A tibble: 2 × 2
## is_zero_or_na n
## <lgl> <int>
## 1 FALSE 547
## 2 TRUE 186
Now we remove the records with a sale price of zero or NA so we are left with 547.
sales2 <- sales %>%
filter(!is.na(`SALE PRICE`) & `SALE PRICE` != 0)
Here we take the average sale price and reduce to only records with
unique BLOCK
LOT
combinations.
sales3 <- sales2 %>%
group_by(BLOCK, LOT) %>%
summarise(ASP = mean(`SALE PRICE`, na.rm = TRUE), .groups = "drop")
Now we left join the average sale price (or ASP
) to our
main dataset
df2 <- df %>%
left_join(sales3, by = c("BLOCK", "LOT"))
Here we write the file for use in our next code chunk
# Write file
saveRDS(df2, file="df.rds")