library(tidyverse)

Merging Data

In this coding chunk we merge the three datasets.

We’ll add any records that match from PLUTO to PVAD into our new main database which will start from PVAD.

Then separately we’ll add number of units from PVAD to the Sales data so we can calculate the average price per square foot of the building, which we can add as an additional feature to our new main database.


Load Data Sets

# Load the three data sets
pvad <- readRDS("~/Documents/D698/pvad2.rds")
pluto <- readRDS("~/Documents/D698/pluto2.rds")
sales <- readRDS("~/Documents/D698/sales2.rds")


Merge Pluto into PVAD

Confirm no duplicates in PVAD

First we need to make sure that there is a unique BLOCK-LOT combination for every record. In pvads case there are 8,871 with duplicate records.

pvad %>%
  count(BLOCK, LOT) %>%
  filter(n > 1)
## # A tibble: 8,871 × 3
##    BLOCK   LOT     n
##    <dbl> <dbl> <int>
##  1   638    19     2
##  2  1001    59     2
##  3  1001  7501     2
##  4  1003  1003     2
##  5  1003  1004     2
##  6  1003  1005     2
##  7  1003  1006     2
##  8  1003  1007     2
##  9  1003  1008     2
## 10  1003  1009     2
## # ℹ 8,861 more rows

Upon inspection of the first duplicate pair, FINMKTTOT is zero for one of them.

pvad %>%
  filter(BLOCK==638,LOT==19)
## # A tibble: 2 × 31
##        PARID  BORO BLOCK   LOT EASEMENT  YEAR FINMKTTOT FINTAXCLASS BLDG_CLASS
##        <dbl> <dbl> <dbl> <dbl> <lgl>    <dbl>     <dbl>       <dbl> <chr>     
## 1 1006380019     1   638    19 NA        2025         0          NA C6        
## 2 1006380019     1   638    19 NA        2025  12241000           2 C6        
## # ℹ 22 more variables: ZONING <chr>, ZIP_CODE <dbl>, LOT_FRT <dbl>,
## #   LOT_DEP <dbl>, LOT_IRREG <chr>, BLD_FRT <dbl>, BLD_DEP <dbl>,
## #   BLD_EXT <chr>, BLD_STORY <dbl>, CORNER <chr>, LAND_AREA <dbl>,
## #   NUM_BLDGS <dbl>, YRBUILT <dbl>, YRALT1 <dbl>, YRALT2 <dbl>,
## #   COOP_APTS <dbl>, UNITS <dbl>, CONDO_Number <dbl>, GROSS_SQFT <dbl>,
## #   OFFICE_AREA_GROSS <dbl>, RESIDENTIAL_AREA_GROSS <dbl>, FINTAXFLAG <chr>

So we’ll remove all records with FINMKTTOT equal to zero.

pvad2 <- pvad %>%
  filter(FINMKTTOT != 0)

And then recompare… And success! There are no duplicates now.

pvad2 %>%
  count(BLOCK, LOT) %>%
  filter(n > 1)
## # A tibble: 0 × 3
## # ℹ 3 variables: BLOCK <dbl>, LOT <dbl>, n <int>


Confirm no Duplicates in PLUTO

Again, we’ll use unique combinations of block and lot for each record… and success - there are no duplicate records by block and lot

pluto %>%
  count(block, lot) %>%
  filter(n > 1)
## # A tibble: 0 × 3
## # ℹ 3 variables: block <dbl>, lot <dbl>, n <int>


Merge PLUTO and PVAD

Here we use a left join to keep all the records in pvad2 and bring in matching records from pluto. Before we do that we uppercase block and lot in pluto so the column names that we’re joining by match.

Note we were missing 24% of zipcodes for the PLUTO data so by leftjoining to PVAD we’re taking advantage of the non-missingness in the PVAD data to screen the PLUTO data for the records we want.

# Make sure PLUTO column names are uppercase to match PVAD
pluto2 <- pluto %>%
  rename(BLOCK = block, LOT = lot)

# Merge using left_join
df <- pvad2 %>%
  left_join(pluto2, by = c("BLOCK", "LOT"))


Add Average Sale Price

Here we calculate average sale price from the Sales data before we merge it into our new main database.

Originally we were going to take average sale price, multiply it by the number of units in the building and divide by the building’s square footage, however all we are looking for is a proxy for shareholder financial sophistication or access to legal/accounting/tax resources and average sale price might be a clearer signal than if we were to multiply by number of units and divide by building’s square footage. There could be inconsistencies in how those fields are stored or examples of buildings with large squarefootage attributed to commercial space so fewer units.

First identify if there areany sale prices of zero or NA… and there are 186 of them.

sales %>%
  mutate(is_zero_or_na = is.na(`SALE PRICE`) | `SALE PRICE` == 0) %>%
  count(is_zero_or_na)
## # A tibble: 2 × 2
##   is_zero_or_na     n
##   <lgl>         <int>
## 1 FALSE           547
## 2 TRUE            186

Now we remove the records with a sale price of zero or NA so we are left with 547.

sales2 <- sales %>%
  filter(!is.na(`SALE PRICE`) & `SALE PRICE` != 0)

Here we take the average sale price and reduce to only records with unique BLOCK LOT combinations.

sales3 <- sales2 %>%
  group_by(BLOCK, LOT) %>%
  summarise(ASP = mean(`SALE PRICE`, na.rm = TRUE), .groups = "drop")

Now we left join the average sale price (or ASP) to our main dataset

df2 <- df %>%
  left_join(sales3, by = c("BLOCK", "LOT"))


Write Data

Here we write the file for use in our next code chunk

# Write file
saveRDS(df2, file="df.rds")