library(readr)
library(tidyverse)
We’re taking this data and the expanded version below as our starting point.
https://catalog.data.gov/dataset/property-valuation-and-assessment-data-db7c2
“Real Estate Assessment Property data. The Department of Finance values properties every year as one step in calculating property tax bills.”
Here we load the data that we previously downloaded from the website above.
When we ran the first time there were 1,692 rows with parsing issues.
It was the first variable BBLE
which we assume is going to
be a primary key (Building Block Lot Easement) which was being read as a
double but then the ones with a letter after the ten digits (the
Easement) are getting forced to drop the letter so we have to specify
the column types to preserve BBLE
df <- read_csv("~/Documents/D698/Property_Valuation_and_Assessment_Data.csv",
col_types = cols(BBLE = col_character(), .default = col_guess()))
Here we look at the first ten records and confirm that we see the
full BBLE
.
# code to review parsing issues if any
#problems(df)
head(df,15)
## # A tibble: 15 × 40
## BBLE BORO BLOCK LOT EASEMENT OWNER BLDGCL TAXCLASS LTFRONT LTDEPTH EXT
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 10001… 1 16 3859 <NA> CHEN… R4 2 0 0 <NA>
## 2 10007… 1 73 28 <NA> NYC … V1 4 183 52 <NA>
## 3 10007… 1 73 29 <NA> NYC … Y7 4 90 500 <NA>
## 4 10002… 1 29 7504 <NA> <NA> R0 2 36 73 <NA>
## 5 10003… 1 36 12 <NA> NYC … Y7 4 534 604 <NA>
## 6 10007… 1 73 8 <NA> NYC … T2 4 238 597 <NA>
## 7 10017… 1 174 9031 <NA> <NA> V1 4 54 165 <NA>
## 8 10017… 1 176 118 <NA> <NA> V1 4 0 50 <NA>
## 9 10019… 1 198 1 E NYC … U7 3 0 0 <NA>
## 10 10020… 1 207 1 E NYC … U7 3 0 0 <NA>
## 11 10020… 1 208 19 E NYC … U7 3 0 0 <NA>
## 12 10020… 1 209 1001 <NA> 424 … RK 4 0 0 <NA>
## 13 10067… 1 676 3 G <NA> U9 3 0 0 <NA>
## 14 10067… 1 676 5 <NA> MTA … V1 4 800 538 <NA>
## 15 10022… 1 221 1 <NA> NYC … U1 3 161 461 <NA>
## # ℹ 29 more variables: STORIES <dbl>, FULLVAL <dbl>, AVLAND <dbl>, AVTOT <dbl>,
## # EXLAND <dbl>, EXTOT <dbl>, EXCD1 <dbl>, STADDR <chr>, POSTCODE <dbl>,
## # EXMPTCL <chr>, BLDFRONT <dbl>, BLDDEPTH <dbl>, AVLAND2 <dbl>, AVTOT2 <dbl>,
## # EXLAND2 <dbl>, EXTOT2 <dbl>, EXCD2 <dbl>, PERIOD <chr>, YEAR <chr>,
## # VALTYPE <chr>, Borough <chr>, Latitude <dbl>, Longitude <dbl>,
## # `Community Board` <dbl>, `Council District` <dbl>, `Census Tract` <dbl>,
## # BIN <dbl>, NTA <chr>, `New Georeferenced Column` <chr>
Before we manipulate the data we need to reduce from nearly 10
million records (9,845,857) by subsetting just the Class 2 properties,
and then properties in Manhattan. We’re still at 997,655 records so we
going to constrict our data further by limiting to just the zip code of
10019
and are left with 71,246 records.
df2 <- subset(df,TAXCLASS == "2")
df3 <- subset(df2,BORO == 1)
df4 <- subset(df3,POSTCODE == 10019)
Here we look at the first 50 rows and observe that the
BBLE
is made up of the BORO
BLOCK
LOT
and EASEMENT
.
You can go to the follow website to enter any borough block and lot number to see a map of it and the surrounding area. For example Block 1046 in Manhattan is between 55th and 56th Street and between 8th and 9th Avenue, and Lot 23 is a specific lot in that block.
https://propertyinformationportal.nyc.gov/
head(df4,10)
## # A tibble: 10 × 40
## BBLE BORO BLOCK LOT EASEMENT OWNER BLDGCL TAXCLASS LTFRONT LTDEPTH EXT
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 10106… 1 1060 7502 <NA> <NA> R0 2 150 200 <NA>
## 2 10104… 1 1045 61 <NA> 360 … D8 2 122 175 <NA>
## 3 10063… 1 638 19 <NA> WVH … C6 2 196 258 <NA>
## 4 10100… 1 1006 1219 <NA> <NA> R4 2 0 0 <NA>
## 5 10106… 1 1060 7502 <NA> <NA> R0 2 150 200 <NA>
## 6 10104… 1 1045 61 <NA> 360 … D8 2 122 175 <NA>
## 7 10104… 1 1048 7504 <NA> EBY … R0 2 163 200 <NA>
## 8 10104… 1 1043 7502 <NA> 300 … R0 2 150 100 <NA>
## 9 10104… 1 1047 1054 <NA> KIM … R4 2 0 0 <NA>
## 10 10104… 1 1047 1047 <NA> WU, … R4 2 0 0 <NA>
## # ℹ 29 more variables: STORIES <dbl>, FULLVAL <dbl>, AVLAND <dbl>, AVTOT <dbl>,
## # EXLAND <dbl>, EXTOT <dbl>, EXCD1 <dbl>, STADDR <chr>, POSTCODE <dbl>,
## # EXMPTCL <chr>, BLDFRONT <dbl>, BLDDEPTH <dbl>, AVLAND2 <dbl>, AVTOT2 <dbl>,
## # EXLAND2 <dbl>, EXTOT2 <dbl>, EXCD2 <dbl>, PERIOD <chr>, YEAR <chr>,
## # VALTYPE <chr>, Borough <chr>, Latitude <dbl>, Longitude <dbl>,
## # `Community Board` <dbl>, `Council District` <dbl>, `Census Tract` <dbl>,
## # BIN <dbl>, NTA <chr>, `New Georeferenced Column` <chr>
And here is just looking at Block/Lot 1046/23 in Manhattan:
df5 <- df4 %>%
filter(BLOCK == 1046) %>%
filter(LOT == 23)
df5
## # A tibble: 9 × 40
## BBLE BORO BLOCK LOT EASEMENT OWNER BLDGCL TAXCLASS LTFRONT LTDEPTH EXT
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 101046… 1 1046 23 <NA> 317 … D4 2 57 100 <NA>
## 2 101046… 1 1046 23 <NA> 317 … D4 2 57 100 <NA>
## 3 101046… 1 1046 23 <NA> 317 … D4 2 57 100 <NA>
## 4 101046… 1 1046 23 <NA> 317 … D4 2 57 100 <NA>
## 5 101046… 1 1046 23 <NA> 317 … D4 2 57 100 <NA>
## 6 101046… 1 1046 23 <NA> 317 … D4 2 57 100 <NA>
## 7 101046… 1 1046 23 <NA> 317 … D4 2 57 100 <NA>
## 8 101046… 1 1046 23 <NA> 317 … D4 2 57 100 <NA>
## 9 101046… 1 1046 23 <NA> 317 … D4 2 57 100 <NA>
## # ℹ 29 more variables: STORIES <dbl>, FULLVAL <dbl>, AVLAND <dbl>, AVTOT <dbl>,
## # EXLAND <dbl>, EXTOT <dbl>, EXCD1 <dbl>, STADDR <chr>, POSTCODE <dbl>,
## # EXMPTCL <chr>, BLDFRONT <dbl>, BLDDEPTH <dbl>, AVLAND2 <dbl>, AVTOT2 <dbl>,
## # EXLAND2 <dbl>, EXTOT2 <dbl>, EXCD2 <dbl>, PERIOD <chr>, YEAR <chr>,
## # VALTYPE <chr>, Borough <chr>, Latitude <dbl>, Longitude <dbl>,
## # `Community Board` <dbl>, `Council District` <dbl>, `Census Tract` <dbl>,
## # BIN <dbl>, NTA <chr>, `New Georeferenced Column` <chr>
I’m concerned about the YEAR
column because it looks as
if this data hasn’t been updated since “2018/19”. To determine if every
building has a “2018/19” record or if I should get a single record for
each building that is the most recent YEAR
available, I’ve
counted the unique Block-lot combinations before and after having subset
for the 2018/19 year and get 8,756 versus 8,402.
Since the delta is small, for our purposes for now we will only collect the records that have a year of 2018/19 before we write this to be used in other Code Chunks.
It may be that we abandon this dataset in favor of the expanded version which we will explore next.
# Count how many unique block-lot combinations there are
df4 %>%
distinct(BLOCK, LOT) %>%
count()
## # A tibble: 1 × 1
## n
## <int>
## 1 8756
# Count how many unique block-lot combinations there for year 2018/19
df4 %>%
filter(YEAR=='2018/19') %>%
distinct(BLOCK, LOT) %>%
count()
## # A tibble: 1 × 1
## n
## <int>
## 1 8402
df6 <- df4 %>%
filter(YEAR == '2018/19')
Here we save the data for later
write.csv(df6, file="PVAD.csv")
Here are the fields for reference
t(head(df6, n=1))
## [,1]
## BBLE "1010607502"
## BORO "1"
## BLOCK "1060"
## LOT "7502"
## EASEMENT NA
## OWNER NA
## BLDGCL "R0"
## TAXCLASS "2"
## LTFRONT "150"
## LTDEPTH "200"
## EXT NA
## STORIES "16"
## FULLVAL "0"
## AVLAND "0"
## AVTOT "0"
## EXLAND "0"
## EXTOT "0"
## EXCD1 NA
## STADDR "425 WEST 50 STREET"
## POSTCODE "10019"
## EXMPTCL NA
## BLDFRONT "0"
## BLDDEPTH "0"
## AVLAND2 NA
## AVTOT2 NA
## EXLAND2 NA
## EXTOT2 NA
## EXCD2 NA
## PERIOD "FINAL"
## YEAR "2018/19"
## VALTYPE "AC-TR"
## Borough "MANHATTAN"
## Latitude "40.76388"
## Longitude "-73.98985"
## Community Board "104"
## Council District "3"
## Census Tract "133"
## BIN "1026676"
## NTA "Clinton"
## New Georeferenced Column "POINT (-73.989853 40.763878)"