Assignment

For module 2 we’ll be looking at techniques for dealing with big data. In particular binning strategies and the datashader library (which possibly proves we’ll never need to bin large data for visualization ever again.) To demonstrate these concepts we’ll be looking at the PLUTO dataset put out by New York City’s department of city planning. PLUTO contains data about every tax lot in New York City. PLUTO data can be downloaded from here. Unzip them to the same directory as this notebook, and you should be able to read them in using this (or very similar) code. Also take note of the data dictionary, it’ll come in handy for this assignment.

Libraries

Load Dataset

bk <- read.csv('BK2017V11.csv', header=TRUE) # Brooklyn
bx <- read.csv('BX2017V11.csv', header=TRUE) # The Bronx
mn <- read.csv('MN2017V11.csv', header=TRUE) # Manhattan
qn <- read.csv('QN2017V11.csv', header=TRUE) # Queens
si <- read.csv('SI2017V11.csv', header=TRUE) # Staten Island


# Combine all .csv to a single file

all_PLUTO_data <- bind_rows(bk, bx, mn, qn, si)


all_PLUTO_data <- read.csv('pluto.csv', header=TRUE)

# Get column names

names(all_PLUTO_data)

##  [1] "borough"              "block"                "lot"                 
##  [4] "cd"                   "ct2010"               "cb2010"              
##  [7] "schooldist"           "council"              "zipcode"             
## [10] "firecomp"             "policeprct"           "healthcenterdistrict"
## [13] "healtharea"           "sanitboro"            "sanitdistrict"       
## [16] "sanitsub"             "address"              "zonedist1"           
## [19] "zonedist2"            "zonedist3"            "zonedist4"           
## [22] "overlay1"             "overlay2"             "spdist1"             
## [25] "spdist2"              "spdist3"              "ltdheight"           
## [28] "splitzone"            "bldgclass"            "landuse"             
## [31] "easements"            "ownertype"            "ownername"           
## [34] "lotarea"              "bldgarea"             "comarea"             
## [37] "resarea"              "officearea"           "retailarea"          
## [40] "garagearea"           "strgearea"            "factryarea"          
## [43] "otherarea"            "areasource"           "numbldgs"            
## [46] "numfloors"            "unitsres"             "unitstotal"          
## [49] "lotfront"             "lotdepth"             "bldgfront"           
## [52] "bldgdepth"            "ext"                  "proxcode"            
## [55] "irrlotcode"           "lottype"              "bsmtcode"            
## [58] "assessland"           "assesstot"            "exempttot"           
## [61] "yearbuilt"            "yearalter1"           "yearalter2"          
## [64] "histdist"             "landmark"             "builtfar"            
## [67] "residfar"             "commfar"              "facilfar"            
## [70] "borocode"             "bbl"                  "condono"             
## [73] "tract2010"            "xcoord"               "ycoord"              
## [76] "zonemap"              "zmcode"               "sanborn"             
## [79] "taxmap"               "edesignum"            "appbbl"              
## [82] "appdate"              "plutomapid"           "firm07_flag"         
## [85] "pfirm15_flag"         "version"              "dcpedited"           
## [88] "latitude"             "longitude"            "notes"

Part 1: Binning and Aggregation

Binning is a common strategy for visualizing large datasets. Binning is inherent to a few types of visualizations, such as histograms and 2D histograms (also check out their close relatives: 2D density plots and the more general form: heatmaps. While these visualization types explicitly include binning, any type of visualization used with aggregated data can be looked at in the same way. For example, lets say we wanted to look at building construction over time. This would be best viewed as a line graph, but we can still think of our results as being binned by year:

ny <- all_PLUTO_data %>% 
  select(yearbuilt, numfloors, zipcode, bbl, address, assesstot, assessland)


nyc_b <- ny %>%
  filter(yearbuilt >= 1850, yearbuilt < 2020) %>%
  select(yearbuilt, bbl, numfloors) %>%
  group_by(yearbuilt) %>% 
  summarize(count =sum(!is.na(bbl)))

chart <- ggplot(nyc_b, aes(x=yearbuilt, y=count)) +
  geom_line()

chart<-ggplotly(chart)

chart

Question 1

After a few building collapses, the City of New York is going to begin investigating older buildings for safety. The city is particularly worried about buildings that were unusually tall when they were built, since best-practices for safety hadn’t yet been determined. Create a graph that shows how many buildings of a certain number of floors were built in each year (note: you may want to use a log scale for the number of buildings). Find a strategy to bin buildings (It should be clear 20-29-story buildings, 30-39-story buildings, and 40-49-story buildings were first built in large numbers, but does it make sense to continue in this way as you get taller?)

nyc_b_d <- ny %>%
  filter(yearbuilt >= 1850, yearbuilt < 2020) %>% 
  select(yearbuilt, bbl, numfloors) %>%
  mutate(decadebuilt = ceiling(yearbuilt/10)*10) %>% 
  group_by(decadebuilt) %>% 
  summarize(Lots_Built =sum(!is.na(bbl)))

head(nyc_b_d)

## # A tibble: 6 x 2
##   decadebuilt Lots_Built
##         <dbl>      <int>
## 1        1850        253
## 2        1860       1788
## 3        1870       1592
## 4        1880       3037
## 5        1890       5608
## 6        1900      31552

chart <- ggplot(nyc_b_d, aes(x=decadebuilt, y=Lots_Built)) +
     geom_bar(stat="identity")


chart <- ggplotly(chart)
  
chart

nyc_fl <- ny %>%
  filter(yearbuilt >= 1850, yearbuilt < 2020) %>% 
  select(yearbuilt, bbl, numfloors) %>%
  mutate(numfloors= round(numfloors,-1)) %>% 
  group_by(yearbuilt, numfloors) %>% 
  count() %>% 
  filter(numfloors >=20, numfloors <= 70) %>% 
  group_by(yearbuilt, numfloors) %>%
  summarise(floor_count = sum(n))

  
chart <- ggplot(nyc_fl, aes(yearbuilt,floor_count)) + geom_point(stat="identity")

chart <- ggplotly(chart)

chart

Question 2

You work for a real estate developer and are researching underbuilt areas of the city. After looking in the Pluto data dictionary, you’ve discovered that all tax assessments consist of two parts: The assessment of the land and assessment of the structure. You reason that there should be a correlation between these two values: more valuable land will have more valuable structures on them (more valuable in this case refers not just to a mansion vs a bungalow, but an apartment tower vs a single family home). Deviations from the norm could represent underbuilt or overbuilt areas of the city. You also recently read a really cool blog post about bivariate choropleth maps, and think the technique could be used for this problem. Datashader is really cool, but it’s not that great at labeling your visualization. Don’t worry about providing a legend, but provide a quick explanation as to which areas of the city are overbuilt, which areas are underbuilt, and which areas are built in a way that’s properly correlated with their land value.

ny_tax <- ny %>% 
  select(assesstot, assessland,  zipcode) %>% 
  mutate(assessbldg = assesstot - assessland) %>% 
  mutate(zipStr = toString(zipcode)) %>% 
  group_by(zipcode) %>% 
  summarise(sum(assesstot), sum(assessland), sum(assessbldg)) %>% 
  mutate(BldgtoLand = `sum(assessbldg)`/`sum(assessland)`) %>% 
  mutate(rank = rank(`sum(assessland)`)) %>% 
  arrange(rank) %>% 
  filter(rank <= 10)

p <- ggplot(ny_tax, aes(x=zipcode, y=rank)) +
  geom_tile(aes(fill=BldgtoLand))
 

p <- ggplotly(p)
  
p

Data 608 Module 2

Tony Mei

9/20/2020