For module 2 we’ll look at techniques for dealing with big data, in particular binning strategies and the datashader library (which may well prove that we’ll never need to bin large data for visualization again).
To demonstrate these concepts we’ll use the PLUTO dataset published by New York City’s Department of City Planning. PLUTO contains data about every tax lot in New York City.
PLUTO data can be downloaded from the Department of City Planning’s website. Unzip the files into the same directory as this notebook, and you should be able to read them in using this (or very similar) code. Also take note of the data dictionary; it’ll come in handy for this assignment.
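Since the full file has more than ninety columns, it can help to read only the ones an analysis needs. The sketch below shows one way to do that in base R, assuming a small made-up CSV in place of the real download; the file name `pluto_toy.csv`, its values, and the choice of columns are illustrative only.

```r
# Toy stand-in for the PLUTO CSV; the file name and values are made up.
toy_path <- file.path(tempdir(), "pluto_toy.csv")
write.csv(
  data.frame(
    borough   = c("MN", "BK", "BX"),
    yearbuilt = c(1839, 1901, 1899),
    numfloors = c(4, 2, 2),
    address   = c("A STREET", "B STREET", "C AVENUE")
  ),
  toy_path,
  row.names = FALSE
)

# Read only the header row to learn the column names.
all_cols <- names(read.csv(toy_path, nrows = 1))

# "NULL" tells read.csv to skip a column; NA lets it guess the type.
keep <- c("yearbuilt", "numfloors")
col_classes <- ifelse(all_cols %in% keep, NA, "NULL")

pluto_small <- read.csv(toy_path, colClasses = col_classes)
print(pluto_small)
```

With the real data, the same pattern would apply to the unzipped `pluto_22v3.csv` referenced by a relative path, so the notebook does not depend on one machine’s folder layout.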
#Libraries
library(plotly)
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks plotly::filter(), stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(maps)
##
## Attaching package: 'maps'
##
## The following object is masked from 'package:purrr':
##
## map
library(ggthemes)
library(ggmap)
## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
##
## Attaching package: 'ggmap'
##
## The following object is masked from 'package:plotly':
##
## wind
library(mapdata)
#Import Dataset
ny <- read.csv("C:/Users/Ivant/Desktop/pluto_22v3.csv", header=TRUE)
head(ny)
## borough block lot cd bct2020 bctcb2020 ct2010 cb2010 schooldist council
## 1 MN 574 65 102 1006300 1.0063e+10 63 2001 2 3
## 2 BK 3435 45 304 3041100 3.0411e+10 411 1000 32 37
## 3 BK 3447 29 304 3041100 3.0411e+10 411 1002 32 37
## 4 BX 2514 10 204 2019300 2.0193e+10 193 4001 9 8
## 5 MN 482 7501 102 1004500 1.0045e+10 45 1006 2 1
## 6 BK 3434 8 304 3041100 3.0411e+10 411 2000 32 37
## zipcode firecomp policeprct healthcenterdistrict healtharea sanitboro
## 1 10011 E033 6 15 5700 1
## 2 11207 Q252 83 34 3500 3
## 3 11207 Q252 83 34 3500 3
## 4 10452 E068 44 23 3310 2
## 5 10013 E055 5 15 6800 1
## 6 11207 Q252 83 34 3500 3
## sanitdistrict sanitsub address zonedist1 zonedist2 zonedist3
## 1 2 3A 41 WEST 10 STREET R6
## 2 4 3B 177 COOPER STREET R6
## 3 4 3B 222 MOFFAT STREET M1-1
## 4 4 2A 1082 OGDEN AVENUE R7-1
## 5 2 1A 406 BROOME STREET C6-2
## 6 4 3A 1114 DECATUR STREET R6
## zonedist4 overlay1 overlay2 spdist1 spdist2 spdist3 ltdheight splitzone
## 1 NA N
## 2 NA N
## 3 NA N
## 4 C1-4 NA N
## 5 NA N
## 6 NA N
## bldgclass landuse easements ownertype ownername lotarea
## 1 C6 2 0 UNAVAILABLE OWNER 2321
## 2 B2 1 0 WASHINGTON, TIFFANIE L 2000
## 3 B1 1 0 JDC HOME INC. 2000
## 4 A1 1 0 PINEDA, ARELIS 2875
## 5 RB 5 0 LAFAYETTE COMMERCL CONDO 11750
## 6 B2 1 0 NEVELS CARRATHA 1666
## bldgarea comarea resarea officearea retailarea garagearea strgearea
## 1 6540 0 6540 0 0 0 0
## 2 1800 0 1800 0 0 0 0
## 3 2200 0 2200 0 0 0 0
## 4 1710 0 1710 0 0 0 0
## 5 74349 74349 0 9849 64500 0 0
## 6 2250 0 1500 0 0 0 0
## factryarea otherarea areasource numbldgs numfloors unitsres unitstotal
## 1 0 0 2 1 4 5 5
## 2 0 0 2 1 2 2 2
## 3 0 0 2 1 2 2 2
## 4 0 0 2 1 2 1 1
## 5 0 0 2 1 7 0 25
## 6 0 0 2 1 2 2 2
## lotfront lotdepth bldgfront bldgdepth ext proxcode irrlotcode lottype
## 1 24.50 94.75 25.00 80.00 E 3 N 5
## 2 20.00 100.00 20.00 45.00 N 3 N 5
## 3 20.00 100.00 20.00 55.00 N 3 N 5
## 4 25.00 115.00 19.00 42.67 N 1 N 5
## 5 149.50 100.42 0.00 0.00 0 Y 5
## 6 16.67 100.00 16.67 45.00 N 3 N 5
## bsmtcode assessland assesstot exempttot yearbuilt yearalter1 yearalter2
## 1 1 567000 2223000 0 1839 1989 0
## 2 2 12300 74940 1460 1901 0 0
## 3 2 11940 75780 0 1910 0 0
## 4 2 14340 34560 0 1899 0 0
## 5 5 1143002 6811650 0 1900 0 0
## 6 1 12000 70380 1460 1899 0 0
## histdist landmark builtfar residfar commfar
## 1 Greenwich Village Historic District 2.82 2.43 0
## 2 0.90 2.43 0
## 3 1.10 0.00 1
## 4 0.59 3.44 0
## 5 SoHo-Cast Iron Historic District Extension 6.33 6.02 6
## 6 1.35 2.43 0
## facilfar borocode bbl condono tract2010 xcoord ycoord zonemap zmcode
## 1 4.8 1 1005740065 NA 63 984990 206856 12c
## 2 4.8 3 3034350045 NA 411 1010129 190246 17c
## 3 2.4 3 3034470029 NA 411 1010530 190085 17c
## 4 4.8 2 2025140010 NA 193 1004213 243491 3b
## 5 6.5 1 1004827501 463 45 984876 202030 12c
## 6 4.8 3 3034340008 NA 411 1009535 189618 17c
## sanborn taxmap edesignum appbbl appdate plutomapid firm07_flag
## 1 103 019 10207 NA 1 NA
## 2 309 021 31109 NA 1 NA
## 3 309 021 31109 NA 1 NA
## 4 210S041 20908 NA 1 NA
## 5 101N073 10206 E-130 1004821001 08/25/1988 1 NA
## 6 309 021 31109 NA 1 NA
## pfirm15_flag version dcpedited latitude longitude notes
## 1 NA 22v3 t 40.73445 -73.99733 NA
## 2 NA 22v3 40.68882 -73.90668 NA
## 3 NA 22v3 40.68838 -73.90524 NA
## 4 NA 22v3 40.83498 -73.92786 NA
## 5 NA 22v3 40.72120 -73.99774 NA
## 6 NA 22v3 40.68710 -73.90883 NA
names(ny) # Get column names
## [1] "borough" "block" "lot"
## [4] "cd" "bct2020" "bctcb2020"
## [7] "ct2010" "cb2010" "schooldist"
## [10] "council" "zipcode" "firecomp"
## [13] "policeprct" "healthcenterdistrict" "healtharea"
## [16] "sanitboro" "sanitdistrict" "sanitsub"
## [19] "address" "zonedist1" "zonedist2"
## [22] "zonedist3" "zonedist4" "overlay1"
## [25] "overlay2" "spdist1" "spdist2"
## [28] "spdist3" "ltdheight" "splitzone"
## [31] "bldgclass" "landuse" "easements"
## [34] "ownertype" "ownername" "lotarea"
## [37] "bldgarea" "comarea" "resarea"
## [40] "officearea" "retailarea" "garagearea"
## [43] "strgearea" "factryarea" "otherarea"
## [46] "areasource" "numbldgs" "numfloors"
## [49] "unitsres" "unitstotal" "lotfront"
## [52] "lotdepth" "bldgfront" "bldgdepth"
## [55] "ext" "proxcode" "irrlotcode"
## [58] "lottype" "bsmtcode" "assessland"
## [61] "assesstot" "exempttot" "yearbuilt"
## [64] "yearalter1" "yearalter2" "histdist"
## [67] "landmark" "builtfar" "residfar"
## [70] "commfar" "facilfar" "borocode"
## [73] "bbl" "condono" "tract2010"
## [76] "xcoord" "ycoord" "zonemap"
## [79] "zmcode" "sanborn" "taxmap"
## [82] "edesignum" "appbbl" "appdate"
## [85] "plutomapid" "firm07_flag" "pfirm15_flag"
## [88] "version" "dcpedited" "latitude"
## [91] "longitude" "notes"
summary(ny$yearbuilt) #Summary Stats
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 1920 1930 1848 1959 2022 415
## Part 1: Binning and Aggregation
Binning is a common strategy for visualizing large datasets. Binning is inherent to a few types of visualizations, such as histograms and 2D histograms (also check out their close relatives, 2D density plots, and the more general form, heatmaps).
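To make the idea of a 2D histogram concrete, here is a minimal base R sketch that bins two variables on a grid and counts the points in each cell. The data are random numbers rather than PLUTO values, and the 5 × 5 grid is an arbitrary choice.

```r
set.seed(1)  # made-up data, not PLUTO values
x <- runif(1000, 0, 10)
y <- runif(1000, 0, 10)

# Cut each axis into 5 equal-width bins, then cross-tabulate the two.
x_bin <- cut(x, breaks = seq(0, 10, by = 2), include.lowest = TRUE)
y_bin <- cut(y, breaks = seq(0, 10, by = 2), include.lowest = TRUE)
counts <- table(x_bin, y_bin)

print(counts)
```

A heatmap of these counts is the 2D histogram: however many points go in, only 25 numbers have to be drawn.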
While these visualization types explicitly include binning, any type of visualization used with aggregated data can be looked at in the same way. For example, let’s say we wanted to look at building construction over time. This would be best viewed as a line graph, but we can still think of our results as being binned by year:
# Keep only the columns used in this notebook
ny <- ny %>%
  select(yearbuilt, numfloors, zipcode, bbl, address, assesstot, assessland)
# Count the lots built in each year
nyc_b <- ny %>%
  filter(yearbuilt >= 1850, yearbuilt < 2022) %>%
  select(yearbuilt, bbl, numfloors) %>%
  group_by(yearbuilt) %>%
  summarize(count = sum(!is.na(bbl)))
chart <- ggplot(nyc_b, aes(x = yearbuilt, y = count)) +
  geom_line()
chart <- ggplotly(chart)
chart
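The claim that a per-year aggregate is a form of binning can be checked directly. The sketch below, using a handful of made-up construction years, compares a plain count per year with a histogram whose bins are one year wide.

```r
# Made-up construction years, not PLUTO values.
yearbuilt <- c(1900, 1900, 1901, 1903, 1903, 1903, 1905)

# Aggregating: count rows per distinct year.
by_year <- table(yearbuilt)

# Binning: a histogram with one-year-wide bins centred on each year.
h <- hist(yearbuilt, breaks = seq(1899.5, 1905.5, by = 1), plot = FALSE)

print(by_year)
print(h$counts)
```

The non-zero counts agree. The one difference is that the histogram keeps the empty years (1902 and 1904) as zeros, while a grouped count simply omits them, so a line drawn through grouped counts interpolates across any gaps.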
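nyc_b_d <- ny %>%
  filter(yearbuilt >= 1850, yearbuilt < 2022) %>%
  select(yearbuilt, bbl, numfloors) %>%
  # ceiling() labels each bin by its upper edge: 1851-1860 -> 1860,
  # so the 1850 bin holds only lots built in 1850 itself
  mutate(decadebuilt = ceiling(yearbuilt / 10) * 10) %>%
  group_by(decadebuilt) %>%
  summarize(Lots_Built = sum(!is.na(bbl)))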
head(nyc_b_d)
## # A tibble: 6 × 2
## decadebuilt Lots_Built
## <dbl> <int>
## 1 1850 250
## 2 1860 1806
## 3 1870 1599
## 4 1880 3058
## 5 1890 5647
## 6 1900 31359
chart <- ggplot(nyc_b_d, aes(x = decadebuilt, y = Lots_Built)) +
  geom_bar(stat = "identity")
chart <- ggplotly(chart)
chart
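How a year is assigned to a decade depends on the rounding function. The sketch below compares the `ceiling()` rule used above with a `floor()` alternative on a few hand-picked years; the `floor()` version is not part of the original analysis and is shown only for contrast.

```r
years <- c(1850, 1851, 1859, 1860, 1861)

bins <- data.frame(
  yearbuilt   = years,
  ceiling_bin = ceiling(years / 10) * 10,  # bin labelled by its last year
  floor_bin   = floor(years / 10) * 10     # bin labelled by its first year
)
print(bins)
```

With `ceiling()`, 1851 through 1860 all land in the bin labelled 1860, and because of the `yearbuilt >= 1850` filter the bin labelled 1850 contains only the year 1850, which is why it is so small in the table above. With `floor()`, 1850 through 1859 would share the label 1850, matching the everyday meaning of "the 1850s".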
### Question
After a few building collapses, the City of New York is going to begin investigating older buildings for safety. The city is particularly worried about buildings that were unusually tall when they were built, since best practices for safety hadn’t yet been determined. Create a graph that shows how many buildings of a certain number of floors were built in each year (note: you may want to use a log scale for the number of buildings). Find a strategy to bin buildings. (It should be clear when 20-29-story buildings, 30-39-story buildings, and 40-49-story buildings were first built in large numbers, but does it make sense to continue in this way as you get taller?)
nyc_fl <- ny %>%
  filter(yearbuilt >= 1850, yearbuilt < 2022) %>%
  select(yearbuilt, bbl, numfloors) %>%
  # Label each lot by the bottom of its ten-floor bin: 20-29 -> 20, 30-39 -> 30, ...
  mutate(numfloors = floor(numfloors / 10) * 10) %>%
  group_by(yearbuilt, numfloors) %>%
  count() %>%
  filter(numfloors >= 20, numfloors <= 70) %>%
  group_by(yearbuilt, numfloors) %>%
  summarise(floor_count = sum(n))
## `summarise()` has grouped output by 'yearbuilt'. You can override using the
## `.groups` argument.
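# Colour by floor bin and use a log scale so rare tall buildings stay visible
chart <- ggplot(nyc_fl, aes(yearbuilt, floor_count, colour = factor(numfloors))) +
  geom_point() +
  scale_y_log10()
chart <- ggplotly(chart)
chart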
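The question hints that equal ten-floor bins stop making sense for very tall buildings, where each bin would hold only a few lots. One option is to let the bins widen with height. The sketch below uses base R’s `cut()` on made-up floor counts; the break points are assumptions chosen for illustration, not values derived from PLUTO.

```r
# Made-up floor counts, not PLUTO values.
numfloors <- c(21, 29, 30, 38, 45, 52, 66, 85, 104)

# Ten-floor bins up to 50, then wider bins where buildings are rare.
# These break points are illustrative assumptions.
breaks <- c(20, 30, 40, 50, 70, Inf)
labels <- c("20-29", "30-39", "40-49", "50-69", "70+")

# right = FALSE makes each bin include its lower edge: [20, 30), [30, 40), ...
floor_bin <- cut(numfloors, breaks = breaks, labels = labels, right = FALSE)

print(table(floor_bin))
```

Because `cut()` returns a factor, the result can be mapped straight to colour in a plot, and the bin labels document the chosen ranges without extra bookkeeping.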
### Question
You work for a real estate developer and are researching underbuilt areas of the city. After looking in the PLUTO data dictionary, you’ve discovered that all tax assessments consist of two parts: the assessment of the land and the assessment of the structure. You reason that there should be a correlation between these two values: more valuable land will have more valuable structures on it (more valuable in this case refers not just to a mansion vs. a bungalow, but to an apartment tower vs. a single-family home). Deviations from the norm could represent underbuilt or overbuilt areas of the city. You also recently read a really cool blog post about bivariate choropleth maps, and think the technique could be used for this problem.
Datashader is really cool, but it’s not that great at labeling your visualization. Don’t worry about providing a legend, but provide a quick explanation as to which areas of the city are overbuilt, which areas are underbuilt, and which areas are built in a way that’s properly correlated with their land value.
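One way to turn "deviations from the norm" into something measurable is to fit building value against land value and look at the residuals. The sketch below does this on a log scale with five made-up areas. The data, the log-log model, and the sign-based labels are all assumptions for illustration, not the method the assignment prescribes.

```r
# Made-up assessments per area, not PLUTO values.
toy <- data.frame(
  area       = c("A", "B", "C", "D", "E"),
  assessland = c(100, 200, 400, 800, 1600),
  assessbldg = c(300, 600, 300, 2400, 4800)
)

# Fit building value against land value on a log scale.
fit <- lm(log(assessbldg) ~ log(assessland), data = toy)

# Positive residual: more structure than the land value predicts (overbuilt).
# Negative residual: less structure than predicted (underbuilt).
toy$resid  <- residuals(fit)
toy$status <- ifelse(toy$resid > 0, "overbuilt", "underbuilt")

print(toy[order(toy$resid), c("area", "resid", "status")])
```

Area C, whose building value is far below what its land value would suggest, gets the most negative residual. In practice a tolerance band around zero would be a sensible addition, so that areas close to the fitted line are labelled as properly correlated rather than forced into one of the two groups.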
ny_tax <- ny %>%
  select(assesstot, assessland, zipcode) %>%
  # Building value is the total assessment minus the land assessment
  mutate(assessbldg = assesstot - assessland) %>%
  group_by(zipcode) %>%
  summarise(
    total_tot  = sum(assesstot),
    total_land = sum(assessland),
    total_bldg = sum(assessbldg)
  ) %>%
  mutate(BldgtoLand = total_bldg / total_land) %>%
  # rank 1 is the zip code with the lowest total land assessment
  mutate(rank = rank(total_land)) %>%
  arrange(rank) %>%
  filter(rank <= 10)
p <- ggplot(ny_tax, aes(x = zipcode, y = rank)) +
  geom_tile(aes(fill = BldgtoLand))
p <- ggplotly(p)
p
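The bivariate choropleth idea mentioned in the question comes down to binning two variables separately and combining the bin labels into a single class. The sketch below builds a 3 × 3 classification from terciles on made-up per-area values; the helper `tercile()` and the class labels are hypothetical names introduced here.

```r
# Made-up per-area values, not PLUTO values.
toy <- data.frame(
  area       = paste0("Z", 1:9),
  assessland = c(10, 20, 30, 40, 50, 60, 70, 80, 90),
  assessbldg = c(90, 15, 50, 10, 55, 95, 20, 60, 85)
)

# Split a variable into terciles: 1 = low, 2 = middle, 3 = high.
tercile <- function(v) {
  cut(v, breaks = quantile(v, probs = c(0, 1/3, 2/3, 1)),
      labels = FALSE, include.lowest = TRUE)
}
toy$land_class <- tercile(toy$assessland)
toy$bldg_class <- tercile(toy$assessbldg)

# Combine both classes into one of nine cells, written "land-building".
toy$bi_class <- paste(toy$land_class, toy$bldg_class, sep = "-")
print(toy)
```

Read this way, a class such as "3-1" (high land value, low building value) marks a candidate underbuilt area, "1-3" marks an overbuilt one, and the diagonal classes "1-1", "2-2" and "3-3" are areas where the two values move together. Each of the nine classes would then get its own colour on the map.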