For module 2 we’ll be looking at techniques for dealing with big data, in particular binning strategies and the datashader library (which may well show that we never need to bin large data for visualization again).

To demonstrate these concepts we’ll be looking at the PLUTO dataset published by New York City’s Department of City Planning. PLUTO contains data about every tax lot in New York City.

PLUTO data can be downloaded from the Department of City Planning’s website. Unzip the files to the same directory as this notebook, and you should be able to read them in using this (or very similar) code. Also take note of the data dictionary; it’ll come in handy for this assignment.

#Libraries

library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.3.0      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ✔ purrr   1.0.1      
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks plotly::filter(), stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(maps)
## 
## Attaching package: 'maps'
## 
## The following object is masked from 'package:purrr':
## 
##     map
library(ggthemes)
library(ggmap)
## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
## 
## Attaching package: 'ggmap'
## 
## The following object is masked from 'package:plotly':
## 
##     wind
library(mapdata)

#Import Dataset

ny <- read.csv("C:/Users/Ivant/Desktop/pluto_22v3.csv", header=TRUE)
head(ny)
##   borough block  lot  cd bct2020  bctcb2020 ct2010 cb2010 schooldist council
## 1      MN   574   65 102 1006300 1.0063e+10     63   2001          2       3
## 2      BK  3435   45 304 3041100 3.0411e+10    411   1000         32      37
## 3      BK  3447   29 304 3041100 3.0411e+10    411   1002         32      37
## 4      BX  2514   10 204 2019300 2.0193e+10    193   4001          9       8
## 5      MN   482 7501 102 1004500 1.0045e+10     45   1006          2       1
## 6      BK  3434    8 304 3041100 3.0411e+10    411   2000         32      37
##   zipcode firecomp policeprct healthcenterdistrict healtharea sanitboro
## 1   10011     E033          6                   15       5700         1
## 2   11207     Q252         83                   34       3500         3
## 3   11207     Q252         83                   34       3500         3
## 4   10452     E068         44                   23       3310         2
## 5   10013     E055          5                   15       6800         1
## 6   11207     Q252         83                   34       3500         3
##   sanitdistrict sanitsub             address zonedist1 zonedist2 zonedist3
## 1             2       3A   41 WEST 10 STREET        R6                    
## 2             4       3B   177 COOPER STREET        R6                    
## 3             4       3B   222 MOFFAT STREET      M1-1                    
## 4             4       2A   1082 OGDEN AVENUE      R7-1                    
## 5             2       1A   406 BROOME STREET      C6-2                    
## 6             4       3A 1114 DECATUR STREET        R6                    
##   zonedist4 overlay1 overlay2 spdist1 spdist2 spdist3 ltdheight splitzone
## 1                                                  NA                   N
## 2                                                  NA                   N
## 3                                                  NA                   N
## 4               C1-4                               NA                   N
## 5                                                  NA                   N
## 6                                                  NA                   N
##   bldgclass landuse easements ownertype                ownername lotarea
## 1        C6       2         0                  UNAVAILABLE OWNER    2321
## 2        B2       1         0             WASHINGTON, TIFFANIE L    2000
## 3        B1       1         0                      JDC HOME INC.    2000
## 4        A1       1         0                     PINEDA, ARELIS    2875
## 5        RB       5         0           LAFAYETTE COMMERCL CONDO   11750
## 6        B2       1         0                    NEVELS CARRATHA    1666
##   bldgarea comarea resarea officearea retailarea garagearea strgearea
## 1     6540       0    6540          0          0          0         0
## 2     1800       0    1800          0          0          0         0
## 3     2200       0    2200          0          0          0         0
## 4     1710       0    1710          0          0          0         0
## 5    74349   74349       0       9849      64500          0         0
## 6     2250       0    1500          0          0          0         0
##   factryarea otherarea areasource numbldgs numfloors unitsres unitstotal
## 1          0         0          2        1         4        5          5
## 2          0         0          2        1         2        2          2
## 3          0         0          2        1         2        2          2
## 4          0         0          2        1         2        1          1
## 5          0         0          2        1         7        0         25
## 6          0         0          2        1         2        2          2
##   lotfront lotdepth bldgfront bldgdepth ext proxcode irrlotcode lottype
## 1    24.50    94.75     25.00     80.00   E        3          N       5
## 2    20.00   100.00     20.00     45.00   N        3          N       5
## 3    20.00   100.00     20.00     55.00   N        3          N       5
## 4    25.00   115.00     19.00     42.67   N        1          N       5
## 5   149.50   100.42      0.00      0.00            0          Y       5
## 6    16.67   100.00     16.67     45.00   N        3          N       5
##   bsmtcode assessland assesstot exempttot yearbuilt yearalter1 yearalter2
## 1        1     567000   2223000         0      1839       1989          0
## 2        2      12300     74940      1460      1901          0          0
## 3        2      11940     75780         0      1910          0          0
## 4        2      14340     34560         0      1899          0          0
## 5        5    1143002   6811650         0      1900          0          0
## 6        1      12000     70380      1460      1899          0          0
##                                     histdist landmark builtfar residfar commfar
## 1        Greenwich Village Historic District              2.82     2.43       0
## 2                                                         0.90     2.43       0
## 3                                                         1.10     0.00       1
## 4                                                         0.59     3.44       0
## 5 SoHo-Cast Iron Historic District Extension              6.33     6.02       6
## 6                                                         1.35     2.43       0
##   facilfar borocode        bbl condono tract2010  xcoord ycoord zonemap zmcode
## 1      4.8        1 1005740065      NA        63  984990 206856     12c       
## 2      4.8        3 3034350045      NA       411 1010129 190246     17c       
## 3      2.4        3 3034470029      NA       411 1010530 190085     17c       
## 4      4.8        2 2025140010      NA       193 1004213 243491      3b       
## 5      6.5        1 1004827501     463        45  984876 202030     12c       
## 6      4.8        3 3034340008      NA       411 1009535 189618     17c       
##   sanborn taxmap edesignum     appbbl    appdate plutomapid firm07_flag
## 1 103 019  10207                   NA                     1          NA
## 2 309 021  31109                   NA                     1          NA
## 3 309 021  31109                   NA                     1          NA
## 4 210S041  20908                   NA                     1          NA
## 5 101N073  10206     E-130 1004821001 08/25/1988          1          NA
## 6 309 021  31109                   NA                     1          NA
##   pfirm15_flag version dcpedited latitude longitude notes
## 1           NA    22v3         t 40.73445 -73.99733    NA
## 2           NA    22v3           40.68882 -73.90668    NA
## 3           NA    22v3           40.68838 -73.90524    NA
## 4           NA    22v3           40.83498 -73.92786    NA
## 5           NA    22v3           40.72120 -73.99774    NA
## 6           NA    22v3           40.68710 -73.90883    NA
names(ny) # Get column names
##  [1] "borough"              "block"                "lot"                 
##  [4] "cd"                   "bct2020"              "bctcb2020"           
##  [7] "ct2010"               "cb2010"               "schooldist"          
## [10] "council"              "zipcode"              "firecomp"            
## [13] "policeprct"           "healthcenterdistrict" "healtharea"          
## [16] "sanitboro"            "sanitdistrict"        "sanitsub"            
## [19] "address"              "zonedist1"            "zonedist2"           
## [22] "zonedist3"            "zonedist4"            "overlay1"            
## [25] "overlay2"             "spdist1"              "spdist2"             
## [28] "spdist3"              "ltdheight"            "splitzone"           
## [31] "bldgclass"            "landuse"              "easements"           
## [34] "ownertype"            "ownername"            "lotarea"             
## [37] "bldgarea"             "comarea"              "resarea"             
## [40] "officearea"           "retailarea"           "garagearea"          
## [43] "strgearea"            "factryarea"           "otherarea"           
## [46] "areasource"           "numbldgs"             "numfloors"           
## [49] "unitsres"             "unitstotal"           "lotfront"            
## [52] "lotdepth"             "bldgfront"            "bldgdepth"           
## [55] "ext"                  "proxcode"             "irrlotcode"          
## [58] "lottype"              "bsmtcode"             "assessland"          
## [61] "assesstot"            "exempttot"            "yearbuilt"           
## [64] "yearalter1"           "yearalter2"           "histdist"            
## [67] "landmark"             "builtfar"             "residfar"            
## [70] "commfar"              "facilfar"             "borocode"            
## [73] "bbl"                  "condono"              "tract2010"           
## [76] "xcoord"               "ycoord"               "zonemap"             
## [79] "zmcode"               "sanborn"              "taxmap"              
## [82] "edesignum"            "appbbl"               "appdate"             
## [85] "plutomapid"           "firm07_flag"          "pfirm15_flag"        
## [88] "version"              "dcpedited"            "latitude"            
## [91] "longitude"            "notes"
summary(ny$yearbuilt) #Summary Stats
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    1920    1930    1848    1959    2022     415

##Part 1: Binning and Aggregation

Binning is a common strategy for visualizing large datasets. Binning is inherent to a few types of visualizations, such as histograms and 2D histograms (also check out their close relatives, 2D density plots, and the more general form, heatmaps).
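To make the idea concrete, here is a minimal base R sketch of what a histogram and a 2D histogram compute under the hood. It uses synthetic uniform data rather than PLUTO, and the variable names (`x`, `y`, `x_breaks`, `y_breaks`) and bin widths are assumptions chosen only for illustration.

```r
# Synthetic stand-in for two numeric columns (NOT the PLUTO data).
set.seed(42)
x <- runif(1000, min = 0, max = 100)
y <- runif(1000, min = 0, max = 50)

# 1D binning: count observations in ten equal-width bins, as a histogram does.
x_breaks <- seq(0, 100, by = 10)
counts_1d <- hist(x, breaks = x_breaks, plot = FALSE)$counts

# 2D binning: cut each axis into bins, then cross-tabulate the bin labels.
y_breaks <- seq(0, 50, by = 10)
counts_2d <- table(
  x_bin = cut(x, breaks = x_breaks, include.lowest = TRUE),
  y_bin = cut(y, breaks = y_breaks, include.lowest = TRUE)
)

counts_1d
dim(counts_2d)   # 10 x-bins by 5 y-bins
```

Each cell of `counts_2d` is what a 2D histogram or heatmap would map to a fill color; summing its rows recovers the 1D counts.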

While these visualization types explicitly include binning, any type of visualization used with aggregated data can be looked at in the same way. For example, let’s say we wanted to look at building construction over time. This would be best viewed as a line graph, but we can still think of our results as being binned by year:

ny <- ny %>% 
  select(yearbuilt, numfloors, zipcode, bbl, address, assesstot, assessland)


nyc_b <- ny %>%
  filter(yearbuilt >= 1850, yearbuilt < 2022) %>%
  select(yearbuilt, bbl, numfloors) %>%
  group_by(yearbuilt) %>% 
  summarize(count =sum(!is.na(bbl)))

chart <- ggplot(nyc_b, aes(x=yearbuilt, y=count)) +
  geom_line()

chart<-ggplotly(chart)

chart
nyc_b_d <- ny %>%
  filter(yearbuilt >= 1850, yearbuilt < 2022) %>% 
  select(yearbuilt, bbl, numfloors) %>%
  mutate(decadebuilt = ceiling(yearbuilt/10)*10) %>% 
  group_by(decadebuilt) %>% 
  summarize(Lots_Built =sum(!is.na(bbl)))

head(nyc_b_d)
## # A tibble: 6 × 2
##   decadebuilt Lots_Built
##         <dbl>      <int>
## 1        1850        250
## 2        1860       1806
## 3        1870       1599
## 4        1880       3058
## 5        1890       5647
## 6        1900      31359
chart <- ggplot(nyc_b_d, aes(x=decadebuilt, y=Lots_Built)) +
     geom_bar(stat="identity")


chart <- ggplotly(chart)
  
chart
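One detail of the decade binning is worth spelling out. With `ceiling()`, a year is labeled with the decade it ends in, so after filtering to 1850 and later the 1850 bin holds only the single year 1850. The sketch below compares that with `floor()`, which labels a year with the decade it starts in. The example years are hypothetical and chosen only to show the difference; which convention is preferable is a judgment call.

```r
# Hypothetical example years, chosen only to show the labeling difference.
years <- c(1850, 1851, 1859, 1860, 1861)

# ceiling() labels a year with the decade it ends in (as in the chunk above).
decade_ceiling <- ceiling(years / 10) * 10

# floor() labels a year with the decade it starts in.
decade_floor <- floor(years / 10) * 10

data.frame(years, decade_ceiling, decade_floor)
```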

###Question

After a few building collapses, the City of New York is going to begin investigating older buildings for safety. The city is particularly worried about buildings that were unusually tall when they were built, since best practices for safety hadn’t yet been determined. Create a graph that shows how many buildings of a certain number of floors were built in each year (note: you may want to use a log scale for the number of buildings). Find a strategy to bin buildings. (It should be clear when 20-29-story buildings, 30-39-story buildings, and 40-49-story buildings were first built in large numbers, but does it make sense to continue in this way as you get taller?)

nyc_fl <- ny %>%
  filter(yearbuilt >= 1850, yearbuilt < 2022) %>% 
  select(yearbuilt, bbl, numfloors) %>%
  # floor() puts 20-29 floors in the 20 bin; round(numfloors, -1) would center bins on multiples of ten
  mutate(numfloors = floor(numfloors / 10) * 10) %>% 
  # keep only the tall-building bins (20 through 70 floors)
  filter(numfloors >= 20, numfloors <= 70) %>% 
  group_by(yearbuilt, numfloors) %>%
  summarise(floor_count = n())
## `summarise()` has grouped output by 'yearbuilt'. You can override using the
## `.groups` argument.
chart <- ggplot(nyc_fl, aes(yearbuilt,floor_count)) + geom_point(stat="identity")

chart <- ggplotly(chart)

chart
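The question also asks whether ten-floor bins still make sense for very tall buildings. One possible strategy, sketched below with base R's `cut()`, is to keep ten-floor bins up to 50 floors and widen them above that, where buildings are rare. The floor counts, break points, and labels are all hypothetical and not derived from PLUTO.

```r
# Hypothetical floor counts, NOT taken from PLUTO.
floors <- c(21, 29, 30, 45, 52, 68, 75, 104)

# Ten-floor bins up to 50, then wider bins where buildings become rare.
# These break points are an assumption for illustration, not a recommendation.
floor_breaks <- c(20, 30, 40, 50, 70, Inf)
floor_labels <- c("20-29", "30-39", "40-49", "50-69", "70+")

# right = FALSE makes each bin include its lower edge: [20, 30), [30, 40), ...
floor_bin <- cut(floors, breaks = floor_breaks, labels = floor_labels,
                 right = FALSE)

table(floor_bin)
```

A factor like `floor_bin` could then be mapped to color in the ggplot call, with `scale_y_log10()` added for the log-scaled count axis that the question suggests.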

###Question

You work for a real estate developer and are researching underbuilt areas of the city. After looking in the PLUTO data dictionary, you’ve discovered that all tax assessments consist of two parts: the assessment of the land and the assessment of the structure. You reason that there should be a correlation between these two values: more valuable land will have more valuable structures on it (more valuable in this case refers not just to a mansion vs a bungalow, but an apartment tower vs a single family home). Deviations from the norm could represent underbuilt or overbuilt areas of the city. You also recently read a really cool blog post about bivariate choropleth maps, and think the technique could be used for this problem.

Datashader is really cool, but it’s not that great at labeling your visualization. Don’t worry about providing a legend, but provide a quick explanation as to which areas of the city are overbuilt, which areas are underbuilt, and which areas are built in a way that’s properly correlated with their land value.
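One way to quantify "deviations from the norm" is to fit a trend of structure value against land value and look at the sign of each residual. The sketch below does this with `lm()` on a log-log scale. The six land and structure values are synthetic, and both the log transform and the simple positive/negative split are assumptions made for illustration, not part of the assignment.

```r
# Synthetic land and structure values (NOT PLUTO data); the third lot is
# deliberately given a small structure on expensive land.
land <- c(1e4, 5e4, 1e6, 2e5, 8e5, 3e4)
bldg <- c(3e4, 1.6e5, 2e5, 6e5, 2.5e6, 9e4)

# Fit structure value against land value on a log-log scale.
fit <- lm(log10(bldg) ~ log10(land))

# Negative residual: less structure than the trend predicts (underbuilt).
# Positive residual: more structure than the trend predicts (overbuilt).
status <- ifelse(residuals(fit) < 0, "underbuilt", "overbuilt")
data.frame(land, bldg, residual = round(residuals(fit), 2), status)
```

Note that a single extreme lot pulls the fitted line toward itself, so a real analysis would probably want a tolerance band around zero rather than a strict sign test.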

ny_tax <- ny %>% 
  select(assesstot, assessland, zipcode) %>% 
  # structure value = total assessment minus land assessment
  mutate(assessbldg = assesstot - assessland) %>% 
  group_by(zipcode) %>% 
  summarise(total = sum(assesstot), land_total = sum(assessland), bldg_total = sum(assessbldg)) %>% 
  # ratio of structure value to land value per zip code
  mutate(BldgtoLand = bldg_total / land_total) %>% 
  # rank 1 = lowest total land assessment; keep the ten lowest
  mutate(rank = rank(land_total)) %>% 
  arrange(rank) %>% 
  filter(rank <= 10)

p <- ggplot(ny_tax, aes(x=zipcode, y=rank)) +
  geom_tile(aes(fill=BldgtoLand))
 

p <- ggplotly(p)
  
p
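The bivariate choropleth idea mentioned in the question amounts to classifying each area on two variables at once. A minimal sketch of that classification step, assuming tercile breaks and nine synthetic areas (neither taken from PLUTO), might look like this. The helper `tercile()` is hypothetical and defined here only for the example.

```r
# Synthetic per-area values (NOT PLUTO data), nine hypothetical areas.
land <- c(10, 20, 30, 40, 50, 60, 70, 80, 90)
bldg <- c(90, 15, 35, 80, 45, 20, 75, 10, 85)

# Split a variable into terciles: 1 = low, 2 = medium, 3 = high.
tercile <- function(v) {
  cut(v, breaks = quantile(v, probs = c(0, 1/3, 2/3, 1)),
      labels = FALSE, include.lowest = TRUE)
}
land_class <- tercile(land)
bldg_class <- tercile(bldg)

# Combine into one of 3 x 3 = 9 bivariate classes, e.g. "3-1" means
# high land value with low structure value (a candidate underbuilt area).
bi_class <- paste(land_class, bldg_class, sep = "-")
data.frame(land, bldg, bi_class)
```

In a bivariate map each of the nine classes gets its own color. Classes on the diagonal ("1-1", "2-2", "3-3") are areas where structure value tracks land value, while "3-1" and "1-3" mark the underbuilt and overbuilt extremes.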