library(dplyr)
library(ggplot2)
library(ggthemes)
library(kableExtra)
library(gridExtra)
library(plotly)
library(tidyverse)
library(maps)
library(ggmap)
library(mapdata)
library(bigvis)   # loads Rcpp

(Package startup messages omitted. Note the masking conflicts they reported: plotly::filter() masks dplyr::filter() and stats::filter(), gridExtra::combine() masks dplyr::combine(), kableExtra::group_rows() masks dplyr::group_rows(), and maps::map() masks purrr::map().)
For module 2 we’ll be looking at techniques for dealing with big data: in particular, binning strategies and the datashader library (which arguably proves we’ll never need to bin large data for visualization again).
To demonstrate these concepts we’ll be looking at the PLUTO dataset put out by New York City’s Department of City Planning. PLUTO contains data about every tax lot in New York City.
PLUTO data can be downloaded from here. Unzip the files to the same directory as this notebook, and you should be able to read them in using the code below (or something very similar). Also take note of the data dictionary; it’ll come in handy for this assignment.
This is big data, too big for my GitHub repository to store, so the code below downloads it directly.
# Locate the URL and download the PLUTO zip file
pluto_url <- 'http://www1.nyc.gov/assets/planning/download/zip/data-maps/open-data/'
zip_file <- 'nyc_pluto_17v1_1.zip'                     # the current PLUTO release
download.file(paste0(pluto_url, zip_file), zip_file)   # download the PLUTO data
unzip(zip_file)                                        # unzip the files
sub_dir <- 'BORO_zip_files_csv/'                       # subdirectory created by unzipping
boro <- c('BK','BX','MN','QN','SI')                    # borough prefixes of the per-borough CSV files
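The chunk above stops at unzipping. To get a single data frame to work with, something like the following sketch should do (the file names BK.csv, BX.csv, etc. inside BORO_zip_files_csv/ are an assumption; check your unzipped folder). Later chunks refer to the data as both pluto and nydata, so both names are assigned here.

# Read each borough CSV and stack them into one data frame.
# NOTE: the per-borough file names are an assumption -- verify against the unzipped directory.
pluto <- do.call(rbind, lapply(boro, function(b) {
  read.csv(paste0(sub_dir, b, '.csv'), stringsAsFactors = FALSE)
}))
nydata <- pluto   # alias used by later chunks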
After a few building collapses, the City of New York is going to begin investigating older buildings for safety. The city is particularly worried about buildings that were unusually tall when they were built, since best practices for safety hadn’t yet been determined. Create a graph that shows how many buildings of a certain number of floors were built in each year (note: you may want to use a log scale for the number of buildings). Find a strategy to bin buildings. (It should be clear that 20-29-story, 30-39-story, and 40-49-story buildings were first built in large numbers, but does it make sense to continue in this way as you get taller?)
Before binning, the YearBuilt field needs cleaning: the summary below shows a minimum of 0 and a maximum of 2040, and the counts show 44,180 lots with a YearBuilt of 0 and one lot with a YearBuilt after 2016.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       0    1920    1930    1841    1955    2040
## # A tibble: 1 x 1
##       n
##   <int>
## 1 44180
## # A tibble: 1 x 1
##       n
##   <int>
## 1     1
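The chunk that produced those numbers isn’t echoed in the notebook; a minimal reconstruction, assuming the combined pluto data frame from the earlier sketch, would look like this:

summary(pluto$YearBuilt)                          # spot the 0 and 2040 values
pluto %>% filter(YearBuilt == 0) %>% count()      # lots with no recorded year
pluto %>% filter(YearBuilt > 2016) %>% count()    # lots with an impossible future year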
subsetpluto <- subset(pluto, YearBuilt != 0)           # drop lots with no recorded year
subsetpluto <- subset(subsetpluto, YearBuilt <= 2016)  # drop the impossible future year
summary(subsetpluto$YearBuilt)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1661    1920    1931    1941    1960    2016
The following condenses the number of buildings built since 1850 into ten-year bins.
build1850 <- subset(subsetpluto, YearBuilt > 1850)
combine1850 <- condense(bin(build1850$YearBuilt, 10))  # bin years into 10-year (decade) intervals

## Summarising with count
##    build1850.YearBuilt .count
## 1                   NA      0
## 2               1854.5     86
## 3               1864.5     83
## 4               1874.5    113
## 5               1884.5    333
## 6               1894.5  25593
## 7               1904.5  46151
## 8               1914.5  67537
## 9               1924.5 181404
## 10              1934.5 140977
## 11              1944.5  66869
## 12              1954.5  80837
## 13              1964.5  63059
## 14              1974.5  32745
## 15              1984.5  27514
## 16              1994.5  29219
## 17              2004.5  43575
## 18              2014.5   7629

##  build1850.YearBuilt     .count
##  Min.   :1854        Min.   :     0
##  1st Qu.:1894        1st Qu.:  2157
##  Median :1934        Median : 30982
##  Mean   :1934        Mean   : 45207
##  3rd Qu.:1974        3rd Qu.: 65917
##  Max.   :2014        Max.   :181404
##  NA's   :1
When were most buildings constructed?
comb1850 <- within(combine1850, cumulative_sum <- cumsum(.count))
morebuilt <- sum(combine1850$.count) / 2   # half of all buildings
yearsbuilt <- filter(comb1850, cumulative_sum > morebuilt)
baseyear <- min(yearsbuilt[1])             # base year for comparison: first decade bin past the halfway point (1934)

The plot below shows when most buildings were constructed since 1850. It seems most buildings were constructed around 1920 to 1929.
plotbuilt <- autoplot(combine1850) +
  geom_vline(aes(xintercept = baseyear), colour = "red") +
  annotate("text", x = 1945, y = 150000, label = baseyear, colour = "red", size = 4) +
  labs(title = "Count of Buildings Built in NYC Since 1850", x = "Year", y = "Lots Built")
plotbuilt

Number of floors by building
nyfloors <- nydata %>%
  filter(YearBuilt >= 1850, YearBuilt < 2020) %>%
  select(YearBuilt, BBL, NumFloors) %>%
  mutate(NumFloors = round(NumFloors, -1)) %>%   # round floors to the nearest ten to form height bins
  group_by(YearBuilt, NumFloors) %>%
  count() %>%                                    # buildings per year per floor bin
  filter(NumFloors >= 20, NumFloors <= 100) %>%
  summarise(Floors = sum(n))

The interactive graph below shows, for each floor bin, how many buildings were built each year between 1850 and 2020.
plotfloors <- ggplot(nyfloors, aes(x = YearBuilt, y = Floors)) +
  geom_point(position = 'jitter') +
  theme_bw() +
  scale_y_continuous(trans = 'log10') +          # log scale, as the question suggests
  scale_x_continuous(breaks = seq(1850, 2020, 20)) +
  labs(x = 'Year Built', title = 'NYC Buildings By Floors') +
  facet_wrap(~ NumFloors)
plotfloors <- ggplotly(plotfloors)
plotfloors
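On the binning question itself: rounding to the nearest ten works up to roughly 50 floors, but above that the counts get so sparse that fixed-width bins mostly hold zeros. One alternative, sketched below (not part of the original notebook), is to use cut() with a single wide bin for everything taller:

# Hypothetical alternative binning: decade-wide bins up to 50 floors,
# then one open-ended bin for the rare taller buildings.
nyfloors_alt <- nydata %>%
  filter(YearBuilt >= 1850, YearBuilt < 2020, NumFloors >= 20) %>%
  mutate(FloorBin = cut(NumFloors,
                        breaks = c(20, 30, 40, 50, Inf),
                        labels = c('20-29', '30-39', '40-49', '50+'),
                        right = FALSE)) %>%
  count(YearBuilt, FloorBin)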
You work for a real estate developer and are researching underbuilt areas of the city. After looking in the PLUTO data dictionary, you’ve discovered that all tax assessments consist of two parts: the assessment of the land and the assessment of the structure. You reason that there should be a correlation between these two values: more valuable land will have more valuable structures on it (more valuable in this case refers not just to a mansion vs. a bungalow, but to an apartment tower vs. a single-family home). Deviations from the norm could represent underbuilt or overbuilt areas of the city. You also recently read a really cool blog post about bivariate choropleth maps, and think the technique could be used for this problem.
Datashader is really cool, but it’s not that great at labeling your visualization. Don’t worry about providing a legend, but provide a quick explanation as to which areas of the city are overbuilt, which are underbuilt, and which are built in a way that’s properly correlated with their land value.
citytax <- nydata %>%
  select(AssessTot, AssessLand, ZipCode) %>%
  mutate(AssessBldg = AssessTot - AssessLand) %>%   # structure value = total minus land
  group_by(ZipCode) %>%
  summarise(AssessTot = sum(AssessTot),
            AssessLand = sum(AssessLand),
            AssessBldg = sum(AssessBldg)) %>%
  mutate(BldgLand = AssessBldg / AssessLand,        # building-to-land value ratio
         Freq = rank(AssessLand)) %>%               # rank zip codes by total land value
  arrange(Freq) %>%
  filter(Freq <= 15)                                # keep the 15 lowest-land-value zip codes

The interactive graph below shows areas of the city that are underbuilt or overbuilt, by zip code. The areas around 11697 (Rockaway), 10803 (Pelham) and 10464 (Bronx) seem to be underbuilt, while zip codes 11109 (Long Island City) and 11241 (Brooklyn) seem to be overbuilt.
overbuilt <- ggplot(citytax, aes(x = ZipCode, y = Freq)) +
  geom_tile(aes(fill = BldgLand)) +
  scale_x_continuous(breaks = seq(10000, 12000, by = 200)) +
  scale_y_continuous(breaks = seq(1, 10, by = 1))
overbuilt <- ggplotly(overbuilt, tooltip = c("x", "y", "fill"))
overbuilt
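The question mentions bivariate choropleth maps; the tile plot above approximates the idea without a map. As a sketch of the underlying technique (not code from the original notebook, and assuming the citytax columns defined above), a 3x3 bivariate classification splits land and building assessments into terciles and crosses them, so that, for example, high land value paired with low building value flags an underbuilt zip code:

# Hypothetical 3x3 bivariate classification, the basis of a bivariate choropleth.
bivar <- citytax %>%
  mutate(land_t = ntile(AssessLand, 3),   # 1 = low, 3 = high land value
         bldg_t = ntile(AssessBldg, 3),   # 1 = low, 3 = high building value
         class  = paste(land_t, bldg_t, sep = '-'))
# A class of "3-1" means valuable land with cheap structures (underbuilt);
# "1-3" means cheap land with valuable structures (overbuilt);
# diagonal classes ("1-1", "2-2", "3-3") are properly correlated.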