DATA 608: Week 3 Visualization

The Assignment
Setup R workspace
Getting, cleaning the data
Plot 1: Old Buildings
Plot 2: March of the High Rises
Plot 3: WWII Buildings
Coda

The Assignment

This week, we’re asked to visually query a large cache of New York City building data using the RStudio/dplyr/bigvis tool set. This was fun but at times frustrating, particularly getting just the right look on plots. ggplot2 seems infinitely configurable, until you get stumped on something seemingly trivial, like repositioning a plot title. Still, it’s a challenge.

For most data projects, getting and preparing the data takes the bulk of the time. Here we had a pretty clean dataset. Most of the time was spent tweaking the plots and making sure that binning under bigvis worked properly. Docs are skimpy.

In this document, most of the data manipulation and coding will be hidden; it’s accessible in the .rmd file. I’ll show the plots with some commentary about each one.

Setup R workspace

We ended up using 11 R packages. I’ll echo them so you can see. The core packages were ggplot2 and bigvis.

library(devtools)
library(ggplot2)
library(dplyr)
library(data.table)
library(ggthemes)
library(psych)
library(car)
library(bigvis)
library(scales)
library(RColorBrewer)
library(colorspace)

Getting, cleaning the data

We downloaded the zip file and used csvkit to merge the separate borough files in the terminal. Per the assignment, those steps are not shown. We checked borough counts and looked for missing data in key fields. We limited the data to buildings constructed in 1850 or later and fixed some glitches (e.g., Year Built = 2040).

Read 858370 rows and 84 (of 84) columns from 0.255 GB file in 00:00:06
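
As a rough illustration of the checks and filtering described above, here is a minimal base R sketch on a toy data frame. The rows are invented, the column names Borough and NumFloors are assumptions (only YearBuilt is named in this write-up), and dropping impossible years is just one way to handle glitches; it is not the hidden code from the .rmd file.

```r
# Toy stand-in for the merged borough file; all values are invented.
buildings <- data.frame(
  Borough   = c("MN", "BK", "QN", "BX", "SI", "MN"),
  YearBuilt = c(1848, 1901, 2040, 1931, 0, 1987),
  NumFloors = c(3, 4, 2, 6, 1, 40)
)

# Quick checks: rows per borough and missing values in the key field.
print(table(buildings$Borough))
print(sum(is.na(buildings$YearBuilt)))

# Keep plausible years only: 1850 or later, and not in the future.
# This drops glitches such as YearBuilt = 2040 or YearBuilt = 0.
current_year <- as.integer(format(Sys.Date(), "%Y"))
clean <- subset(buildings, YearBuilt >= 1850 & YearBuilt <= current_year)
print(clean)
```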

Plot 1: Old Buildings

The question here is to find a “cut-off date before most city buildings were constructed.” Normally that would be a median, but here the data is problematic.

After a bunch of exploratory plots, and after diving into the data dictionary, it became clear that New York did a crap job of recording when a building was finished. For 19th- and early 20th-century structures, ‘Year Built’ was more accurately ‘Decade Built’. So year-level precision is a problem, and the choice of bin width matters.

To illustrate – this is not our answer, just a preliminary plot – look at the spiking by year. Things remained a mess until the 1990s.
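
One rough way to put a number on that spiking is to check how many records carry a year ending in 0. If years were recorded precisely, roughly one in ten would; a much larger share hints at rounding to the decade. The sketch below uses invented years, not the city data, and is only meant to show the idea.

```r
# Invented years for illustration; the real input would be the YearBuilt column.
year_built <- c(1900, 1900, 1910, 1920, 1920, 1920, 1925, 1931, 1987, 2003)

# Share of records whose year ends in 0 (a crude heaping indicator).
share_round <- mean(year_built %% 10 == 0)
print(share_round)

# Counts by single year; heaped years stand out as spikes.
print(table(year_built))
```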

Here is our plot answering the question, “Build a graph to help the city determine when the most buildings were constructed.” To smooth the data, we binned building completion (variable YearBuilt) into 5-year increments. The dotted vertical line marks the median: half of New York’s buildings were completed before 1931. The solid vertical lines demarcate the monstrous chunk of buildings completed between 1917 and 1937, the era of the skyscraper. These older buildings are a possible safety concern. This plot relies on binning using the condense() function in bigvis.
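
The plot itself was built with bigvis, but the underlying idea of 5-year bins plus a median can be sketched in base R. This is a swapped-in illustration on invented years, not the bigvis call used for the plot; the bin origin (multiples of 5) is an assumption.

```r
# Invented completion years; the real input would be the YearBuilt column.
year_built <- c(1899, 1901, 1920, 1922, 1924, 1931, 1933, 1958, 1987, 2003)

# Assign each year to the start of its 5-year bin (e.g. 1922 -> 1920).
bin_width <- 5
bin_start <- floor(year_built / bin_width) * bin_width

# Count buildings per bin and tidy the result into a data frame.
counts <- as.data.frame(table(bin_start), stringsAsFactors = FALSE)
names(counts) <- c("bin_start", "n")
counts$bin_start <- as.integer(counts$bin_start)
print(counts)

# Half of the buildings were completed before this year.
median_year <- median(year_built)
print(median_year)
```

Binning first and plotting counts per bin is what keeps the decade-rounded years from dominating the picture.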

Plot 2: March of the High Rises

We are asked to “create a graph that shows how many buildings of a certain number of floors” were built in the city. The plot below shows everything taller than a 5-story walkup. Red marks buildings below the median number of stories and blue those above it. Four peaks in high-rise construction are evident: the Roaring ’20s, the post-recession ’80s, the mid-1990s and the years after the Great Recession.
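
The split behind the two colors can be sketched in a few lines of base R. The data below are invented, NumFloors is an assumed column name, and grouping by decade is a simplification chosen for the example rather than the binning used in the plot.

```r
# Invented buildings; NumFloors is an assumed column name.
bldg <- data.frame(
  YearBuilt = c(1925, 1928, 1931, 1962, 1984, 1986, 1996, 2011, 2012, 1950),
  NumFloors = c(12, 30, 8, 6, 20, 45, 7, 50, 9, 3)
)

# Keep everything taller than a 5-story walkup.
tall <- subset(bldg, NumFloors > 5)

# Split at the median number of stories among the remaining buildings.
median_floors <- median(tall$NumFloors)
tall$height_group <- ifelse(tall$NumFloors > median_floors,
                            "above median", "at or below median")

# Count buildings per decade and height group.
tall$decade <- floor(tall$YearBuilt / 10) * 10
counts <- table(tall$decade, tall$height_group)
print(counts)
```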

Plot 3: WWII Buildings

Were buildings constructed during the World War II years shoddier? The thesis is that their assessed values are lower, and that a low assessed value is a proxy for poor construction. The plot bins buildings in 5-year increments based on assessed value per floor. In fact, the 1941-45 bin is in a trough. The curmudgeonly boss appears to be correct. We experimented with the peel() function in bigvis, which deletes the outlying edges of the data and makes the plot more compact. That seemed fake, so we dropped it. The dramatic effect of outliers is clear. The median for the bins is shown by the dotted line; unfortunately, I couldn’t figure out how to label it in ggplot2. Though not shown, some buildings in the late 19th century had even higher per-floor assessed values.
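
For what it’s worth, here is a minimal sketch of the same kind of summary with a labeled dotted line. The numbers are invented, AssessTot and NumFloors are assumed column names, the bins are built in base R rather than bigvis, and the dotted line is assumed to be horizontal. ggplot2’s annotate("text", ...) is one way to put a label next to a reference line.

```r
library(ggplot2)

# Invented buildings; AssessTot and NumFloors are assumed column names.
bldg <- data.frame(
  YearBuilt = c(1931, 1933, 1936, 1938, 1941, 1943, 1944, 1947, 1949, 1952),
  AssessTot = c(900, 1200, 800, 1000, 300, 450, 400, 700, 900, 1100) * 1000,
  NumFloors = c(3, 4, 2, 4, 2, 3, 2, 2, 3, 4)
)
bldg$value_per_floor <- bldg$AssessTot / bldg$NumFloors

# Five-year bins aligned so that one bin covers 1941-45.
bldg$bin_start <- floor((bldg$YearBuilt - 1) / 5) * 5 + 1

# Median value per floor within each bin, then the median across bins.
by_bin <- aggregate(value_per_floor ~ bin_start, data = bldg, FUN = median)
overall_median <- median(by_bin$value_per_floor)
print(by_bin)

# Bar per bin, dotted reference line, and a text label sitting just above it.
p <- ggplot(by_bin, aes(x = bin_start, y = value_per_floor)) +
  geom_col() +
  geom_hline(yintercept = overall_median, linetype = "dotted") +
  annotate("text", x = min(by_bin$bin_start), y = overall_median,
           label = "median of bins", hjust = 0, vjust = -0.5) +
  labs(x = "Five-year bin (start year)",
       y = "Median assessed value per floor")
print(p)
```

Using the per-bin median rather than the mean is a deliberate choice here: it blunts the effect of a few very high assessments without deleting any rows.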

Coda

We tried adjusting the values for dollar inflation, but it didn’t make much of a difference. We struggled with the intricacies of ggplot2: how to annotate lines, change the length of lines, update themes, etc. Clearly, there are many challenges in learning this graphics vocabulary.