IS608_Hw2

Load packages

Load data

Data Analysis & Presentation

Question 1

After a few building collapses, the City of New York is going to begin investigating older buildings for safety. However, the city has a limited number of inspectors, and wants to find a ‘cut-off’ date before most city buildings were constructed.

Build a graph to help the city determine when most buildings were constructed. Is there anything in the results that causes you to question the accuracy of the data? (note: only look at buildings built since 1850)

First we keep only buildings constructed after 1850. We also filter by the following criteria:

Lot Area is > 100sf
Assessed Value is < $10,000,000
At least 1 floor

After aggregating the data into 5-year bins, it becomes clear that the bins from 1852 through 1892 contain too few buildings (25 to 478 each) to yield statistics comparable to those collected from about 1900 onward, where each bin shown holds more than 10,000 buildings. The 1867-72 bin has a mean lot area (about 89,930 sf) that is a significant outlier, which may be an artifact of its small building count (34).

Therefore, 1900 is used as the preferred cutoff date.
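The point about small bins can be illustrated with a minimal base-R sketch. The data frame `toy` below is synthetic and made up purely for illustration (it is not the building data), and the bins are formed with `floor(YearBuilt / 5) * 5`, which is an assumption and not necessarily how `bin()` places its edges. It shows how one very large lot in a four-building bin pulls the mean far from the median.

```r
# Synthetic data for illustration only; these are NOT the real building records.
toy <- data.frame(
  YearBuilt = c(1865, 1866, 1867, 1868, 1900, 1901, 1902, 1903),
  LotArea   = c(2500, 3000, 2000, 350000, 2400, 2600, 3100, 2900)
)

# Label each building with the first year of its 5-year bin (assumed binning rule).
toy$bin <- floor(toy$YearBuilt / 5) * 5

# Count, mean, and median lot area per bin.
n_by_bin      <- tapply(toy$LotArea, toy$bin, length)
mean_by_bin   <- tapply(toy$LotArea, toy$bin, mean)
median_by_bin <- tapply(toy$LotArea, toy$bin, median)

bins <- data.frame(
  bin    = as.integer(names(n_by_bin)),
  count  = as.vector(n_by_bin),
  mean   = as.vector(mean_by_bin),
  median = as.vector(median_by_bin)
)
print(bins)
```

In this toy example both bins have the same median, but the mean of the earlier bin is dominated by a single lot. Reporting the median alongside the mean is one way to check whether a spike like the one in the 1867-72 bin reflects the typical building or a handful of extreme records.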

# does lot area change with year of construction?
pData1 <- pData %>%
  # the bare `YearBuilt` condition was redundant with YearBuilt > 1850 and is not a logical test
  filter(YearBuilt > 1850, LotArea > 100, AssessTot < 10000000, NumFloors != 0) %>%
  select(LotArea, YearBuilt)

# let's plot 5 year averages
yr <- with(pData1, condense(bin(YearBuilt, 5), z=LotArea))

## Summarising with mean

yr <- yr[complete.cases(yr),]
# notice the jump in building count ~1900
head(yr,15)

##    YearBuilt .count     .mean
## 2       1852     37  6051.973
## 3       1857     47  5107.191
## 4       1862     42  3050.976
## 5       1867     34 89930.441
## 6       1872     87  9534.782
## 7       1877     25  4994.480
## 8       1882    220  5271.145
## 9       1887     88 15491.273
## 10      1892    478  4087.471
## 11      1897  25223  2474.517
## 12      1902  36687  3204.654
## 13      1907  10191  3778.891
## 14      1912  49448  3154.475
## 15      1917  18186  3719.373
## 16      1922  97988  3457.494

autoplot(yr) + xlim(1850, 2014) + ylim(0, 100000) + ylab('Lot Area')

ggplot(yr) + geom_line(aes(x=YearBuilt, y=.mean)) + 
  geom_point(aes(x=YearBuilt, y=.mean, color = .count)) +
  xlim(1900, 2014) + ylim(0, 10000) + ylab('Lot Area') + 
  labs(title="NYC Building Average Lot Size: Log-count of Occurrences (n)", x="Year Built", y="Lot Size sqft") +
  scale_color_gradient(trans = "log")

## Warning: Removed 10 rows containing missing values (geom_path).

## Warning: Removed 10 rows containing missing values (geom_point).

Question 2

The city is particularly worried about buildings that were unusually tall when they were built, since best-practices for safety hadn’t yet been determined.

Create a graph that shows how many buildings of a certain number of floors were built in each year (note: you may want to use a log scale for the number of buildings). It should be clear when 20-story buildings, 30-story buildings, and 40-story buildings were first built in large numbers.

# Aggregate counts after grouping by YearBuilt and NumFloors
pData2 <- pData %>%
  filter(YearBuilt > 1850, LotArea > 100, NumFloors != 0) %>%
  select(NumFloors, YearBuilt) %>%
  group_by(YearBuilt, NumFloors) %>%
  tally() %>%
  arrange(YearBuilt, desc(NumFloors), n)
# store log-counts so the fill scale is not dominated by the most common heights
pData2$n <- log(pData2$n)

# n is already log-transformed above, so no further scale transform is applied;
# the original scale_color_gradient() call had no effect because only fill is mapped
ggplot(data=pData2, aes(x=YearBuilt, y=NumFloors, fill=n)) +
  geom_bar(position = "dodge", stat="identity") +
  labs(title="NYC Building Height by Year: Log-count of Occurrences (n)", x="Year Built", y="Number of Floors", fill="log(n)")
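As a numeric complement to the chart, the question of when buildings of a given height were "first built in large numbers" can be sketched as a small lookup. Everything here is an assumption for illustration: the data frame `toy` is synthetic, the helper `first_year_with()` is hypothetical, and the threshold `min_count = 2` is arbitrary (a real analysis would pick a threshold suited to the data and might count per 5-year bin instead of per year).

```r
# Synthetic data for illustration only; these are NOT the real building records.
toy <- data.frame(
  YearBuilt = c(1905, 1905, 1910, 1912, 1912, 1912, 1925, 1925, 1930, 1930, 1930),
  NumFloors = c(20,   6,    22,   21,   25,   30,   32,   41,   45,   40,   42)
)

# Hypothetical helper: earliest year in which at least `min_count` buildings
# with `min_floors` or more floors were built. Returns NA if no year qualifies.
first_year_with <- function(df, min_floors, min_count) {
  tall   <- df[df$NumFloors >= min_floors, ]
  counts <- table(tall$YearBuilt)
  hits   <- as.integer(names(counts)[counts >= min_count])
  if (length(hits) == 0) NA_integer_ else min(hits)
}

thresholds  <- c(20, 30, 40)
first_years <- sapply(thresholds, function(k) first_year_with(toy, k, min_count = 2))
names(first_years) <- paste0(thresholds, "+ floors")
print(first_years)
```

On the toy data this prints 1912, 1925, and 1930 for the three thresholds. The same function could be pointed at the filtered building data before the counts are log-transformed.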

Question 3

Your boss suspects that buildings constructed during the US’s involvement in World War II (1941-1945) are more poorly constructed than those before and after the war due to the high cost of materials during those years. She thinks that, if you calculate assessed value per floor, you will see lower values for buildings at that time vs before or after. Construct a chart/graph to see if she’s right.

There are a few caveats about the data here. First, we don’t know whether the assessed values are adjusted for the changing value of the dollar, so the economic comparison may not be valid. Second, the bins before ~1900 contain very few buildings, as was investigated in Question 1, although it may still be useful to see that data. Finally, prices fluctuate in normal cycles, so attributing poor construction to the high cost of materials may not be valid.

With these caveats in mind, there is a dip in value per floor during the war years, so material costs could have been a contributing factor.
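One way to put a number on such a dip is to compare the typical value per floor in the war years with the windows just before and after. The sketch below uses a synthetic data frame `toy` whose values were chosen by hand to show the mechanics, so it says nothing about the real data. The five-year comparison windows are an assumption, and the median is used here in place of the mean because it is less sensitive to a few very expensive buildings.

```r
# Synthetic data for illustration only; these are NOT the real building records.
toy <- data.frame(
  YearBuilt = c(1936, 1938, 1939, 1941, 1943, 1945, 1947, 1949, 1950),
  AssessTot = c(90000, 120000, 60000, 50000, 80000, 30000, 100000, 150000, 70000),
  NumFloors = c(3, 4, 2, 2, 4, 1, 2, 5, 2)
)
toy$ValPerFlr <- toy$AssessTot / toy$NumFloors

# Assign each building to a period; intervals are (1935,1940], (1940,1945], (1945,1950].
toy$period <- cut(
  toy$YearBuilt,
  breaks = c(1935, 1940, 1945, 1950),
  labels = c("1936-1940", "1941-1945", "1946-1950")
)

# Median assessed value per floor in each period.
medians <- tapply(toy$ValPerFlr, toy$period, median)
print(medians)
```

If the war-period median sits below both neighbors in the real data as well, that supports the reading of the chart, subject to the caveats above about dollar adjustment and cyclical prices.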

# does assessed value per floor change with year of construction?
pData3 <- pData %>%
  # drop records with a missing (zero) construction year
  filter(LotArea > 100, NumFloors != 0, YearBuilt != 0) %>%
  mutate(ValPerFlr = AssessTot/NumFloors) %>%
  select(YearBuilt, ValPerFlr)

# let's plot 5 year averages
yr <- with(pData3, condense(bin(YearBuilt, 5), z=ValPerFlr))

## Summarising with mean

ggplot(yr) + geom_line(aes(x=YearBuilt, y=.mean)) + 
  geom_point(aes(x=YearBuilt, y=.mean, color = .count)) +
  xlim(1850, 2014) + ylim(0, 500000) +
  labs(title="NYC Building Costs:  Value Per Floor in Dollars", x="Year Built", y="Value in Dollars") +
  scale_color_gradient(trans = "log")

## Warning: Removed 39 rows containing missing values (geom_path).

## Warning: Removed 39 rows containing missing values (geom_point).