After a few building collapses, the City of New York is going to begin investigating older buildings for safety. However, the city has a limited number of inspectors, and wants to find a ‘cut-off’ date before most city buildings were constructed.
Build a graph to help the city determine when most buildings were constructed. Is there anything in the results that causes you to question the accuracy of the data? (note: only look at buildings built since 1850)
First, we keep only buildings constructed after 1850. We also filter by the following criteria . . .
After aggregating the data into 5-year bins, it becomes clear that the building counts in the bins from 1852 to 1892 are too small to provide statistics comparable to those collected after 1900. One bin, 1867-72, has a mean that is a significant outlier, which may be related to its small building count.
Therefore, 1900 is used as the preferred cutoff date.
# does lot area change with year of construction?
pData1 <- pData %>%
filter(YearBuilt > 1850, LotArea > 100, AssessTot < 10000000, NumFloors != 0) %>%
select(LotArea, YearBuilt)
# let's plot 5 year averages
yr <- with(pData1, condense(bin(YearBuilt, 5), z=LotArea))
## Summarising with mean
yr <- yr[complete.cases(yr),]
# notice the jump in building count ~1900
head(yr,15)
## YearBuilt .count .mean
## 2 1852 37 6051.973
## 3 1857 47 5107.191
## 4 1862 42 3050.976
## 5 1867 34 89930.441
## 6 1872 87 9534.782
## 7 1877 25 4994.480
## 8 1882 220 5271.145
## 9 1887 88 15491.273
## 10 1892 478 4087.471
## 11 1897 25223 2474.517
## 12 1902 36687 3204.654
## 13 1907 10191 3778.891
## 14 1912 49448 3154.475
## 15 1917 18186 3719.373
## 16 1922 97988 3457.494
autoplot(yr) + xlim(1850, 2014) + ylim(0, 100000) + ylab('Lot Area')
ggplot(yr) + geom_line(aes(x=YearBuilt, y=.mean)) +
geom_point(aes(x=YearBuilt, y=.mean, color = .count)) +
xlim(1900, 2014) + ylim(0, 10000) + ylab('Lot Area') +
labs(title="NYC Building Average Lot Size: Log-count of Occurrences (n)", x="Year Built", y="Lot Size sqft") +
scale_color_gradient(trans = "log")
## Warning: Removed 10 rows containing missing values (geom_path).
## Warning: Removed 10 rows containing missing values (geom_point).
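As a rough check on the 1900 cutoff, the bin counts printed above can be totalled directly. The sketch below copies the `.count` column from the `head(yr, 15)` output by hand. The names `bin_start`, `bin_count`, `early_total`, `all_total`, and `early_share` are introduced here for illustration, and the result covers only those 15 bins, not the full dataset.

```r
# Bin start years and building counts copied by hand from the
# head(yr, 15) output above (only the first 15 bins, not all of yr).
bin_start <- seq(1852, 1922, by = 5)
bin_count <- c(37, 47, 42, 34, 87, 25, 220, 88, 478,
               25223, 36687, 10191, 49448, 18186, 97988)

# Buildings in bins that start before 1897, i.e. before the jump in counts.
early_total <- sum(bin_count[bin_start < 1897])
all_total   <- sum(bin_count)

# Share of the buildings in these 15 bins that fall in the early bins.
early_share <- early_total / all_total

print(early_total)
print(round(100 * early_share, 2))
```

The nine bins from 1852 to 1892 hold 1,058 of the 238,781 buildings in these 15 bins, well under 1%, which is consistent with treating the pre-1900 averages as too thinly supported to compare with later ones.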
The city is particularly worried about buildings that were unusually tall when they were built, since best-practices for safety hadn’t yet been determined.
Create a graph that shows how many buildings of a certain number of floors were built in each year (note: you may want to use a log scale for the number of buildings). It should be clear when 20-story buildings, 30-story buildings, and 40-story buildings were first built in large numbers.
# Aggregate counts after grouping by YearBuilt and NumFloors
pData2 <- pData %>%
filter(YearBuilt > 1850, LotArea > 100, NumFloors != 0) %>%
select(NumFloors, YearBuilt) %>%
group_by(YearBuilt, NumFloors) %>%
tally() %>%
arrange(YearBuilt,desc(NumFloors),n)
pData2$n <- log(pData2$n)
ggplot(data=pData2, aes(x=YearBuilt, y=NumFloors, fill=n)) +
geom_bar(aes(fill = n), position = "dodge", stat="identity") +
labs(title="NYC Building Height by Year: Log-count of Occurrences (n)", x="Year Built", y="Number of Floors")
# no scale_color_gradient(trans = "log") here: only fill is mapped, and n was already log-transformed above
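To read "first built in large numbers" off the data rather than off the chart alone, one option is a small helper like the one below. This is a minimal sketch, not part of the original analysis. The function `first_year_at_height`, its `min_count` threshold (an assumed stand-in for "large numbers"), the choice to count buildings with at least the given number of floors, and the data frame `toy_floors` are all hypothetical. The toy rows are made up for illustration and are not taken from `pData`.

```r
# Hypothetical helper: first year in which at least `min_count` buildings
# with `floors` or more storeys were built. `min_count` is an assumed
# stand-in for "large numbers"; choose a value that suits the data.
first_year_at_height <- function(df, floors, min_count = 5) {
  tall <- df[df$NumFloors >= floors, ]
  if (nrow(tall) == 0) return(NA_integer_)
  per_year <- table(tall$YearBuilt)            # buildings per year
  hits <- as.integer(names(per_year)[as.vector(per_year) >= min_count])
  if (length(hits) == 0) NA_integer_ else min(hits)
}

# Made-up toy data, for illustration only (not taken from pData).
toy_floors <- data.frame(
  YearBuilt = c(1905, 1912, 1912, 1912, 1930, 1930, 1931),
  NumFloors = c(22,   20,   25,   21,   41,   45,   40)
)

first_year_at_height(toy_floors, floors = 20, min_count = 3)
first_year_at_height(toy_floors, floors = 40, min_count = 2)
```

On the real data the same call could be applied to the filtered `YearBuilt` and `NumFloors` columns before they are tallied, with `min_count` tuned to whatever the city considers a large number.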
Your boss suspects that buildings constructed during the US’s involvement in World War II (1941-1945) are more poorly constructed than those before and after the war due to the high cost of materials during those years. She thinks that, if you calculate assessed value per floor, you will see lower values for buildings at that time vs before or after. Construct a chart/graph to see if she’s right.
There are a few caveats about the data here. First, we don’t know whether the dollar values are adjusted for inflation, so the economic comparison may not be valid. Second, the data sets before ~1900 are small, as investigated in Question 1, although it may still be useful to see the data. Finally, prices fluctuate cyclically, so attributing poor construction to the high cost of materials may not be valid.
With these caveats in mind, there is a dip during the war years, so material costs could have been a contributing factor.
# does assessed value per floor change with year of construction?
pData3 <- pData %>%
# drop records with a zero year; the bare `YearBuilt` condition is assumed to have meant this
filter(LotArea > 100, NumFloors != 0, YearBuilt != 0) %>%
mutate(ValPerFlr = AssessTot/NumFloors) %>%
select(YearBuilt, ValPerFlr)
# let's plot 5 year averages
yr <- with(pData3, condense(bin(YearBuilt, 5), z=ValPerFlr))
## Summarising with mean
ggplot(yr) + geom_line(aes(x=YearBuilt, y=.mean)) +
geom_point(aes(x=YearBuilt, y=.mean, color = .count)) +
xlim(1850, 2014) + ylim(0, 500000) +
labs(title="NYC Building Costs: Value Per Floor in Dollars", x="Year Built", y="Value in Dollars") +
scale_color_gradient(trans = "log")
## Warning: Removed 39 rows containing missing values (geom_path).
## Warning: Removed 39 rows containing missing values (geom_point).
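The dip can also be checked numerically by comparing mean value per floor during the war years with the years just before and after. The sketch below is one way to do that under stated assumptions. The helper `period_means`, its `window` parameter (how many years on either side to compare against), and the data frame `toy_value` are hypothetical, and the toy values are made up for illustration rather than taken from `pData3`. Like the chart, it uses unadjusted dollar values.

```r
# Hypothetical helper: mean value per floor before, during, and after a period.
# `window` is an assumed parameter: years on either side to compare against.
period_means <- function(df, start = 1941, end = 1945, window = 5) {
  period <- ifelse(df$YearBuilt < start, "before",
            ifelse(df$YearBuilt > end, "after", "during"))
  # keep only the period itself plus `window` years on each side
  keep <- df$YearBuilt >= start - window & df$YearBuilt <= end + window
  tapply(df$ValPerFlr[keep], period[keep], mean)
}

# Made-up toy data, for illustration only (not taken from pData3).
toy_value <- data.frame(
  YearBuilt = c(1938, 1939, 1942, 1944, 1947, 1949),
  ValPerFlr = c(100,  120,  80,   60,   110,  130)
)

res <- period_means(toy_value)
res[c("before", "during", "after")]
```

Applied to `pData3`, a "during" mean clearly below both neighbours would support the boss's suspicion, subject to the caveats above about inflation and cyclical prices.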