Mai_Yun_608

Module 2

We will be working with data about every tax lot in NYC, in a zip file. Please download it from http://www1.nyc.gov/site/planning/data-maps/open-data/dwn-pluto-mappluto.page. You will need to combine the data from each boro into one file called ‘all_PLUTO_data.R’ in order to complete this assignment. I encourage you to do this via code but it is not necessary to submit such code.

\(\bullet\) This assignment must be done in a single R script with ggplot2. Use of bigvis is encouraged but not required.

\(\bullet\) Please zip your code and images into a file, named ‘Last_First_hw2.zip’ (mine would be ‘Ferrari_Charley_hw2.zip’.

\(\bullet\) I have created some codes in the sample code file for you- this assumes the data is in a file called ‘all_PLUTO_data’ which you will need to create (as mentioned above).

install_github("devtools")
devtools::install_github("hadley/bigvis")

install.packages("data.table")
install.packages('ggthemes', dependencies = TRUE)

suppressWarnings(suppressMessages(library(ggplot2)))
suppressWarnings(suppressMessages(library(bigvis))) 
suppressWarnings(suppressMessages(library(data.table))) 
suppressWarnings(suppressMessages(library(ggthemes))) 
suppressWarnings(suppressMessages(library(scales)))
suppressWarnings(suppressMessages(library(utils)))

Data

suppressWarnings(suppressMessages(bx <- fread("https://raw.githubusercontent.com/YunMai-SPS/DA608_homework/master/DA608_Assignment_2/BX.csv")))
suppressWarnings(suppressMessages(mn <- fread("https://raw.githubusercontent.com/YunMai-SPS/DA608_homework/master/DA608_Assignment_2/MN.csv")))

# fread function will change the class of data attributes so read.csv function will be used for the bk,qn1,qn2
bk <- read.csv("https://raw.githubusercontent.com/YunMai-SPS/DA608_homework/master/DA608_Assignment_2/BK.csv")
qn1 <- read.csv("https://raw.githubusercontent.com/YunMai-SPS/DA608_homework/master/DA608_Assignment_2/QN1.csv")
qn2 <- read.csv("https://raw.githubusercontent.com/YunMai-SPS/DA608_homework/master/DA608_Assignment_2/QN2.csv")
si <- suppressWarnings(suppressMessages(fread("https://raw.githubusercontent.com/YunMai-SPS/DA608_homework/master/DA608_Assignment_2/SI.csv")))

all_PLUTO_data <-do.call("rbind", list(mn, bx, bk, qn1, qn2, si))
pluto <- all_PLUTO_data

write.csv(all_PLUTO_data.R,"all_PLUTO_data.csv")

Questions 1:

After a few building collapses, the City of New York is going to begin investigating older buildings for safety. However, the city has a limited number of inspectors, and wants to find a ‘cut-off’ date before most city buildings were constructed. Build a graph to help the city determine when most buildings were constructed. Is there anything in the results that causes you to question the accuracy of the data? (note: only look at buildings built since 1850)

Answer:

# get data between 1850 and 2017 from YearBuilt, and remove na and 0 
yb_1 <- pluto[which (pluto$YearBuilt >= 1850 & pluto$YearBuilt <= 2017 & pluto$YearBuilt != 0 & !is.na(pluto$YearBuilt)),]

# condense data
yb <- condense(bin(yb_1$YearBuilt, 5), z=yb_1$YearBuilt, summary="count")

# compare the data before and after condense
p1 <- ggplot(data=yb,aes(x=yb_1.YearBuilt,y=.count)) 
p2 <- p1 + geom_bar(stat="identity") 
p3 <- p2 + ggtitle("YearBuilt Distribution - Condensed")
p4 <- p3 + labs(x="YearBuilt", y="Count")
p4 + theme_tufte(base_size = 11, base_family = "serif", ticks = TRUE)

## Warning: Removed 1 rows containing missing values (position_stack).

p1 <- ggplot(data=yb_1,aes(x=YearBuilt)) 
p2 <- p1 + geom_histogram() 
p3 <- p2 + ggtitle("YearBuilt Distribution")
p4 <- p3 + labs(x="YearBuilt", y="Count")
p4 + theme_tufte(base_size = 11, base_family = "serif", ticks = TRUE)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Very few buildings built between 1850 to 1900 are still standing in the city. The reason for this could be that those building were destroied because they were too old. It is also possible that the built times were not recorded so the data was missing. If the missing YearBuilt data comes mostly from between 1850 and 1900, the data is not accurate.

The distribution patterns before and after condense the data are similar . I will use the condensed data for further analysis.

# calculate the cummulation of housing built through the years
yb$cum <- cumsum(yb$.count)
# calculate the percentage of the cummulated number of housing in the total number of housing
yb$percent <- round(yb$cum/sum(yb$.count),digits=4)*100

# find the year cut-off for 90% of housing were built
yb_filter <- yb[which(yb$percent>=90),]
cutoff_90 <- yb_filter[which(yb_filter$percent== min(yb_filter$percent)),]$yb_1.YearBuilt
print(paste("90% of the buildings were constructed before",cutoff_90))

## [1] "90% of the buildings were constructed before 1987"

# find the cutoff year from the figure
p1 <- ggplot(data=yb,aes(reorder(x=yb_1.YearBuilt,-percent),y=percent))
p2 <- p1 + geom_bar(stat="identity") 
p3 <- p2 + ggtitle("Percentage of Housing Built Through the Years") 
p4 <- p3 + labs(x="YearBuilt",y="Percentage")
p5 <- p4 + geom_hline(yintercept = 90)
p6 <- p5 + coord_flip()
p6 + theme_tufte(base_size = 11, base_family = "serif", ticks = TRUE)

50% of the buildings were constructed before 1927. 90% of the buildings were constructed before 1987. With the limation of inspectors, those buildings built before 1987 could be inspected.

Questions 2:

The city is particularly worried about buildings that were unusually tall when they were built, since best-practices for safety hadn’t yet been determined. Create a graph that shows how many buildings of a certain number of floors were built in each year (note: you may want to use a log scale for the number of buildings). It should be clear when 20-story buildings, 30-story buildings, and 40-story buildings were first built in large numbers.

Answer:

# remove records with 0 as number of floors
fl_1 <- pluto[which(pluto$NumFloors != 0 & !is.na(pluto$NumFloors) & pluto$YearBuilt <= 2017 & pluto$YearBuilt != 0 & !is.na(pluto$YearBuilt)),]

# condense data
fl_2 <- condense(bin(fl_1$NumFloors, 5), z=fl_1$NumFloors, summary="count")
fl_3 <- fl_2[!is.na(fl_2$fl_1.NumFloors),] # remove the NA generated by condense

# compare before and after condensing data

p1 <- ggplot(fl_1,aes(x=fl_1$NumFloors))
p2 <- p1 + geom_density(aes(y=..count..), color="black", fill="pink", alpha=0.3) 
p3 <- p2 + scale_x_continuous(breaks=c(0,10,20,30,40,100,200),trans="log1p", expand=c(0,0)) 
p4 <- p3 + scale_y_continuous(breaks=c(0,100,1000,10000,100000,1000000), trans="log1p", expand=c(0,0)) 
p4 + labs(x="Number of Floors",y="Counts")

# bar graph istead of histgram
p1 <- ggplot(fl_3,aes(x=fl_3$fl_1.NumFloors,y=fl_3$.count))
p2 <- p1 + geom_bar(stat="identity")
p3 <- p2 + scale_y_log10() + scale_x_continuous(breaks=seq(0,130,by=10))
p4 <- p3 + labs(x= "Number of Floors", y= "Count")
p4 + theme_tufte()

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Removed 2 rows containing missing values (geom_bar).

The distribution of builing with diffrernt floors shows that three is a trend that the higher the buiding the less of the number of buildings.

# condense data
fl_4 <- with(fl_1,condense(bin(YearBuilt,5),bin(NumFloors,5)))

## Summarising with count

fl_20 <- fl_4[which(fl_4$NumFloors <=20 &fl_4$NumFloors >= 10),]
fl_30 <- fl_4[which(fl_4$NumFloors <=30 &fl_4$NumFloors >= 20),]
fl_40 <- fl_4[which(fl_4$NumFloors <=40 &fl_4$NumFloors >= 30),]

# Distribution of data set
fl_5 <- smooth(fl_4,h=c(10,1))
p1 <- ggplot(fl_5,aes(x=YearBuilt,y=NumFloors))
p2 <- p1 + geom_point() + xlim(1660,2020)
p2 + ggtitle("A View of whole data set")

# graph for 20-story buildings
p1 <- ggplot(fl_20,aes(x=YearBuilt,y=.count))
p2 <- p1 + geom_bar(stat="identity") 
p3 <- p2 + xlim(min(fl_20$YearBuilt),2020) 
p4 <- p3 + scale_y_log10()
p5 <- p4 + ggtitle("The Years when 20-story buildings Were Built")
p5 + labs (y="Count") + theme_tufte()

# graph for 20-story buildings
p1 <- ggplot(fl_30,aes(x=YearBuilt,y=.count))
p2 <- p1 + geom_bar(stat="identity") 
p3 <- p2 + xlim(min(fl_30$YearBuilt),2020) 
p4 <- p3 + scale_y_log10()
p5 <- p4 + ggtitle("The Years when 30-story buildings Were Built")
p5 + labs (y="Count") + theme_tufte()

# graph for 20-story buildings
p1 <- ggplot(fl_40,aes(x=YearBuilt,y=.count))
p2 <- p1 + geom_bar(stat="identity") 
p3 <- p2 + xlim(min(fl_40$YearBuilt),2020) 
p4 <- p3 + scale_y_log10()
p5 <- p4 + ggtitle("The Years when 40-story buildings Were Built")
p5 + labs (y="Count") + theme_tufte()

The 20-story buildings were very popular from 1900 through 2017 even during World War II. The 30-story buildings were popular too from 1900 through 2017 but much less built during World War II. The 40-story buildings were popular between 1927 and 1932, 1957 to 1987, and 1997 to 2012.

Questions 3:

Your boss suspects that buildings constructed during the US’s involvement in World War II (1941-1945) are more poorly constructed than those before and after the way due to the high cost of materials during those years. She thinks that, if you calculate assessed value per floor, you will see lower values for buildings at that time vs before or after. Construct a chart/graph to see if she’s right.

Answer:

#remove the data with 0 assessed value and 0 or na number of floors
vpf <- pluto[which(!is.na(pluto$AssessTot) & pluto$AssessTot !=0 & !is.na(pluto$NumFloors) & pluto$NumFloors !=0 & !is.na(pluto$YearBuilt) & pluto$YearBuilt !=0 & pluto$YearBuilt <= 2030) ,]

# condense data
vpf$ValpFloor <- vpf$AssessTot/vpf$NumFloors
vpf_sub <- vpf[,c("YearBuilt","AssessTot","NumFloors","ValpFloor")]
vpf_con <- condense(bin(vpf_sub$YearBuilt,5), z=vpf_sub$ValpFloor,summary=c("sd"))

# graph for average assessed valuse per floor
x_breaks <- c(seq(1650,1935,by=50),1940,1945,seq(1960,2020,by=50))
colr <- ifelse(x_breaks == 1940| x_breaks == 1945,"red", "black")

p1 <- ggplot(vpf_con,aes(x=vpf_sub.YearBuilt,y=.mean))
p2 <- p1 + geom_bar(stat="identity")
p3 <- p2 + scale_x_continuous(breaks=x_breaks) 
p4 <- p3 + coord_flip()
p5 <- p4 + labs(x="Assessed Value Per Floor", y = "Year Built")
p5 + theme(axis.text.y = element_text(colour = colr))

## Warning: Removed 20 rows containing missing values (position_stack).

As shown in the graph, the assessed value per floor of buildings built at World War II (1941-1945) are lower than that of the majortity buildings built before and after that time. This is consistent to what was expected.

Mai_Yun_608_hw2

Yun Mai

September 22, 2017

Module 2

Data

Questions 1:

Answer:

Questions 2:

Answer:

Questions 3:

Answer: