Data Preparation

# load data
library(jsonlite)
library(knitr)
library(curl)
library(XML)
library(xml2)
library(ggplot2)
url <- curl(url = "https://data.cityofnewyork.us/resource/uvpi-gqnh.json")
tree_df <- fromJSON(url)
head(tree_df)


## I'm still working on retrieving this data from the web (Zillow API)
##url2 <- curl(url = "http://www.zillow.com/webservice/GetRegionChildren.htm?zws-id=<X1-ZWz17zfucflgcr_8jmj2>&state=ny&city=newyorkcity&childtype=neighborhood")
##redf <- xmlParse(url2)
##head(redf)

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Is tree health associated with neighborhood rent prices in New York City?

Cases

What are the cases, and how many are there?

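Each case is an individual street tree. The full census contains 683,788 cases; the summary output in this document shows that the request used here loaded only 1,000 of them.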
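One way to retrieve more than the first page of rows is to page through the endpoint. The sketch below assumes the endpoint accepts the Socrata (SODA) query parameters `$limit`, `$offset`, and `$order`; `page_url()` and `fetch_all()` are hypothetical helper names, and the page size of 50,000 is an arbitrary choice, not a value from the data portal.

```r
# Endpoint used in the data-loading step of this document.
base_url <- "https://data.cityofnewyork.us/resource/uvpi-gqnh.json"

# Build the URL for one page of results.
# Assumption: the API honors $limit, $offset and $order; ordering by
# tree_id keeps the pages consistent from one request to the next.
page_url <- function(base, limit, offset) {
  paste0(base,
         "?$limit=", format(limit, scientific = FALSE),
         "&$offset=", format(offset, scientific = FALSE),
         "&$order=tree_id")
}

# Request pages until one comes back short or empty, then stack them.
# Nothing is downloaded until fetch_all() is actually called.
fetch_all <- function(base, page_size = 50000, max_rows = Inf) {
  pages <- list()
  offset <- 0
  repeat {
    page <- jsonlite::fromJSON(page_url(base, page_size, offset))
    if (length(page) == 0 || nrow(page) == 0) break
    pages[[length(pages) + 1]] <- page
    offset <- offset + nrow(page)
    if (nrow(page) < page_size || offset >= max_rows) break
  }
  jsonlite::rbind_pages(pages)
}

# Example: URL for the second page of 1,000 rows.
page_url(base_url, 1000, 1000)
```

Calling `fetch_all(base_url, max_rows = 5000)` first is a cheap way to check that paging works before requesting the whole census.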

Data collection

Describe the method of data collection.

The data come from the 2015 Street Tree Census, in which people went around NYC and recorded information about every street tree.

Type of study

What type of study is this (observational/experiment)?

This is an observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

https://data.cityofnewyork.us/Environment/2015-Street-Tree-Census-Tree-Data/uvpi-gqnh

Dependent Variable

What is the response variable? Is it quantitative or qualitative?

The response variable is the perceived health of the tree (Good, Fair, or Poor), which is qualitative.

Independent Variable

You should have two independent variables, one quantitative and one qualitative.

There are a few explanatory variables. One is the property price (average rent) of the neighborhood, which is quantitative. Another is how invested people are in taking care of the trees. People’s perception of tree health could also differ, which may introduce inaccuracies in the data.

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

summary(tree_df)
##    address              bbl                bin           
##  Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    block_id           boro_ct            borocode        
##  Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    boroname          brch_light         brch_other       
##  Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##   brch_shoe            cb_num          census_tract      
##  Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    cncldist         council_district    created_at       
##  Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    curb_loc            guards             health         
##  Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    latitude          longitude             nta           
##  Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    nta_name           problems          root_grate       
##  Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##   root_other         root_stone          sidewalk        
##  Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##   spc_common         spc_latin           st_assem        
##  Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##   st_senate            state              status         
##  Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    steward           stump_diam          tree_dbh        
##  Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    tree_id           trnk_light         trnk_other       
##  Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##   trunk_wire         user_type             x_sp          
##  Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##      y_sp             zip_city           zipcode         
##  Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character
tree_df <- na.omit(tree_df)
plot(factor(tree_df$health), xlab = "tree health")

plot(factor(tree_df$boroname), xlab = "borough")

table(tree_df$boroname, tree_df$health)
##                
##                 Fair Good Poor
##   Bronx           16   64    6
##   Brooklyn        39  253   17
##   Manhattan       32  159    5
##   Queens          74  191   25
##   Staten Island   19   60    0
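Because the boroughs contribute different numbers of trees, the raw counts are hard to compare directly. A minimal sketch of converting them to within-borough shares, with the counts typed in from the table above (the names `counts` and `health_share` are illustrative):

```r
# Counts copied from the borough-by-health table above.
counts <- matrix(
  c(16,  64,  6,
    39, 253, 17,
    32, 159,  5,
    74, 191, 25,
    19,  60,  0),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    borough = c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"),
    health  = c("Fair", "Good", "Poor")
  )
)

# Share of each health rating within each borough (each row sums to 1).
health_share <- prop.table(counts, margin = 1)
round(health_share, 3)
```

On the data frame itself, the same shares should come from `prop.table(table(tree_df$boroname, tree_df$health), margin = 1)`.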

Explanatory: What are the explanatory variables, and what type are they (numerical/categorical)?

There are a few explanatory variables. One could be the property price of the neighborhood (numerical); another could be how invested people are in taking care of the trees. People’s perception of tree health could also differ, which may introduce inaccuracies in the data.

I want to graph the data in a few ways. I want to plot the trees on a map, so we can see the density of the trees. I also want to reshape the data so that each row is a neighborhood and the variables are average property value, average tree diameter, number of trees, and average tree health. I still need to find a database with reliable information about the cost of property by neighborhood.
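The neighborhood-level reshape could look roughly like the sketch below. It runs on a small made-up data frame (`toy_trees`) that mimics the character columns of `tree_df`. The helper name `summarise_by_neighborhood()` and the health scoring (Poor = 1, Fair = 2, Good = 3) are assumptions made for illustration. Property value is left out because no rent data has been found yet.

```r
# Toy stand-in for tree_df (hypothetical rows); columns are character,
# as in the summary output of the real data.
toy_trees <- data.frame(
  nta_name = c("Astoria", "Astoria", "Park Slope", "Park Slope", "Park Slope"),
  tree_dbh = c("4", "10", "6", "8", "10"),
  health   = c("Good", "Fair", "Good", "Good", "Poor"),
  stringsAsFactors = FALSE
)

summarise_by_neighborhood <- function(df) {
  # Convert diameter from character to numeric before averaging.
  df$tree_dbh <- as.numeric(df$tree_dbh)
  # Assumed scoring: Poor = 1, Fair = 2, Good = 3.
  df$health_score <- match(df$health, c("Poor", "Fair", "Good"))
  # One summary row per neighborhood.
  rows <- lapply(split(df, df$nta_name), function(g) {
    data.frame(
      nta_name         = g$nta_name[1],
      n_trees          = nrow(g),
      avg_tree_dbh     = mean(g$tree_dbh),
      avg_health_score = mean(g$health_score),
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

neighborhood_df <- summarise_by_neighborhood(toy_trees)
neighborhood_df
```

Once a rent or property-value table keyed by neighborhood is available, it could be attached with `merge()` on the neighborhood column.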