MATH2349 Semester 2, 2018

Required packages

Packages needed during this preprocessing task.

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(outliers)
library(forecast)

Executive Summary

The data is first examined for understanding of the fields and their meanings. Each field is considered and the correct data type is determined. Accordingly, fields are cast in different data types. After this, the data is manipulated to fit a tidy format, whereby each row represents one observation, each column one particular variable and the intersection of these has only one atomic piece of information encoded in it. Then, a field is created out of the geolocation fields latitude and longitude, representing the distance of each tree from the central point of Times Square, Manhattan. Following this, the tidy dataset is scanned for missing values - which are duly imputed or otherwise dealt with. Then, outliers are found and considered in each numerical field. In this case, the logical reason for the outliers was quite clear. Finally, the distance from Times Square field is redefined in terms of its z score standardised values to better represent the main reason behind the creation of the field.

Data

The dataset used in this project is from the 2015 Street Tree Census, which is conducted by volunteers organized by the New York City Department of Parks and Recreation. In this census process, 683,788 trees are inventoried by address, and identified by tree species, diameter, and condition. There are numerous other fields captured in this data, such as the exact geographic location, formal/colloquial name of the tree and category of the observer person (volunteer/NYC Parks & Rec Staff). The dataset can be found at the NYC OpenData 2015 Tree Census website. There were two tables used, one representing the information about the observed trees and another acting as a reference table of detailed information about the particular species of trees. The actual instance of the data was sourced through the Public Datasets section of Google BigQuery, as such some of the full 45 fields recorded were not present.

Each variable, its data type and a brief description is outlined below.

Tree Census Points Table

tree_id: Integer, Unique identification number for each tree point

block_id: Integer, Identifier linking each tree to the block in the blockface table/shapefile that it is mapped on

created_at: Date, The date tree points were collected in the census software

tree_dbh: Integer, Diameter of the tree, measured at approximately 54" / 137cm above the ground. This field only takes a non-zero value if the tree is of the “Alive” or “Dead” statuses, not a stump

stump_diam: Integer, Diameter of stump measured through the center, rounded to the nearest inch. This field only takes a non-zero value if the tree measured is a stump

curb_loc: Character, Location of tree bed in relationship to the curb; trees are either along the curb (OnCurb) or offset from the curb (OffsetFromCurb)

status: Character, Indicates whether the tree is alive, standing dead, or a stump

health: Character, Indicates the user’s perception of tree health

spc_latin: Character, Scientific name for species, e.g. “Acer rubrum”

spc_common: Character, Common name for species, e.g. “red maple”

steward: Character, Indicates the number of unique signs of stewardship observed for this tree. Not recorded for stumps or dead trees

guards: Character, Indicates whether a guard is present, and if the user felt it was a helpful or harmful guard. Not recorded for dead trees and stumps

sidewalk: Character, Indicates whether one of the sidewalk flags immediately adjacent to the tree was damaged, cracked, or lifted. Not recorded for dead trees and stumps

user_type: Character, This field describes the category of user who collected this tree point’s data

problems: Character, A list of all problems observed with this tree. Also represented in categorical format in the next 9 fields

root_stone: Character, Indicates the presence of a root problem caused by paving stones in tree bed

root_grate: Character, Indicates the presence of a root problem caused by metal grates in tree bed

root_other: Character, Indicates the presence of other root problems

trunk_wire: Character, Indicates the presence of a trunk problem caused by wires or rope wrapped around the trunk

trnk_light: Character, Indicates the presence of a trunk problem caused by lighting installed on the tree

trnk_other: Character, Indicates the presence of other trunk problems

brch_light: Character, Indicates the presence of a branch problem caused by lights (usually string lights) or wires in the branches

brch_shoe: Character, Indicates the presence of a branch problem caused by sneakers in the branches

brch_other: Character, Indicates the presence of other branch problems

address: Character, Nearest estimated address to tree

zipcode: Integer, Five-digit zipcode in which tree is located

zip_city: Character, City as derived from zipcode. This is often (but not always) the same as borough

cb_num: Integer, Community board in which tree point is located

borocode: Integer, Code for borough in which tree point is located

boroname: Character, Name of borough in which tree point is located

cncldist: Integer, Council district in which tree point is located

st_assem: Integer, State Assembly District in which tree point is located

st_senate: Integer, State Senate District in which tree point is located

nta: Character, This is the NTA Code corresponding to the neighborhood tabulation area from the 2010 US Census that the tree point falls into

nta_name: Character, This is the NTA name corresponding to the neighborhood tabulation area from the 2010 US Census that the tree point falls into

boro_ct: Integer, This is the boro_ct identifyer for the census tract that the tree point falls into

state: Character, All features given value ‘New York’

latitude: Numerical, Latitude of point, in decimal degrees

longitude: Numerical, Longitude of point, in decimal degrees

x_sp: Numerical, X coordinate, in state plane. Units are feet

y_sp: Numerical, Y coordinate, in state plane. Units are feet

Tree Species Reference Table

species_scientific_name: Character, The scientific latin term for the plant. Corresponds to spc_latin in tree_census table

species_common_name: Character, The commonly used name for the plant. Corresponds to spc_common in tree_census table

form: Character, Describes the rough shape of the tree

growth_rate: Character, General observation of the growth rate of the tree as compared to previous censi

fall_color: Character, The color of the foliage displayed by this tree in the autumn time

environmental_tolerances: Character, The environmental hardships that this tree is known to withstand, such as drought or highly salty soil

location_tolerances: Character, The type of location in which this tree would be suited

notes_suggested_cultivars: Character, The particular strain/subspecie of this tree that is believed most suited for street planting

tree_size: Character, The general size category of this tree at maturity (small/medium/large)

comments: Character, A comment on whether or not this species is susceptible to the Asian Long Horn Beetle, and thus inadvisable to being planted in NYC

The process used to load and merge the two datasets is as below.

tree_census = read_csv("tree_census_2015.csv")

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   tree_id = col_integer(),
##   block_id = col_integer(),
##   created_at = col_date(format = ""),
##   tree_dbh = col_integer(),
##   stump_diam = col_integer(),
##   zipcode = col_integer(),
##   cb_num = col_integer(),
##   borocode = col_integer(),
##   cncldist = col_integer(),
##   st_assem = col_integer(),
##   st_senate = col_integer(),
##   boro_ct = col_integer(),
##   latitude = col_double(),
##   longitude = col_double(),
##   x_sp = col_double(),
##   y_sp = col_double()
## )

## See spec(...) for full column specifications.

species = read_csv("tree_species.csv")

## Parsed with column specification:
## cols(
##   species_scientific_name = col_character(),
##   species_common_name = col_character(),
##   form = col_character(),
##   growth_rate = col_character(),
##   fall_color = col_character(),
##   environmental_tolerances = col_character(),
##   location_tolerances = col_character(),
##   notes_suggested_cultivars = col_character(),
##   tree_size = col_character(),
##   comments = col_character()
## )

head(tree_census)

dim(tree_census)

## [1] 683788     41

head(species)

dim(species)

## [1] 57 10

# Renaming the common fields for ease of joining
colnames(species)[1] = "spc_latin"
colnames(species)[2] = "spc_common"

# Casting all values in join columns to lower case for ease of joining
species$spc_latin = tolower(species$spc_latin)
species$spc_common = tolower(species$spc_common)
tree_census$spc_latin = tolower(tree_census$spc_latin)
tree_census$spc_common = tolower(tree_census$spc_common)

# There are ~31k rows for which the join key is missing. These are 99.99% because the trees are either dead or stumps.
tree_census %>% filter(is.na(spc_common)) %>% group_by(status) %>% summarise(n=n())

# For the above reason, a full join is used to keep the non-joinable fields from the census table in 
# the resulting table
# The spc_common field is used as a join key instead of spc_latin, because the spc_latin field appears to contain multiple slightly different versions
# of the same specie whereas the spc_common field is much more consistent.
tree = full_join(tree_census, species, by = "spc_common")

# As expected, there are 31,619 fields for which there is no join key, as also shown above
tree %>% filter(is.na(spc_common)) %>% group_by(spc_latin.x) %>% summarise(n=n())

Understand

The structure of the recast tree dataframe shows 31 factor fields, 2 integer fields, 1 date field, 4 numerical fields, 2 ordinal factor fields, 11 logical/boolean fields and 1 character field.

The tree_id and block_id fields are cast as factors, as they are not really numerical but categorical. A host of other character fields such as curb_loc, status, health, spc_latin and others are similarly recast as factors. The steward and tree_size fields are cast as ordinal factors, as they represent ordinal levels of measurement.

Although this should normally be done further along the process, some fields’ missing values must be handled at the beginning of this stage as they will be cast as factors during this step.

# The missing values of fields to be factorised must be handled prior to casting, for coding convenience.
# 7 fields are systematically missing for trees whose status is not "Alive". These are marked so.
not_applicable_cols = c("health","spc_latin.x","spc_common","steward","guards","sidewalk","problems")
tree[which(tree$status!="Alive"),not_applicable_cols] = "Not Applicable"

# Some species have been found in the census that do not exist in the reference table. These are marked.
tree[which(is.na(tree$spc_latin.y)),42:48] = "Species not found"

# Similarly, the sizes of these missing species are not available.
tree[which(is.na(tree$tree_size)),"tree_size"] = "Not Applicable"

factor_fields = c("tree_id","block_id","curb_loc","status","health","spc_latin.x","spc_common","steward","guards","sidewalk","user_type","problems","zipcode","zip_city","cb_num","borocode","boroname","cncldist","st_assem","st_senate","nta","nta_name","boro_ct","state","spc_latin.y","form","growth_rate","fall_color","environmental_tolerances","location_tolerances","notes_suggested_cultivars","comments")

tree[factor_fields] = lapply(tree[factor_fields], factor)

boolean_fields = c("root_stone","root_grate","root_other","trunk_wire","trnk_light","trnk_other","brch_light","brch_shoe","brch_other")
tree[boolean_fields] = ifelse(tree[boolean_fields]=="Yes",1,0)
tree[boolean_fields] = lapply(tree[boolean_fields], function(x) as.logical(x))

# The fields 'sidewalk' and 'comments' are actually binary true/false fields, as 'sidewalk' refers to
# whether or not the sidewalk is damaged. The 'comments' field refers to whether or not 
# the species is prone to harboring the Asian Long Horn beetle and thus not recommended for planting. 
# New binary fields are extracted from these.
tree$sidewalk_damage = ifelse(!is.na(tree["sidewalk"]),ifelse(tree["sidewalk"]=="Damage",1,0),0)
tree$beetle = ifelse(!is.na(tree["comments"]),1,0)
tree["sidewalk_damage"] = lapply(tree["sidewalk_damage"], function(x) as.logical(x))
tree["beetle"] = lapply(tree["beetle"], function(x) as.logical(x))

# The fields 'steward' and 'tree_size' should be ordinal, as they represent various gradations of measure.
stew_levels = c("<NA>","None","1or2", "3or4", "4orMore", "Med")
tree$steward = factor(tree$steward, levels=stew_levels, ordered=TRUE)

treesize_levels = c("Not Applicable","Small (Mature Height < 25 ft)","Intermediate (Mature Height 25-35 ft)","Medium (Mature Height 35-50 ft)","Large (Mature Height > 50 ft)")
tree$tree_size = factor(tree$tree_size, levels=treesize_levels, ordered=TRUE)

str(tree)[1]

## Classes 'tbl_df', 'tbl' and 'data.frame':    683810 obs. of  52 variables:
##  $ tree_id                  : Factor w/ 683788 levels "3","4","7","8",..: 98776 41298 98783 84650 518985 518990 518994 518986 518995 518991 ...
##  $ block_id                 : Factor w/ 101390 levels "100002","100003",..: 23757 23803 23757 23762 23730 23730 23730 23730 23730 23730 ...
##  $ created_at               : Date, format: "2015-08-01" "2015-07-01" ...
##  $ tree_dbh                 : int  11 4 5 2 5 5 18 20 8 20 ...
##  $ stump_diam               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ curb_loc                 : Factor w/ 2 levels "OffsetFromCurb",..: 2 2 2 2 2 2 1 1 2 1 ...
##  $ status                   : Factor w/ 3 levels "Alive","Dead",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ health                   : Factor w/ 4 levels "Fair","Good",..: 2 1 2 2 2 2 2 2 2 2 ...
##  $ spc_latin.x              : Factor w/ 133 levels "acer","acer buergerianum",..: 31 28 122 23 58 58 58 58 58 58 ...
##  $ spc_common               : Factor w/ 155 levels "'schubert' chokecherry",..: 32 5 78 124 72 72 72 72 72 72 ...
##  $ steward                  : Ord.factor w/ 6 levels "<NA>"<"None"<..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ guards                   : Factor w/ 5 levels "Harmful","Helpful",..: 2 3 2 2 3 3 3 3 2 3 ...
##  $ sidewalk                 : Factor w/ 3 levels "Damage","NoDamage",..: 1 2 2 2 1 1 2 2 1 1 ...
##  $ user_type                : Factor w/ 3 levels "NYC Parks Staff",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ problems                 : Factor w/ 233 levels "BranchLights",..: 85 47 47 47 49 49 47 62 47 47 ...
##  $ root_stone               : logi  TRUE FALSE FALSE FALSE FALSE FALSE ...
##  $ root_grate               : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ root_other               : logi  FALSE FALSE FALSE FALSE TRUE TRUE ...
##  $ trunk_wire               : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ trnk_light               : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ trnk_other               : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ brch_light               : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ brch_shoe                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ brch_other               : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ address                  : chr  "63 CRANBERRY STREET" "28 OLD FULTON STREET" "45 CRANBERRY STREET" "87 COLUMBIA HEIGHTS" ...
##  $ zipcode                  : Factor w/ 191 levels "83","10001","10002",..: 99 99 99 99 99 99 99 99 99 99 ...
##  $ zip_city                 : Factor w/ 48 levels "Arverne","Astoria",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ cb_num                   : Factor w/ 59 levels "101","102","103",..: 26 26 26 26 26 26 26 26 26 26 ...
##  $ borocode                 : Factor w/ 5 levels "1","2","3","4",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ boroname                 : Factor w/ 5 levels "Bronx","Brooklyn",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ cncldist                 : Factor w/ 51 levels "1","2","3","4",..: 33 33 33 33 33 33 33 33 33 33 ...
##  $ st_assem                 : Factor w/ 65 levels "23","24","25",..: 30 30 30 30 30 30 30 30 30 30 ...
##  $ st_senate                : Factor w/ 26 levels "10","11","12",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ nta                      : Factor w/ 188 levels "BK09","BK17",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ nta_name                 : Factor w/ 188 levels "Allerton-Pelham Gardens",..: 22 22 22 22 22 22 22 22 22 22 ...
##  $ boro_ct                  : Factor w/ 2152 levels "1000201","1000202",..: 620 620 620 620 620 620 620 620 620 620 ...
##  $ state                    : Factor w/ 1 level "New York": 1 1 1 1 1 1 1 1 1 1 ...
##  $ latitude                 : num  40.7 40.7 40.7 40.7 40.7 ...
##  $ longitude                : num  -74 -74 -74 -74 -74 ...
##  $ x_sp                     : num  986271 986037 986078 985505 985455 ...
##  $ y_sp                     : num  194161 195195 194231 194307 194242 ...
##  $ spc_latin.y              : Factor w/ 58 levels "acer campestre",..: 48 7 50 5 20 20 20 20 20 20 ...
##  $ form                     : Factor w/ 5 levels "Pyramidal","Rounded",..: 3 2 2 2 2 2 2 2 2 2 ...
##  $ growth_rate              : Factor w/ 3 levels "Medium","Slow",..: 3 2 1 2 1 1 1 1 1 1 ...
##  $ fall_color               : Factor w/ 14 levels "Cream","Maroon",..: 10 14 11 7 11 11 11 11 11 11 ...
##  $ environmental_tolerances : Factor w/ 31 levels "Drought and High pH Tolerant",..: 22 21 13 25 30 30 30 30 30 30 ...
##  $ location_tolerances      : Factor w/ 9 levels "Median Tree",..: 9 7 4 7 1 1 1 1 1 1 ...
##  $ notes_suggested_cultivars: Factor w/ 46 levels "'Accolade' is Semi–double Flowering;",..: 41 37 12 5 10 10 10 10 10 10 ...
##  $ tree_size                : Ord.factor w/ 5 levels "Not Applicable"<..: 1 2 2 2 5 5 5 5 5 5 ...
##  $ comments                 : Factor w/ 1 level "Asian Long Horn Beetle Quarantine Species – Planting NOT RECOMMENDED in Brooklyn, Manhattan, Queens, and Staten Island": NA NA NA NA NA NA NA NA NA NA ...
##  $ sidewalk_damage          : logi  TRUE FALSE FALSE FALSE TRUE TRUE ...
##  $ beetle                   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

## NULL

Tidy & Manipulate Data I

According to the tidy data principle of atomicity, the ‘environmental_tolerances’, ‘location_tolerances’ and ‘fall_color’ fields should be split out into their constituent contents such that only one logical piece of information is encoded in each field. Thus, from these three fields a number of binary fields are extracted to represent all unique factors individually and their combinations when aggregated. The missing values are handled later on in the analysis.

The environment tolerances field is split into 8 binary fields, the location tolerances field into 4 fields and the fall colors field into 8 fields.

After this processing step, the data is in tidy format.

# Finding the unique values present in this column, which will be split out into other columns
unique(tree$environmental_tolerances)

##  [1] Species not found                                                  
##  [2] Shade Tolerant                                                     
##  [3] Salt, Drought, and Shade Tolerant                                  
##  [4] Wet Site and Shade Tolerant                                        
##  [5] Wet Site, Salt, Drought, High Wind, Pollution, and High pH Tolerant
##  [6] Salt, Drought, High Wind, Pollution and High pH Tolerant           
##  [7] Pollution Tolerant                                                 
##  [8] Drought, High Wind, and Pollution Tolerant                         
##  [9] Salt, Drought, Pollution, and High pH Tolerant                     
## [10] Salt and Shade Tolerant                                            
## [11] Shade and High pH Tolerant                                         
## [12] Drought, High Wind, Pollution, and High pH Tolerant                
## [13] Salt Tolerant                                                      
## [14] Wet Site Tolerant                                                  
## [15] Salt, Drought, Pollution and High pH Tolerant                      
## [16] Salt and High pH Tolerant                                          
## [17] Salt, Drought, Pollution and Shade Tolerant                        
## [18] Drought and High pH Tolerant                                       
## [19] Wet Site and Drought Tolerant                                      
## [20] Drought and Pollution Tolerant                                     
## [21] Salt and Drought Tolerant                                          
## [22] High pH Tolerant                                                   
## [23] Wet Site, Salt, Drought, Shade and High pH Tolerant                
## [24] Salt, Shade and High pH Tolerant                                   
## [25] Wet Site, Drought, High pH Tolerant                                
## [26] none                                                               
## [27] Drought Tolerant                                                   
## [28] Wet Site, Salt, Drought, High Wind, Pollution and High pH Tolerant 
## [29] Wet Site and Salt Tolerant                                         
## [30] Salt, Drought, Pollution, High pH Tolerant                         
## [31] Wet Site, Salt and High Wind Tolerant                              
## 31 Levels: Drought and High pH Tolerant ... Wet Site, Salt, Drought, Shade and High pH Tolerant

# Creating new columns based on regex match of the environmental_tolerances column
tree$env_shade = grepl("Shade",tree$environmental_tolerances, ignore.case = TRUE)
tree$env_salt = grepl("Salt",tree$environmental_tolerances, ignore.case = TRUE)
tree$env_drought = grepl("Drought",tree$environmental_tolerances, ignore.case = TRUE)
tree$env_wet = grepl("Wet Site",tree$environmental_tolerances, ignore.case = TRUE)
tree$env_wind = grepl("High Wind",tree$environmental_tolerances, ignore.case = TRUE)
tree$env_pollution = grepl("Pollution",tree$environmental_tolerances, ignore.case = TRUE)
tree$env_ph = grepl("High pH",tree$environmental_tolerances, ignore.case = TRUE)
tree$env_none = grepl("none",tree$environmental_tolerances, ignore.case = TRUE)

# head of the new columns created
head(tree[,52:59])

# Finding the unique values present in this column, which will be split out into other columns
unique(tree$location_tolerances)

## [1] Species not found                                          
## [2] Small Tree Pit (<3 ft)                                     
## [3] Median Tree, Small Tree Pit (<3 ft), Narrow Growing Space  
## [4] Median Tree                                                
## [5] Median Tree, Narrow Growing Space                          
## [6] none                                                       
## [7] Median Tree, Small Tree Pit(<3ft), and Narrow Growing Space
## [8] Small Tree Pit (<3 ft), Narrow Growing Space               
## [9] Median Tree, Small Tree Pit (<3 ft)                        
## 9 Levels: Median Tree ... Species not found

# Creating new columns based on regex match of the location_tolerances column
tree$loc_small = grepl("Small Tree Pit",tree$location_tolerances, ignore.case = TRUE)
tree$loc_median = grepl("Median Tree",tree$location_tolerances, ignore.case = TRUE)
tree$loc_narrow = grepl("Narrow Growing Space",tree$location_tolerances, ignore.case = TRUE)
tree$loc_none = grepl("none",tree$location_tolerances, ignore.case = TRUE)

# head of the new columns created
head(tree[,60:63])

# The suggested cultivars field contain 46 unique values, each of which is unique to the specific tree
# hence, this field is not binarised
length(unique(tree$notes_suggested_cultivars))

## [1] 46

# The fall_color field contains color combinations of 8 individual colors, which may be useful 
# when split into their component colors. This field is binarised into 8 extra columns.
unique(tree$fall_color)

##  [1] Species not found Yellow/Orange     Yellow           
##  [4] Red/ Yellow       Maroon            Red/Bronze       
##  [7] Red               Purple/Maroon     Orange/Brown     
## [10] Yellow/Bronze     Red/Orange        Red/ yellow      
## [13] Cream             Yellow/ Bronze   
## 14 Levels: Cream Maroon Orange/Brown Purple/Maroon Red ... Yellow/Orange

# Creating binarised fields
tree$col_yellow = grepl("Yellow",tree$fall_color, ignore.case = TRUE)
tree$col_orange = grepl("Orange",tree$fall_color, ignore.case = TRUE)
tree$col_red = grepl("Red",tree$fall_color, ignore.case = TRUE)
tree$col_maroon = grepl("Maroon",tree$fall_color, ignore.case = TRUE)
tree$col_bronze = grepl("Bronze",tree$fall_color, ignore.case = TRUE)
tree$col_purple = grepl("Purple",tree$fall_color, ignore.case = TRUE)
tree$col_brown = grepl("Brown",tree$fall_color, ignore.case = TRUE)
tree$col_cream = grepl("Cream",tree$fall_color, ignore.case = TRUE)

# head of the new columns created
head(tree[,64:71])

Tidy & Manipulate Data II

Just for fun, a new field is created to find the straight-line distance from the tree to Times Square, Manhattan, in meters. Through some experimentation and research, it is found that each 0.0001 unit change in latitude corresponds to 11 meters in distance. Each 0.0001 unit change in longitude corresponds to 8 meters in distance. The latitude and longitude fields of the census data are fed into a custom function which calculates the euclidean distance between that point and the lat/long point of Times Square, which is 40.758896, -73.985130

dist_from_times_square = function(x) { 
  distance = round(sqrt((abs(40.758896-x[1])/0.0001*11)^2+(abs(-73.985130-x[2])/0.0001*8)^2),0)
  return(distance)
  }

tree$times_square = dist_from_times_square(tree[,38:39])$latitude

# As expected, the closer the borough to Manhattan, the smaller the mean distance from Times Square
tree %>% group_by(boroname) %>% summarise(mean_distance = mean(times_square, na.rm = TRUE))

Scan I

There are several fields with missing values, which are dealt with as described below.

# It can be seen that due to the full join, the 22 rows that are missing for most columns are because
# that particular species of tree was not found in the census data but kept in the joined dataframe.
# These rows are deleted.
tree = tree[-which(is.na(tree$boroname)),]
tree[which(is.na(tree$health)),"health"] = "Not Applicable"
tree[which(is.na(tree$guards)),"guards"] = "Not Applicable"
tree[which(is.na(tree$sidewalk)),"sidewalk"] = "Not Applicable"
tree[which(is.na(tree$spc_latin.x)),c("spc_latin.x","spc_common")] = "Not Applicable"
tree[which(is.na(tree$problems)),"problems"] = "Not Applicable"
tree[which(is.na(tree$steward)),"steward"] = "<NA>"

# The 'comments' field is the only one left with missing values, which have already been handled
# via the creation of the binary 'beetle' field. This field is dropped
tree = tree[,names(tree)!="comments"]

colSums(is.na(tree))

##                   tree_id                  block_id 
##                         0                         0 
##                created_at                  tree_dbh 
##                         0                         0 
##                stump_diam                  curb_loc 
##                         0                         0 
##                    status                    health 
##                         0                         0 
##               spc_latin.x                spc_common 
##                         0                         0 
##                   steward                    guards 
##                         0                         0 
##                  sidewalk                 user_type 
##                         0                         0 
##                  problems                root_stone 
##                         0                         0 
##                root_grate                root_other 
##                         0                         0 
##                trunk_wire                trnk_light 
##                         0                         0 
##                trnk_other                brch_light 
##                         0                         0 
##                 brch_shoe                brch_other 
##                         0                         0 
##                   address                   zipcode 
##                         0                         0 
##                  zip_city                    cb_num 
##                         0                         0 
##                  borocode                  boroname 
##                         0                         0 
##                  cncldist                  st_assem 
##                         0                         0 
##                 st_senate                       nta 
##                         0                         0 
##                  nta_name                   boro_ct 
##                         0                         0 
##                     state                  latitude 
##                         0                         0 
##                 longitude                      x_sp 
##                         0                         0 
##                      y_sp               spc_latin.y 
##                         0                         0 
##                      form               growth_rate 
##                         0                         0 
##                fall_color  environmental_tolerances 
##                         0                         0 
##       location_tolerances notes_suggested_cultivars 
##                         0                         0 
##                 tree_size           sidewalk_damage 
##                         0                         0 
##                    beetle                 env_shade 
##                         0                         0 
##                  env_salt               env_drought 
##                         0                         0 
##                   env_wet                  env_wind 
##                         0                         0 
##             env_pollution                    env_ph 
##                         0                         0 
##                  env_none                 loc_small 
##                         0                         0 
##                loc_median                loc_narrow 
##                         0                         0 
##                  loc_none                col_yellow 
##                         0                         0 
##                col_orange                   col_red 
##                         0                         0 
##                col_maroon                col_bronze 
##                         0                         0 
##                col_purple                 col_brown 
##                         0                         0 
##                 col_cream              times_square 
##                         0                         0

paste("There are now",sum(colSums(is.na(tree))),"missing values in total in the dataset")

## [1] "There are now 0 missing values in total in the dataset"

Scan II

There are 7 numerical fields which are scanned for outlier values.

The boxplots of outliers appear to show several variables with large numbers of outliers. However, a row will only ever have a non-zero value for either tree diameter or stump diameter as these are mutually exclusive categories of tree. The measurement taken will either be for a tree or for a stump. While these may be handled in various ways, none of them seem optimal. If the two fields were combined, this would remove one piece of information about the tree’s status (albeit it is encoded elswehere) and would also erroneously combine two different processes/populations into one. Stumps and living trees may have drastically different diameters for valid reasons. The multitude of zero values could also be set as NA’s, but this may hinder future analysis. For these reasons, the outliers will not be processed for these fields.

Similarly for the “horizontal” geolocational fields - longitude and x_sp - the outliers are likely because of the differing tree densities across the east-west plane of the city and not necessarily because of error. As the tables below show, all longitude outliers were in the borough of Staten Island and most of the times_square outliers. There is no such outlier effect in north-south geolocational fields, which also lends some credence to this hypothesis.

# tree_dbh
tree$tree_dbh %>%  boxplot(main="Box Plot of Tree Diameter", ylab="Inches", col = "grey")

# The tree diameter and stump diameter are mutually exclusive
tree %>% group_by(status) %>% summarise(stump = mean(stump_diam, na.rm = TRUE), tree = mean(tree_dbh, na.rm = TRUE))

# stump_diam
tree$stump_diam %>%  boxplot(main="Box Plot of Stump Diameter", ylab="Inches", col = "grey")

# latitude
tree$latitude %>%  boxplot(main="Box Plot of Latitude", ylab="Value", col = "grey")

# longitude
tree$longitude %>%  boxplot(main="Box Plot of Longitude", ylab="Value", col = "grey")

tree %>% group_by(boroname) %>% summarise(trees=n()/nrow(tree), lat_mean = mean(latitude), long_mean = mean(longitude))

# all outlier trees are in the borough of Staten Island
z.scores = tree$longitude %>%  scores(type = "z")
tree[which(abs(z.scores)>2),] %>% group_by(boroname) %>% summarise(n = n())

# x_sp
tree$x_sp %>%  boxplot(main="Box Plot of X Coordinate, State Plane", ylab="Feet", col = "grey")

# y_sp
tree$y_sp %>%  boxplot(main="Box Plot of Y Coordinate, State Plane", ylab="Feet", col = "grey")

# times_square
tree$times_square %>%  boxplot(main="Box Plot of Distance to Times Square", ylab="Meters", col = "grey")

# The majority of outliers are again on Staten Island. Manhattan only had 9% of trees and very sparsely
# located. The Manhattan outliers are likely lone trees that are very close to Times Square.
z.scores = tree$times_square %>%  scores(type = "z")
tree[which(abs(z.scores)>2),] %>% group_by(boroname) %>% summarise(n = n())

Transform

The times_square field is currently just a bunch of large numbers, denoting the distance in meters from Times Square. This field may be more useful if it instead represented the scale of how far it was relative to the population of other trees. In order to do this, the field is z score standardised such that the mean point is zero and the standard deviation is one.

tree$times_square = scale(tree$times_square, center = TRUE, scale = TRUE)
head(tree[,c("boroname","times_square")])