Packages needed during this preprocessing task.
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(outliers)
library(forecast)
The data is first examined for understanding of the fields and their meanings. Each field is considered and the correct data type is determined. Accordingly, fields are cast in different data types. After this, the data is manipulated to fit a tidy format, whereby each row represents one observation, each column one particular variable and the intersection of these has only one atomic piece of information encoded in it. Then, a field is created out of the geolocation fields latitude and longitude, representing the distance of each tree from the central point of Times Square, Manhattan. Following this, the tidy dataset is scanned for missing values - which are duly imputed or otherwise dealt with. Then, outliers are found and considered in each numerical field. In this case, the logical reason for the outliers was quite clear. Finally, the distance from Times Square field is redefined in terms of its z score standardised values to better represent the main reason behind the creation of the field.
The dataset used in this project is from the 2015 Street Tree Census, which is conducted by volunteers organized by the New York City Department of Parks and Recreation. In this census process, 683,788 trees are inventoried by address, and identified by tree species, diameter, and condition. There are numerous other fields captured in this data, such as the exact geographic location, formal/colloquial name of the tree and category of the observer person (volunteer/NYC Parks & Rec Staff). The dataset can be found at the NYC OpenData 2015 Tree Census website. There were two tables used, one representing the information about the observed trees and another acting as a reference table of detailed information about the particular species of trees. The actual instance of the data was sourced through the Public Datasets section of Google BigQuery, as such some of the full 45 fields recorded were not present.
Each variable, its data type and a brief description is outlined below.
Tree Census Points Table
tree_id: Integer, Unique identification number for each tree point
block_id: Integer, Identifier linking each tree to the block in the blockface table/shapefile that it is mapped on
created_at: Date, The date tree points were collected in the census software
tree_dbh: Integer, Diameter of the tree, measured at approximately 54" / 137cm above the ground. This field only takes a non-zero value if the tree is of the “Alive” or “Dead” statuses, not a stump
stump_diam: Integer, Diameter of stump measured through the center, rounded to the nearest inch. This field only takes a non-zero value if the tree measured is a stump
curb_loc: Character, Location of tree bed in relationship to the curb; trees are either along the curb (OnCurb) or offset from the curb (OffsetFromCurb)
status: Character, Indicates whether the tree is alive, standing dead, or a stump
health: Character, Indicates the user’s perception of tree health
spc_latin: Character, Scientific name for species, e.g. “Acer rubrum”
spc_common: Character, Common name for species, e.g. “red maple”
steward: Character, Indicates the number of unique signs of stewardship observed for this tree. Not recorded for stumps or dead trees
guards: Character, Indicates whether a guard is present, and if the user felt it was a helpful or harmful guard. Not recorded for dead trees and stumps
sidewalk: Character, Indicates whether one of the sidewalk flags immediately adjacent to the tree was damaged, cracked, or lifted. Not recorded for dead trees and stumps
user_type: Character, This field describes the category of user who collected this tree point’s data
problems: Character, A list of all problems observed with this tree. Also represented in categorical format in the next 9 fields
root_stone: Character, Indicates the presence of a root problem caused by paving stones in tree bed
root_grate: Character, Indicates the presence of a root problem caused by metal grates in tree bed
root_other: Character, Indicates the presence of other root problems
trunk_wire: Character, Indicates the presence of a trunk problem caused by wires or rope wrapped around the trunk
trnk_light: Character, Indicates the presence of a trunk problem caused by lighting installed on the tree
trnk_other: Character, Indicates the presence of other trunk problems
brch_light: Character, Indicates the presence of a branch problem caused by lights (usually string lights) or wires in the branches
brch_shoe: Character, Indicates the presence of a branch problem caused by sneakers in the branches
brch_other: Character, Indicates the presence of other branch problems
address: Character, Nearest estimated address to tree
zipcode: Integer, Five-digit zipcode in which tree is located
zip_city: Character, City as derived from zipcode. This is often (but not always) the same as borough
cb_num: Integer, Community board in which tree point is located
borocode: Integer, Code for borough in which tree point is located
boroname: Character, Name of borough in which tree point is located
cncldist: Integer, Council district in which tree point is located
st_assem: Integer, State Assembly District in which tree point is located
st_senate: Integer, State Senate District in which tree point is located
nta: Character, This is the NTA Code corresponding to the neighborhood tabulation area from the 2010 US Census that the tree point falls into
nta_name: Character, This is the NTA name corresponding to the neighborhood tabulation area from the 2010 US Census that the tree point falls into
boro_ct: Integer, This is the boro_ct identifyer for the census tract that the tree point falls into
state: Character, All features given value ‘New York’
latitude: Numerical, Latitude of point, in decimal degrees
longitude: Numerical, Longitude of point, in decimal degrees
x_sp: Numerical, X coordinate, in state plane. Units are feet
y_sp: Numerical, Y coordinate, in state plane. Units are feet
Tree Species Reference Table
species_scientific_name: Character, The scientific latin term for the plant. Corresponds to spc_latin in tree_census table
species_common_name: Character, The commonly used name for the plant. Corresponds to spc_common in tree_census table
form: Character, Describes the rough shape of the tree
growth_rate: Character, General observation of the growth rate of the tree as compared to previous censi
fall_color: Character, The color of the foliage displayed by this tree in the autumn time
environmental_tolerances: Character, The environmental hardships that this tree is known to withstand, such as drought or highly salty soil
location_tolerances: Character, The type of location in which this tree would be suited
notes_suggested_cultivars: Character, The particular strain/subspecie of this tree that is believed most suited for street planting
tree_size: Character, The general size category of this tree at maturity (small/medium/large)
comments: Character, A comment on whether or not this species is susceptible to the Asian Long Horn Beetle, and thus inadvisable to being planted in NYC
The process used to load and merge the two datasets is as below.
tree_census = read_csv("tree_census_2015.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## tree_id = col_integer(),
## block_id = col_integer(),
## created_at = col_date(format = ""),
## tree_dbh = col_integer(),
## stump_diam = col_integer(),
## zipcode = col_integer(),
## cb_num = col_integer(),
## borocode = col_integer(),
## cncldist = col_integer(),
## st_assem = col_integer(),
## st_senate = col_integer(),
## boro_ct = col_integer(),
## latitude = col_double(),
## longitude = col_double(),
## x_sp = col_double(),
## y_sp = col_double()
## )
## See spec(...) for full column specifications.
species = read_csv("tree_species.csv")
## Parsed with column specification:
## cols(
## species_scientific_name = col_character(),
## species_common_name = col_character(),
## form = col_character(),
## growth_rate = col_character(),
## fall_color = col_character(),
## environmental_tolerances = col_character(),
## location_tolerances = col_character(),
## notes_suggested_cultivars = col_character(),
## tree_size = col_character(),
## comments = col_character()
## )
head(tree_census)
dim(tree_census)
## [1] 683788 41
head(species)
dim(species)
## [1] 57 10
# Renaming the common fields for ease of joining
colnames(species)[1] = "spc_latin"
colnames(species)[2] = "spc_common"
# Casting all values in join columns to lower case for ease of joining
species$spc_latin = tolower(species$spc_latin)
species$spc_common = tolower(species$spc_common)
tree_census$spc_latin = tolower(tree_census$spc_latin)
tree_census$spc_common = tolower(tree_census$spc_common)
# There are ~31k rows for which the join key is missing. These are 99.99% because the trees are either dead or stumps.
tree_census %>% filter(is.na(spc_common)) %>% group_by(status) %>% summarise(n=n())
# For the above reason, a full join is used to keep the non-joinable fields from the census table in
# the resulting table
# The spc_common field is used as a join key instead of spc_latin, because the spc_latin field appears to contain multiple slightly different versions
# of the same specie whereas the spc_common field is much more consistent.
tree = full_join(tree_census, species, by = "spc_common")
# As expected, there are 31,619 fields for which there is no join key, as also shown above
tree %>% filter(is.na(spc_common)) %>% group_by(spc_latin.x) %>% summarise(n=n())
The structure of the recast tree dataframe shows 31 factor fields, 2 integer fields, 1 date field, 4 numerical fields, 2 ordinal factor fields, 11 logical/boolean fields and 1 character field.
The tree_id and block_id fields are cast as factors, as they are not really numerical but categorical. A host of other character fields such as curb_loc, status, health, spc_latin and others are similarly recast as factors. The steward and tree_size fields are cast as ordinal factors, as they represent ordinal levels of measurement.
Although this should normally be done further along the process, some fields’ missing values must be handled at the beginning of this stage as they will be cast as factors during this step.
# The missing values of fields to be factorised must be handled prior to casting, for coding convenience.
# 7 fields are systematically missing for trees whose status is not "Alive". These are marked so.
not_applicable_cols = c("health","spc_latin.x","spc_common","steward","guards","sidewalk","problems")
tree[which(tree$status!="Alive"),not_applicable_cols] = "Not Applicable"
# Some species have been found in the census that do not exist in the reference table. These are marked.
tree[which(is.na(tree$spc_latin.y)),42:48] = "Species not found"
# Similarly, the sizes of these missing species are not available.
tree[which(is.na(tree$tree_size)),"tree_size"] = "Not Applicable"
factor_fields = c("tree_id","block_id","curb_loc","status","health","spc_latin.x","spc_common","steward","guards","sidewalk","user_type","problems","zipcode","zip_city","cb_num","borocode","boroname","cncldist","st_assem","st_senate","nta","nta_name","boro_ct","state","spc_latin.y","form","growth_rate","fall_color","environmental_tolerances","location_tolerances","notes_suggested_cultivars","comments")
tree[factor_fields] = lapply(tree[factor_fields], factor)
boolean_fields = c("root_stone","root_grate","root_other","trunk_wire","trnk_light","trnk_other","brch_light","brch_shoe","brch_other")
tree[boolean_fields] = ifelse(tree[boolean_fields]=="Yes",1,0)
tree[boolean_fields] = lapply(tree[boolean_fields], function(x) as.logical(x))
# The fields 'sidewalk' and 'comments' are actually binary true/false fields, as 'sidewalk' refers to
# whether or not the sidewalk is damaged. The 'comments' field refers to whether or not
# the species is prone to harboring the Asian Long Horn beetle and thus not recommended for planting.
# New binary fields are extracted from these.
tree$sidewalk_damage = ifelse(!is.na(tree["sidewalk"]),ifelse(tree["sidewalk"]=="Damage",1,0),0)
tree$beetle = ifelse(!is.na(tree["comments"]),1,0)
tree["sidewalk_damage"] = lapply(tree["sidewalk_damage"], function(x) as.logical(x))
tree["beetle"] = lapply(tree["beetle"], function(x) as.logical(x))
# The fields 'steward' and 'tree_size' should be ordinal, as they represent various gradations of measure.
stew_levels = c("<NA>","None","1or2", "3or4", "4orMore", "Med")
tree$steward = factor(tree$steward, levels=stew_levels, ordered=TRUE)
treesize_levels = c("Not Applicable","Small (Mature Height < 25 ft)","Intermediate (Mature Height 25-35 ft)","Medium (Mature Height 35-50 ft)","Large (Mature Height > 50 ft)")
tree$tree_size = factor(tree$tree_size, levels=treesize_levels, ordered=TRUE)
str(tree)[1]
## Classes 'tbl_df', 'tbl' and 'data.frame': 683810 obs. of 52 variables:
## $ tree_id : Factor w/ 683788 levels "3","4","7","8",..: 98776 41298 98783 84650 518985 518990 518994 518986 518995 518991 ...
## $ block_id : Factor w/ 101390 levels "100002","100003",..: 23757 23803 23757 23762 23730 23730 23730 23730 23730 23730 ...
## $ created_at : Date, format: "2015-08-01" "2015-07-01" ...
## $ tree_dbh : int 11 4 5 2 5 5 18 20 8 20 ...
## $ stump_diam : int 0 0 0 0 0 0 0 0 0 0 ...
## $ curb_loc : Factor w/ 2 levels "OffsetFromCurb",..: 2 2 2 2 2 2 1 1 2 1 ...
## $ status : Factor w/ 3 levels "Alive","Dead",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ health : Factor w/ 4 levels "Fair","Good",..: 2 1 2 2 2 2 2 2 2 2 ...
## $ spc_latin.x : Factor w/ 133 levels "acer","acer buergerianum",..: 31 28 122 23 58 58 58 58 58 58 ...
## $ spc_common : Factor w/ 155 levels "'schubert' chokecherry",..: 32 5 78 124 72 72 72 72 72 72 ...
## $ steward : Ord.factor w/ 6 levels "<NA>"<"None"<..: 3 3 3 3 3 3 3 3 3 3 ...
## $ guards : Factor w/ 5 levels "Harmful","Helpful",..: 2 3 2 2 3 3 3 3 2 3 ...
## $ sidewalk : Factor w/ 3 levels "Damage","NoDamage",..: 1 2 2 2 1 1 2 2 1 1 ...
## $ user_type : Factor w/ 3 levels "NYC Parks Staff",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ problems : Factor w/ 233 levels "BranchLights",..: 85 47 47 47 49 49 47 62 47 47 ...
## $ root_stone : logi TRUE FALSE FALSE FALSE FALSE FALSE ...
## $ root_grate : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ root_other : logi FALSE FALSE FALSE FALSE TRUE TRUE ...
## $ trunk_wire : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ trnk_light : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ trnk_other : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ brch_light : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ brch_shoe : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ brch_other : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ address : chr "63 CRANBERRY STREET" "28 OLD FULTON STREET" "45 CRANBERRY STREET" "87 COLUMBIA HEIGHTS" ...
## $ zipcode : Factor w/ 191 levels "83","10001","10002",..: 99 99 99 99 99 99 99 99 99 99 ...
## $ zip_city : Factor w/ 48 levels "Arverne","Astoria",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ cb_num : Factor w/ 59 levels "101","102","103",..: 26 26 26 26 26 26 26 26 26 26 ...
## $ borocode : Factor w/ 5 levels "1","2","3","4",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ boroname : Factor w/ 5 levels "Bronx","Brooklyn",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ cncldist : Factor w/ 51 levels "1","2","3","4",..: 33 33 33 33 33 33 33 33 33 33 ...
## $ st_assem : Factor w/ 65 levels "23","24","25",..: 30 30 30 30 30 30 30 30 30 30 ...
## $ st_senate : Factor w/ 26 levels "10","11","12",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ nta : Factor w/ 188 levels "BK09","BK17",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ nta_name : Factor w/ 188 levels "Allerton-Pelham Gardens",..: 22 22 22 22 22 22 22 22 22 22 ...
## $ boro_ct : Factor w/ 2152 levels "1000201","1000202",..: 620 620 620 620 620 620 620 620 620 620 ...
## $ state : Factor w/ 1 level "New York": 1 1 1 1 1 1 1 1 1 1 ...
## $ latitude : num 40.7 40.7 40.7 40.7 40.7 ...
## $ longitude : num -74 -74 -74 -74 -74 ...
## $ x_sp : num 986271 986037 986078 985505 985455 ...
## $ y_sp : num 194161 195195 194231 194307 194242 ...
## $ spc_latin.y : Factor w/ 58 levels "acer campestre",..: 48 7 50 5 20 20 20 20 20 20 ...
## $ form : Factor w/ 5 levels "Pyramidal","Rounded",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ growth_rate : Factor w/ 3 levels "Medium","Slow",..: 3 2 1 2 1 1 1 1 1 1 ...
## $ fall_color : Factor w/ 14 levels "Cream","Maroon",..: 10 14 11 7 11 11 11 11 11 11 ...
## $ environmental_tolerances : Factor w/ 31 levels "Drought and High pH Tolerant",..: 22 21 13 25 30 30 30 30 30 30 ...
## $ location_tolerances : Factor w/ 9 levels "Median Tree",..: 9 7 4 7 1 1 1 1 1 1 ...
## $ notes_suggested_cultivars: Factor w/ 46 levels "'Accolade' is Semi–double Flowering;",..: 41 37 12 5 10 10 10 10 10 10 ...
## $ tree_size : Ord.factor w/ 5 levels "Not Applicable"<..: 1 2 2 2 5 5 5 5 5 5 ...
## $ comments : Factor w/ 1 level "Asian Long Horn Beetle Quarantine Species – Planting NOT RECOMMENDED in Brooklyn, Manhattan, Queens, and Staten Island": NA NA NA NA NA NA NA NA NA NA ...
## $ sidewalk_damage : logi TRUE FALSE FALSE FALSE TRUE TRUE ...
## $ beetle : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## NULL
According to the tidy data principle of atomicity, the ‘environmental_tolerances’, ‘location_tolerances’ and ‘fall_color’ fields should be split out into their constituent contents such that only one logical piece of information is encoded in each field. Thus, from these three fields a number of binary fields are extracted to represent all unique factors individually and their combinations when aggregated. The missing values are handled later on in the analysis.
The environment tolerances field is split into 8 binary fields, the location tolerances field into 4 fields and the fall colors field into 8 fields.
After this processing step, the data is in tidy format.
# Finding the unique values present in this column, which will be split out into other columns
unique(tree$environmental_tolerances)
## [1] Species not found
## [2] Shade Tolerant
## [3] Salt, Drought, and Shade Tolerant
## [4] Wet Site and Shade Tolerant
## [5] Wet Site, Salt, Drought, High Wind, Pollution, and High pH Tolerant
## [6] Salt, Drought, High Wind, Pollution and High pH Tolerant
## [7] Pollution Tolerant
## [8] Drought, High Wind, and Pollution Tolerant
## [9] Salt, Drought, Pollution, and High pH Tolerant
## [10] Salt and Shade Tolerant
## [11] Shade and High pH Tolerant
## [12] Drought, High Wind, Pollution, and High pH Tolerant
## [13] Salt Tolerant
## [14] Wet Site Tolerant
## [15] Salt, Drought, Pollution and High pH Tolerant
## [16] Salt and High pH Tolerant
## [17] Salt, Drought, Pollution and Shade Tolerant
## [18] Drought and High pH Tolerant
## [19] Wet Site and Drought Tolerant
## [20] Drought and Pollution Tolerant
## [21] Salt and Drought Tolerant
## [22] High pH Tolerant
## [23] Wet Site, Salt, Drought, Shade and High pH Tolerant
## [24] Salt, Shade and High pH Tolerant
## [25] Wet Site, Drought, High pH Tolerant
## [26] none
## [27] Drought Tolerant
## [28] Wet Site, Salt, Drought, High Wind, Pollution and High pH Tolerant
## [29] Wet Site and Salt Tolerant
## [30] Salt, Drought, Pollution, High pH Tolerant
## [31] Wet Site, Salt and High Wind Tolerant
## 31 Levels: Drought and High pH Tolerant ... Wet Site, Salt, Drought, Shade and High pH Tolerant
# Creating new columns based on regex match of the environmental_tolerances column
tree$env_shade = grepl("Shade",tree$environmental_tolerances, ignore.case = TRUE)
tree$env_salt = grepl("Salt",tree$environmental_tolerances, ignore.case = TRUE)
tree$env_drought = grepl("Drought",tree$environmental_tolerances, ignore.case = TRUE)
tree$env_wet = grepl("Wet Site",tree$environmental_tolerances, ignore.case = TRUE)
tree$env_wind = grepl("High Wind",tree$environmental_tolerances, ignore.case = TRUE)
tree$env_pollution = grepl("Pollution",tree$environmental_tolerances, ignore.case = TRUE)
tree$env_ph = grepl("High pH",tree$environmental_tolerances, ignore.case = TRUE)
tree$env_none = grepl("none",tree$environmental_tolerances, ignore.case = TRUE)
# head of the new columns created
head(tree[,52:59])
# Finding the unique values present in this column, which will be split out into other columns
unique(tree$location_tolerances)
## [1] Species not found
## [2] Small Tree Pit (<3 ft)
## [3] Median Tree, Small Tree Pit (<3 ft), Narrow Growing Space
## [4] Median Tree
## [5] Median Tree, Narrow Growing Space
## [6] none
## [7] Median Tree, Small Tree Pit(<3ft), and Narrow Growing Space
## [8] Small Tree Pit (<3 ft), Narrow Growing Space
## [9] Median Tree, Small Tree Pit (<3 ft)
## 9 Levels: Median Tree ... Species not found
# Creating new columns based on regex match of the location_tolerances column
tree$loc_small = grepl("Small Tree Pit",tree$location_tolerances, ignore.case = TRUE)
tree$loc_median = grepl("Median Tree",tree$location_tolerances, ignore.case = TRUE)
tree$loc_narrow = grepl("Narrow Growing Space",tree$location_tolerances, ignore.case = TRUE)
tree$loc_none = grepl("none",tree$location_tolerances, ignore.case = TRUE)
# head of the new columns created
head(tree[,60:63])
# The suggested cultivars field contain 46 unique values, each of which is unique to the specific tree
# hence, this field is not binarised
length(unique(tree$notes_suggested_cultivars))
## [1] 46
# The fall_color field contains color combinations of 8 individual colors, which may be useful
# when split into their component colors. This field is binarised into 8 extra columns.
unique(tree$fall_color)
## [1] Species not found Yellow/Orange Yellow
## [4] Red/ Yellow Maroon Red/Bronze
## [7] Red Purple/Maroon Orange/Brown
## [10] Yellow/Bronze Red/Orange Red/ yellow
## [13] Cream Yellow/ Bronze
## 14 Levels: Cream Maroon Orange/Brown Purple/Maroon Red ... Yellow/Orange
# Creating binarised fields
tree$col_yellow = grepl("Yellow",tree$fall_color, ignore.case = TRUE)
tree$col_orange = grepl("Orange",tree$fall_color, ignore.case = TRUE)
tree$col_red = grepl("Red",tree$fall_color, ignore.case = TRUE)
tree$col_maroon = grepl("Maroon",tree$fall_color, ignore.case = TRUE)
tree$col_bronze = grepl("Bronze",tree$fall_color, ignore.case = TRUE)
tree$col_purple = grepl("Purple",tree$fall_color, ignore.case = TRUE)
tree$col_brown = grepl("Brown",tree$fall_color, ignore.case = TRUE)
tree$col_cream = grepl("Cream",tree$fall_color, ignore.case = TRUE)
# head of the new columns created
head(tree[,64:71])
Just for fun, a new field is created to find the straight-line distance from the tree to Times Square, Manhattan, in meters. Through some experimentation and research, it is found that each 0.0001 unit change in latitude corresponds to 11 meters in distance. Each 0.0001 unit change in longitude corresponds to 8 meters in distance. The latitude and longitude fields of the census data are fed into a custom function which calculates the euclidean distance between that point and the lat/long point of Times Square, which is 40.758896, -73.985130
dist_from_times_square = function(x) {
distance = round(sqrt((abs(40.758896-x[1])/0.0001*11)^2+(abs(-73.985130-x[2])/0.0001*8)^2),0)
return(distance)
}
tree$times_square = dist_from_times_square(tree[,38:39])$latitude
# As expected, the closer the borough to Manhattan, the smaller the mean distance from Times Square
tree %>% group_by(boroname) %>% summarise(mean_distance = mean(times_square, na.rm = TRUE))
There are several fields with missing values, which are dealt with as described below.
# It can be seen that due to the full join, the 22 rows that are missing for most columns are because
# that particular species of tree was not found in the census data but kept in the joined dataframe.
# These rows are deleted.
tree = tree[-which(is.na(tree$boroname)),]
tree[which(is.na(tree$health)),"health"] = "Not Applicable"
tree[which(is.na(tree$guards)),"guards"] = "Not Applicable"
tree[which(is.na(tree$sidewalk)),"sidewalk"] = "Not Applicable"
tree[which(is.na(tree$spc_latin.x)),c("spc_latin.x","spc_common")] = "Not Applicable"
tree[which(is.na(tree$problems)),"problems"] = "Not Applicable"
tree[which(is.na(tree$steward)),"steward"] = "<NA>"
# The 'comments' field is the only one left with missing values, which have already been handled
# via the creation of the binary 'beetle' field. This field is dropped
tree = tree[,names(tree)!="comments"]
colSums(is.na(tree))
## tree_id block_id
## 0 0
## created_at tree_dbh
## 0 0
## stump_diam curb_loc
## 0 0
## status health
## 0 0
## spc_latin.x spc_common
## 0 0
## steward guards
## 0 0
## sidewalk user_type
## 0 0
## problems root_stone
## 0 0
## root_grate root_other
## 0 0
## trunk_wire trnk_light
## 0 0
## trnk_other brch_light
## 0 0
## brch_shoe brch_other
## 0 0
## address zipcode
## 0 0
## zip_city cb_num
## 0 0
## borocode boroname
## 0 0
## cncldist st_assem
## 0 0
## st_senate nta
## 0 0
## nta_name boro_ct
## 0 0
## state latitude
## 0 0
## longitude x_sp
## 0 0
## y_sp spc_latin.y
## 0 0
## form growth_rate
## 0 0
## fall_color environmental_tolerances
## 0 0
## location_tolerances notes_suggested_cultivars
## 0 0
## tree_size sidewalk_damage
## 0 0
## beetle env_shade
## 0 0
## env_salt env_drought
## 0 0
## env_wet env_wind
## 0 0
## env_pollution env_ph
## 0 0
## env_none loc_small
## 0 0
## loc_median loc_narrow
## 0 0
## loc_none col_yellow
## 0 0
## col_orange col_red
## 0 0
## col_maroon col_bronze
## 0 0
## col_purple col_brown
## 0 0
## col_cream times_square
## 0 0
paste("There are now",sum(colSums(is.na(tree))),"missing values in total in the dataset")
## [1] "There are now 0 missing values in total in the dataset"
There are 7 numerical fields which are scanned for outlier values.
The boxplots of outliers appear to show several variables with large numbers of outliers. However, a row will only ever have a non-zero value for either tree diameter or stump diameter as these are mutually exclusive categories of tree. The measurement taken will either be for a tree or for a stump. While these may be handled in various ways, none of them seem optimal. If the two fields were combined, this would remove one piece of information about the tree’s status (albeit it is encoded elswehere) and would also erroneously combine two different processes/populations into one. Stumps and living trees may have drastically different diameters for valid reasons. The multitude of zero values could also be set as NA’s, but this may hinder future analysis. For these reasons, the outliers will not be processed for these fields.
Similarly for the “horizontal” geolocational fields - longitude and x_sp - the outliers are likely because of the differing tree densities across the east-west plane of the city and not necessarily because of error. As the tables below show, all longitude outliers were in the borough of Staten Island and most of the times_square outliers. There is no such outlier effect in north-south geolocational fields, which also lends some credence to this hypothesis.
# tree_dbh
tree$tree_dbh %>% boxplot(main="Box Plot of Tree Diameter", ylab="Inches", col = "grey")
# The tree diameter and stump diameter are mutually exclusive
tree %>% group_by(status) %>% summarise(stump = mean(stump_diam, na.rm = TRUE), tree = mean(tree_dbh, na.rm = TRUE))
# stump_diam
tree$stump_diam %>% boxplot(main="Box Plot of Stump Diameter", ylab="Inches", col = "grey")
# latitude
tree$latitude %>% boxplot(main="Box Plot of Latitude", ylab="Value", col = "grey")
# longitude
tree$longitude %>% boxplot(main="Box Plot of Longitude", ylab="Value", col = "grey")
tree %>% group_by(boroname) %>% summarise(trees=n()/nrow(tree), lat_mean = mean(latitude), long_mean = mean(longitude))
# all outlier trees are in the borough of Staten Island
z.scores = tree$longitude %>% scores(type = "z")
tree[which(abs(z.scores)>2),] %>% group_by(boroname) %>% summarise(n = n())
# x_sp
tree$x_sp %>% boxplot(main="Box Plot of X Coordinate, State Plane", ylab="Feet", col = "grey")
# y_sp
tree$y_sp %>% boxplot(main="Box Plot of Y Coordinate, State Plane", ylab="Feet", col = "grey")
# times_square
tree$times_square %>% boxplot(main="Box Plot of Distance to Times Square", ylab="Meters", col = "grey")
# The majority of outliers are again on Staten Island. Manhattan only had 9% of trees and very sparsely
# located. The Manhattan outliers are likely lone trees that are very close to Times Square.
z.scores = tree$times_square %>% scores(type = "z")
tree[which(abs(z.scores)>2),] %>% group_by(boroname) %>% summarise(n = n())
The times_square field is currently just a bunch of large numbers, denoting the distance in meters from Times Square. This field may be more useful if it instead represented the scale of how far it was relative to the population of other trees. In order to do this, the field is z score standardised such that the mean point is zero and the standard deviation is one.
tree$times_square = scale(tree$times_square, center = TRUE, scale = TRUE)
head(tree[,c("boroname","times_square")])