Executive Summary

This report describes a data preprocessing exercise using open data. The goal of the exercise is to import, inspect, and manipulate data following a standard data preprocessing pipeline.

The desired output of the preprocessing is a tidy dataset that combines data on street trees in New York City with US Census estimates of median household income, which could be used in a future analysis of factors affecting tree health and community stewardship.

This report was compiled as a course assignment for MATH2349 Data Wrangling at the Royal Melbourne Institute of Technology (RMIT). It represents my own work based on data sourced from the City of New York and the United States Census Bureau and should not be used or reproduced for any other purpose.

Required packages

All analysis was conducted using R version 3.5.1 (2018-07-02) in RStudio (1.1.456) on Windows 10 Enterprise (1803). The following packages are required to reproduce the analysis: tidyverse (1.2.1), readxl (1.2.0), lubridate (1.7.4), mlr (2.13), Hmisc (4.2-0), editrules (2.9.3), deducorrect (1.3.7), infotheo (1.2.0), forecast (8.7), and cowplot (0.9.3).

library(tidyverse)
library(readxl)
library(lubridate)
library(mlr)
library(Hmisc)
library(editrules)
library(deducorrect)
library(infotheo)
library(forecast)
library(cowplot)

Data

The goal of this exercise is to combine data about New York City’s street trees with income data about its residents. It brings together data from two different sources to create a proof-of-concept analysis dataset that could inform such a study in the future.

NYC Street Tree Census

The New York City Department of Parks & Recreation (NYC Parks) conducts a citizen census of all of its street trees every five years. The program started in 1995, and the last census was in 2015. Volunteers receive extensive training from NYC Parks and assistance from ‘tree guides’ to prepare for the census. NYC Parks also worked with a group called TreeKIT and partners OpenPlan and CartoDB to develop a mobile data entry tool that simplifies the data collection process. In the last census, 2,200 volunteers mapped and measured 666,134 street trees over 131,488 blocks, walking over 11,093 miles in the process (NYC 2020).

The data collected by the volunteers is collated and used to inform a data-driven approach to urban tree management by the city. The work is published in the interactive “2015 Street Tree Census Report” and “NYC Street Tree Map”, available to the public on the NYC Parks website. It also informs specific urban forest summaries for boroughs, community boards, council districts, and neighbourhoods, used by local community groups to inform their own advocacy and decision making. The data from the study is published on the New York City Open Data Portal in a variety of formats. This study uses a .csv file downloaded locally and loaded into R using the base read.csv() function. A comprehensive data dictionary with descriptions of the columns in the table is available alongside the dataset.

ACS Median Household Income

The other data used for this analysis relates to 5-year median household income estimates by US Census tract, sourced from the United States Census Bureau open data portal. The data is collected as part of the American Community Survey (ACS), a continuous study by the Bureau that produces annual data and reporting across a wide range of demographic and economic topics. The results are statistically determined based on representative samples of the population, which is different from the complete count of the decennial census. The results are nevertheless useful in guiding the allocation of US$675 billion in federal and state funds. They are also used by policy professionals, elected officials, researchers, and community groups to inform policy and decision making across a range of topics.

The US Census Bureau open data portal includes an interactive search and collation tool that allows users to specify the data tables, levels of geography, and format of datasets for download. The data for this report comes from table B19013: “MEDIAN INCOME IN THE PAST 12 MONTHS (IN 2015 INFLATION-ADJUSTED DOLLARS)”, which matches the timeframe of the latest NYC Street Tree Census. The data for every census tract in New York State was downloaded as a .csv file, along with accompanying notes and metadata, and imported into R using the base read.csv() function. The Median Household Income dataset includes two ID variables and data on estimated median household income and the estimation range (plus/minus) from the statistical process.

# Street Trees Dataset
# https://data.cityofnewyork.us/Environment/2015-Street-Tree-Census-Tree-Data/uvpi-gqnh
raw_trees  <- read.csv("2015_Street_Tree_Census_-_Tree_Data.csv", 
                       stringsAsFactors = FALSE, na.strings = "")

# Street Trees Data Dictionary
# https://data.cityofnewyork.us/api/views/uvpi-gqnh/files/8705bfd6-993c-40c5-8620-0c81191c7e25?download=true&filename=StreetTreeCensus2015TreesDataDictionary20161102.pdf

# Select variables of interest for processing
proc_trees <- raw_trees %>%
  select(tree_id,                            # Primary key
         created_at,                         # Date
         tree_dbh, stump_diam,               # Integer
         spc_latin, spc_common,              # Character
         curb_loc, guards, sidewalk, status, # Factors
         health, steward,                    # Ordered factors
         problems,                           # Multi-text
         latitude, longitude, x_sp, y_sp,    # Numeric
         boro_ct)                            # Foreign key

# Median HH Income Dataset
# https://data.census.gov/cedsci/table?q=B19013%3A%20MEDIAN%20HOUSEHOLD%20INCOME%20IN%20THE%20PAST%2012%20MONTHS%20%28IN%202018%20INFLATION-ADJUSTED%20DOLLARS%29&hidePreview=true&tid=ACSDT1Y2018.B19013
filepath <- "ACSDT5Y2015.B19013_2020-06-01T065604"
filename <- "ACSDT5Y2015.B19013_data_with_overlays_2020-06-01T065600.csv"
raw_income <- read.csv(paste(filepath,filename,sep="/"), 
                       stringsAsFactors = FALSE, na.strings = "null")
raw_income <- raw_income[-1,]   # drop descriptive second header row

# Guide to GEO_ID Census codes
# https://www.census.gov/programs-surveys/geography/guidance/geo-identifiers.html

# NYC Census Tract Definitions
# https://data.cityofnewyork.us/City-Government/2010-Census-Tracts/fxpq-c8ku
geography <- read.csv("nyct2010.csv", stringsAsFactors = FALSE)

The NYC Street Trees dataset contains 683,788 rows and 45 columns. Each record represents one tree counted, and is described by a unique key called ‘tree_id’. A full accounting of the variables in the dataset can be found in the data dictionary provided by NYC Parks on the data portal. The following is a brief summary of the type of data it contains:

Summary of NYC Street Trees Dataset
Type of Information Columns
Unique Identifier tree_id
Date or Record created_at
Tree Measurements tree_dbh, stump_diam
Tree Condition status, health, problems
Tree Species spc_latin, spc_common
Local Context block_id, curb_loc, steward, guards, sidewalk
Observations (roots) root_stone, root_grate, root_other, trnk_wire, trnk_light, trnk_other, brch_light, brch_shoe, brch_other
Location (area) address, state, postcode, zip_city, community.board, borocode, borough, cncldist, council.district, st_assem, st_senate, nta, nta_name, boro_ct, census.tract
Location (point) latitude, longitude, x_sp, y_sp

The ACS Median Household Income table contains 4,918 rows and 4 columns. Each record represents one census tract in New York State, and is described by a unique key called ‘GEO_ID’. A full accounting of the variables in the dataset can be found in the data notes provided by the US Census Bureau on the data portal. The following is a summary of the data it contains:

Summary of ACS Median Household Income Dataset
Type of Information Columns
Unique Identifier GEO_ID
Location (area) NAME
Median Income Estimate B19013_001E
Income Margin of Error B19013_001M

The NYC Street Tree data includes a number of geographical identifiers that can be used to match it with spatial layers and other datasets, in addition to two types of coordinates (spherical latitude/longitude and state plane x/y) and the closest street address. These identifiers include the name and numeric code for the five New York City boroughs of Manhattan, the Bronx, Brooklyn, Queens, and Staten Island; postcode (i.e. zip code); community board; City Council District; State Assembly District; State Senate District; Neighbourhood Tabulation Area (NTA); US Census Tract; and a unique numeric ID called ‘boro_ct’ developed by the City government for a host of mapping and matching purposes.

The ‘boro_ct’ is the key used for matching data between the two datasets, as it is in a more consistent format than the ‘census.tract’ variable in the street trees dataset. It is a seven-digit code that appends the single-digit ‘borocode’ to the beginning of the six-digit Census Tract identifier. (Census tracts consist of a four-digit number with two optional decimal places, which are represented in the ‘boro_ct’ as six digits omitting the decimal.) For example, the ‘boro_ct’ value ‘1013300’ represents “Census Tract 133, New York County, New York” in the Borough of Manhattan. (Note that Census Tract numbers (e.g. 133) are unique within a county, but can be reused in other counties. Therefore an additional identifier is needed to match data across multiple counties or larger geographies.)

The ‘boro_ct’ field does not exist in the ACS Median Household Income dataset, but it can be derived from the information in other fields. The ‘GEO_ID’ is a US Census identifier that contains information about multiple nested geographical divisions. For example, the ‘GEO_ID’ value 1400000US36061013300 represents the same Census tract as in the example above. The digits after the ‘US’ in the string correspond to state (36 = New York), county (061 = New York County), and Census Tract with suffix (013300 = Census Tract 133).
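This decomposition can be illustrated in a couple of lines (a quick sketch using the str_sub() function from the stringr package, which loads with the tidyverse; the values are from the example above):

# Illustrative decomposition of the example GEO_ID (sketch only)
geo_id <- "1400000US36061013300"
str_sub(geo_id, -11, -10)        # "36"     = state: New York
str_sub(geo_id, -9,  -7)         # "061"    = county: New York County
str_sub(geo_id, -6)              # "013300" = Census Tract 133, no decimal

# Prepending the 'borocode' (1 = Manhattan) gives the 'boro_ct' join key
paste0(1, str_sub(geo_id, -6))   # "1013300"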

The following steps construct this key in the income dataset so that the two datasets can be joined:

  1. The ACS Median Household Income dataset is filtered to only include the five counties that make up New York City (and that correspond to the five boroughs). This is achieved by searching the ‘NAME’ variable for one of five unique text strings: “New York County” (i.e. Manhattan), “Bronx”, “Kings” (i.e. Brooklyn), “Queens”, or “Richmond” (i.e. Staten Island).

  2. A new placeholder variable called ‘borocode’ is created using a rule that matches these same text strings from the ‘NAME’ variable to the corresponding numeric code (e.g. “Bronx” = 2).

  3. A second new variable is created that appends this ‘borocode’ value to the last six digits extracted from the ‘GEO_ID’ code variable.

  4. The ‘borocode’ variable is deleted, as it is no longer required.

The two datasets are then merged using a left join to preserve every row in the trees dataset and append the median household income and estimation range. The key used to join the two is the ‘boro_ct’ column. It is the primary key in the ACS Median Household Income dataset, since it is the unique identifier for each census tract. It is a foreign key in the NYC Street Trees dataset, as many trees fall within each census tract. Therefore the merge represents a one-to-many join.

# Filter income data to New York City counties and construct
# the 'boro_ct' join key from the NAME and GEO_ID fields
county_filter <- "Bronx|Kings|Queens|Richmond|New York County"
proc_income <- raw_income %>%
  # Subset only counties within New York City
  filter(str_detect(NAME, county_filter)) %>%
  # Create 'boro_ct' foreign key: code borough number from county name...
  mutate(borocode = case_when(str_detect(NAME, "New York County") ~ 1,
                              str_detect(NAME, "Bronx"          ) ~ 2,
                              str_detect(NAME, "Kings"          ) ~ 3,
                              str_detect(NAME, "Queens"         ) ~ 4,
                              str_detect(NAME, "Richmond"       ) ~ 5,
                              TRUE ~ 99),
         # ...then append census tract from last six digits of full GEO_ID
         boro_ct = as.integer(paste0(borocode, str_sub(GEO_ID, -6)))) %>%
  # Remove placeholder variable for 'borocode' no longer required
  select(-borocode)

# Join NYC Street Trees data to ACS Median Household Income data
combined_data <- left_join(proc_trees, proc_income, by = "boro_ct") %>% 
  select(-GEO_ID, -NAME)

combined_data %>% head()

Understand

The variables in the dataset are imported as one of three types: integer, numeric, or character. However, this does not always match the desired type. The following conversions are carried out before examining the data:

  1. The ‘created_at’ column holds the date that the record was entered into the dataset, but it is stored as a character in MM-DD-YYYY format in the .csv file. The mdy() function from the lubridate package is used to parse the dates to a “Date” format.

  2. Four variables stored as text are restricted to only a few values by the data collection rules. The following are converted to factors using the factor() function from base R:

    • curb_loc: Specifies where a tree sits within the street corridor. It can take the values of “OffsetFromCurb” or “OnCurb”.

    • status: The top level of multiple variables related to tree condition. It can take the values of “Alive”, “Dead”, or “Stump”.

    • guards: Notes whether the tree has protective guards, and if so, their estimated effect on the tree. It can take the values of “Harmful”, “Helpful”, “None”, or “Unsure”.

    • sidewalk: Notes whether there is visible damage to the sidewalk from tree roots. It can take the values of “Damage” or “NoDamage”.

  3. Two other variables are also restricted to only a few values, but those values can be put in a logical order. The following are converted to ordered factors using the ordered() function from base R:

    • health: The second level of multiple variables related to tree condition. It can take the ordered values of “Poor” < “Fair” < “Good”.

    • steward: Notes how many signs of community care or stewardship are present, such as a community-installed guard. It can take the ordered values of “None” < “1or2” < “3or4” < “4orMore”. (Note that the categories are ambiguous when the number of signs equals exactly four. For the purposes of this analysis, the top category in the ordered factor will be assumed to mean “more than 4”.)

  4. The median income and margin of error columns are imported from the .csv as character variables, because they contain some text values with specific meanings. These values are described in the data notes, but are not required for this analysis. They are either coded as numbers or treated as missing values. For example:

    • The income level “250,000+” is coded as the number 250,000. Similarly, the margin of error value of “***” indicates that the income value in question is in the upper range of an open-ended distribution; it is coded as the number 0. Although this loses some information about the range of the upper income bracket, it is kept as 0 rather than treated as a missing value so that census tracts with a median household income of US$250,000 remain in the dataset. They could easily be excluded if that is later determined to be a better approach for future analysis.

    • The income level value “-” indicates that there were not enough respondents to calculate a statistical estimate, as does the margin of error value of “**”. Both of these values are coded as missing values (NA) in their respective columns.

  5. Once the text characters are coded as either numbers or NA, the median income and margin of error columns are converted to numeric format using the as.integer() function in base R. Had this conversion been done prior to Step 4, the text values would have all been coerced to NA. Going through each text exception helped catch the “250,000+” text that was not mentioned in the data notes.

  6. Finally, the median income and margin of error columns were renamed with more human-readable column names using the rename() function from the dplyr package:

    • B19013_001E: was renamed to ‘Median_HH_Income’.

    • B19013_001M: was renamed to ‘PlusMinus_HH_Income’.

# Convert data types, recode placeholder values, and rename income columns
formatted_data <- combined_data %>%
  mutate(# Dates
         created_at = mdy(created_at),
         # Regular Factors
         curb_loc = factor(curb_loc),
         status   = factor(status),
         guards   = factor(guards),
         sidewalk = factor(sidewalk),
         # Ordered Factors
         health  = ordered(health, levels = c("Poor","Fair","Good")),
         steward = ordered(steward, levels = c("None","1or2","3or4","4orMore")),
         # Replace placeholders that signify open ended estimate
         B19013_001E = replace(B19013_001E, B19013_001E == "250,000+", 250000),
         B19013_001M = replace(B19013_001M, B19013_001M == "***", 0),
         # Replace placeholders that signify sample size is too small
         B19013_001E = replace(B19013_001E, B19013_001E == "-", NA),
         B19013_001M = replace(B19013_001M, B19013_001M == "**", NA),
         # Coerce income values to integers - check for no coercion warnings
         B19013_001E = as.integer(B19013_001E),
         B19013_001M = as.integer(B19013_001M)) %>%
  # Rename income variable to something human readable
  rename(Median_HH_Income    = B19013_001E,
         PlusMinus_HH_Income = B19013_001M)

formatted_data %>% select(Median_HH_Income, PlusMinus_HH_Income) %>% summary()
##  Median_HH_Income PlusMinus_HH_Income
##  Min.   :  9829   Min.   :     0     
##  1st Qu.: 44911   1st Qu.:  7828     
##  Median : 63156   Median : 11658     
##  Mean   : 65407   Mean   : 13065     
##  3rd Qu.: 82927   3rd Qu.: 16375     
##  Max.   :250000   Max.   :118971     
##  NA's   :8181     NA's   :8181
# Check updated data structures         
features <- names(formatted_data)
classes <- sapply(formatted_data, class)

The following is a brief summary of the variables with their resulting class types after conversion:

Data Types in Combined Dataset
Variable Class Description
tree_id integer Unique identification number for each tree point
created_at Date The date tree points were collected in the census software
tree_dbh integer Diameter of the tree in inches, measured at approximately 54in / 137cm above the ground
stump_diam integer Diameter of stump measured through the center, rounded to the nearest inch
spc_latin character Scientific name for species
spc_common character Common name for species
curb_loc factor Location of tree bed in relationship to the curb
guards factor Indicates whether a guard is present, and whether it is a helpful or harmful guard
sidewalk factor Indicates whether one of the sidewalk flags immediately adjacent to the tree was damaged
status factor Indicates whether the tree is alive, standing dead, or a stump
health ordered, factor Indicates the level of health (of live trees only)
steward ordered, factor Indicates the number of unique signs of stewardship observed
problems character Indicates the presence of one or more problems with the tree, from a predefined list
latitude numeric Latitude of point, in decimal degrees (WGS84)
longitude numeric Longitude of point, in decimal degrees (WGS84)
x_sp numeric X coordinate of point, in feet (State Plane)
y_sp numeric Y coordinate of point, in feet (State Plane)
boro_ct integer Identifier for the census tract that the tree point falls into (NYC custom format)
Median_HH_Income integer Estimated median household income for the census tract (from ACS)
PlusMinus_HH_Income integer Margin of error for the median household income estimate (from ACS)

Tidy & Manipulate Data I

One of the variables in the NYC Street Trees dataset contains multiple categories, which violates Hadley Wickham’s first principle of tidy data that “Each variable must have its own column”. The ‘problems’ variable stores a list of the problems identified for each tree, or “None” if none were observed. These problems are selected from a list and therefore limited to nine possible values: three each for problems related to a tree’s roots, trunk, and branches.

A better way to store this data is in nine new binary (or logical) columns containing either “Yes” or “No” (or TRUE or FALSE) for each possible problem. In fact, to NYC Parks’ credit, they do provide exactly these nine columns next to the ‘problems’ column in the .csv file. But for the purposes of this exercise they are derived from the original ‘untidy’ column, which is then discarded. That is done according to the following steps:

  1. The text in the ‘problems’ column is broken out using the separate_rows() function of the tidyr package, which parses the text based on a delimiter character - in this case “,”. This function creates a duplicate row for each additional value in the ‘problems’ column. For example, a tree with three problems listed in the ‘problems’ column will be duplicated twice, for a total of three rows in the resulting dataset.

  2. The data is now in a ‘long’ format, and can be reshaped to a ‘wide’ format using the spread() function in the tidyr package. However, this function requires both a ‘key’ and a ‘value’ column, which aren’t present. The newly duplicated ‘problems’ variable serves as the key, whose values will become the new variable names. So a new column is added called ‘placeholder’ that is set to a text value “Yes” that will indicate the problem is present.

  3. Then, the spread() function is used with ‘problems’ as the key and ‘placeholder’ as the value arguments passed to it. An additional argument of ‘fill = “No”’ populates missing values - where a specific problem is not identified for a tree - with that value.

  4. Finally, this process creates a redundant column for the value “None” in the original ‘problems’ column. This acted as a missing-value string in that format, which is preferable to recording a missing observation that could be interpreted as an omission in the data collection process. In this new format, a tree with no problems identified will have the value “No” in each problem type. Since the ‘None’ column is no longer needed, it is removed.

# The problems variable contains multiple values (untidy!)
# So spread the values in the problems field into multple binary columns
wide_data <- formatted_data %>% 
  separate_rows(problems, sep = ",") %>% 
  # Code NA in 'problems' as "Unknown" and add a placeholder value column
  mutate(problems = replace(problems, is.na(problems), "Unknown"),
         placeholder = "Yes") %>%
  # Reshape from long to wide: one "Yes"/"No" column per problem type
  spread(key = problems, value = placeholder, fill = "No") %>%
  # Drop the redundant "None" column and convert new columns to factors
  select(-None) %>%
  mutate_at(vars(BranchLights:WiresRope), factor)

wide_data %>% head()

Tidy & Manipulate Data II

The foreign key ‘boro_ct’ was created from component parts of the ‘GEO_ID’ and ‘NAME’ variables prior to merging the datasets. Four more variables are created here using calculations and combinations of other existing variables: upper and lower bounds for the income estimate (median plus or minus the margin of error), the diameter converted to centimetres, and a single ‘condition’ variable that combines ‘health’ (for live trees) with ‘status’ (for dead trees and stumps):

# Create derived variables from combinations of existing columns
full_data <- wide_data %>%
  mutate(Income_Upper = Median_HH_Income + PlusMinus_HH_Income,
         Income_Lower = Median_HH_Income - PlusMinus_HH_Income,
         # Combine tree and stump diameter readings and convert inches to cm
         diam_cm = (2.54 * (tree_dbh + stump_diam)),
         # Use 'health' where recorded (live trees), otherwise 'status'
         condition = coalesce(as.character(health), as.character(status)),
         condition = ordered(condition, levels = c("Stump","Dead","Poor","Fair","Good")))

full_data %>% head()

Scan I: Missing Values and Obvious Errors

Missing Values

Of the 683,788 total records, 644,461 are complete. This represents 94.2% of the full dataset. The chart below shows the incidence of missing values (NA) in each of the columns in the full dataset. The bars show the percent incomplete and the labels show the actual number of missing values in each column.
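These figures can be verified directly (a quick check using complete.cases() from base R on the full_data object created above):

# Verify the completeness figures quoted above
sum(complete.cases(full_data))                   # 644,461 complete rows
round(mean(complete.cases(full_data)) * 100, 1)  # 94.2 percent complete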

dead_or_stump_mask <- ((full_data$status == "Stump") | (full_data$status == "Dead"))
dead_or_stump      <- sum(dead_or_stump_mask)        # 31,615

# Define custom function to chart missing values by column
na_cols <- function(data){
  perc.na <- data %>% is.na() %>% colMeans() %>% round(3)
  numb.na <- data %>% is.na() %>% colSums()
  labels  <- numb.na #paste(perc.na*100,"%")
  labels[numb.na == 0] <- "OK"
  columns <- data %>% colnames()
  na.cols <- data.frame(columns, perc.na, labels, numb.na)
  ggplot(na.cols, aes(x=reorder(columns,-numb.na), y=perc.na)) + 
    geom_bar(stat = "identity") + ylim(0,1) +
    geom_text(aes(label=labels), hjust=-0.05, size=2) +
    # Apply the base theme first so it does not override the tweaks below
    theme_grey() +
    theme(axis.text.x = element_text(angle = 90),
          plot.title = element_text(size = 8)) + 
    ggtitle(paste("Incomplete Data Summary for Columns in",
                  deparse(substitute(data)))) + 
    xlab(NULL) +
    ylab("Percent NA") + coord_flip()
}

na_cols(full_data)

Most of the incomplete records have data missing from the NYC Street Trees table. Since this data represents a census of all of the street trees in New York City, these records should not be excluded from any analysis. Most of these missing values are attributable to trees marked as either “Dead” or “Stump” in the ‘status’ column. There are 31,615 dead trees and stumps in the dataset. Several of the attributes would not be expected to be collected for these, such as species names or evidence of stewardship. The following is a summary of those columns with missing values and how they are handled:

Tree Condition

There is one missing value in the ‘condition’ column. It belongs to a live tree that had a missing value in the ‘health’ column, which transferred over to the ‘condition’ column. This value will be set to the middle level of the ‘health’ scale, which is “Fair”. Because there is only one such record, this is easily done manually using base R subsetting.

imputed_condition <- full_data
imputed_condition$condition[is.na(imputed_condition$condition)] <- "Fair"
imputed_condition %>% select(condition) %>% is.na() %>% colSums()
## condition 
##         0

Tree Species

There are 31,619 missing values in the ‘spc_common’ and ‘spc_latin’ columns that record the common and scientific names of each tree. Of the 31,615 trees marked either “Stump” or “Dead” in the ‘status’ column, all but one are missing species names; the other five missing values belong to live trees.

There are a few ways that could be used to impute these values.

  • One way would be to use the species found most frequently in the dataset. The London planetree is the most common species recorded, while the honeylocust, Callery pear, pin oak, and Norway maple round out the top five. However, there are only 87,014 London planetrees, so imputing an additional 31,615 dead trees and stumps would inflate this number by 36%.
imputed_condition %>% group_by(spc_common) %>% summarise(n=n()) %>% arrange(desc(n)) %>% top_n(5)
  • Another way would be to use the species whose average diameter in the dataset is closest to the diameter of each tree missing a species name (a brief sketch of this approach follows the list). However, as will be seen below, not all trees in the dataset (including stumps and dead trees) have a diameter recorded.

  • A third way would be to use the species name of the tree closest in space to each tree missing a species name. That would follow the assumption that similar trees tend to be planted close to each other. This assumption could be checked by examining each pair of neighbouring trees for matching species. But both that check and the imputation itself would involve a complex and potentially computationally costly nearest neighbour calculation. Although there are many algorithms that have optimised this function to various degrees, they are considered to be outside the scope of this analysis.

  • The last and simplest way to handle these missing values is the one chosen here, which is simply to code both species variables, ‘spc_common’ and ‘spc_latin’, as “Unknown”.
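For illustration only, the diameter-matching alternative from the second bullet could be sketched as follows. This is a minimal sketch rather than part of the pipeline; it assumes the ‘imputed_condition’ dataframe from the previous step, and closest_species() is a hypothetical helper:

# Sketch (not adopted): look up the species whose mean diameter in the
# data is closest to a given tree's recorded diameter
species_means <- imputed_condition %>%
  filter(!is.na(spc_latin), diam_cm > 0) %>%
  group_by(spc_latin) %>%
  summarise(mean_diam = mean(diam_cm))

closest_species <- function(d) {
  species_means$spc_latin[which.min(abs(species_means$mean_diam - d))]
}

closest_species(30)  # species with mean diameter closest to 30 cm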

imputed_species <- imputed_condition
imputed_species$spc_common[is.na(imputed_species$spc_common)] <- "Unknown"
imputed_species$spc_latin[is.na(imputed_species$spc_latin)] <- "Unknown"
imputed_species %>% select(spc_common, spc_latin) %>% is.na() %>% colSums()
## spc_common  spc_latin 
##          0          0

Tree Context

There are 31,616 missing values in the ‘guards’ and ‘sidewalk’ columns, and 31,615 missing values in the ‘steward’ column. Of these, 31,615 are marked either “Stump” or “Dead” in the ‘status’ column. According to the rules in the data dictionary, these values are not recorded for dead trees or stumps. Therefore those records will be coded with the value “Not applicable” for these three columns. However, there is one live tree with a missing value in ‘guards’, and one live tree with a missing value in ‘sidewalk’. These represent true missing values, and will be coded according to the assumption that if these conditions had been observed they would have been recorded. Under that assumption, a missing value is equivalent to a negative observation, and they are therefore coded as such.

imputed_context <- imputed_species

# Code live trees as negative case for each binary context variable
imputed_context$guards[(is.na(imputed_context$guards) & 
                          imputed_context$status == "Alive")] <- "None"
imputed_context$sidewalk[(is.na(imputed_context$sidewalk) & 
                            imputed_context$status == "Alive")] <- "NoDamage"

# Code remaining missing values, all for dead trees or stumps, as "Not applicable"
imputed_context <- imputed_context %>%
  # Remove factor levels to allow adding a new level
  mutate(steward  = as.character(steward),
         guards   = as.character(guards),
         sidewalk = as.character(sidewalk)) %>%
  # Impute dead trees and stumps with "Not applicable" for context variables
  mutate(steward  = ifelse(dead_or_stump_mask, "Not applicable", steward),
         guards   = ifelse(dead_or_stump_mask, "Not applicable", guards),
         sidewalk = ifelse(dead_or_stump_mask, "Not applicable", sidewalk)) %>%
  # Re-establish factors and ordered factors with new levels included
  mutate(steward = ordered(steward, levels = c("Not applicable", "None",
                                               "1or2","3or4","4orMore")),
         guards = factor(guards),
         sidewalk = factor(sidewalk))

imputed_context %>% select(steward, guards, sidewalk) %>% is.na() %>% colSums()
##  steward   guards sidewalk 
##        0        0        0

Median Household Income

Of the 39,327 incomplete records, 8,181 are missing data from the ACS Median Household Income dataset. These correspond to census tracts that did not have a sufficient sample size to estimate a median household income or margin of error. Some of these may not be residential areas, and others may simply have had too few residents in the sample. There is no meaningful way to impute this data, so these records would likely be excluded from any future analysis of trees and income levels. Therefore they are left as missing.
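A quick check of the ‘imputed_context’ dataframe confirms that these rows remain untouched by the imputation steps above:

# Confirm the income values left as missing
sum(is.na(imputed_context$Median_HH_Income))  # 8,181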

Zero-Value Diameter

Although there are no NA values in ‘diam_cm’, the variable for the combined diameter of all trees, there are 278 records with a value of zero. Assuming these haven’t been rounded down from a value of less than one, they should be considered missing values. Missing numeric values can be imputed from other data using a variety of methods, depending on the analysis context. In this case, zero values will be imputed using the mean diameter of different groups. Where a record has a species name recorded, the value used for imputation will be the mean diameter of other trees of that species in the data. Where no species name exists, as is the case for most stumps and dead trees, the global mean diameter is used. These are the steps in the process to achieve those two imputations:

  1. The ‘diam_cm’ variable for trees with zero-value diameters and non-null species names is set to missing (NA).

  2. Then, the dataset is grouped by species name, in this case ‘spc_latin’.

  3. Next, missing (NA) values in the ‘diam_cm’ column are replaced with the mean value of ‘diam_cm’ for each species group.

  4. Finally, the dataset is ungrouped to return the original dataframe.

  5. To impute the rest of the zero-value diameters, all remaining ‘diam_cm’ values of zero are set to missing (NA).

  6. Then, the impute() function from the mlr package is used to impute the missing values. Its arguments are the dataset and a list mapping column names to imputation functions - in this case just the ‘diam_cm’ column mapped to imputeMean().

  7. The mlr impute function returns a named list with two items: the imputed dataset called ‘data’ and a description of the imputations performed. Just the data element is pulled out as the dataframe to carry forward.

A summary of the ‘diam_cm’ values from the imputed dataset is shown below. The minimum value is greater than zero and no NA values remain, which means the imputation was successful.

# Impute zero values in diam_cm for trees with a species name to species mean diameter

imputed_diam_live <- imputed_context %>%
  # Set diam_cm zero values for trees with species name to missing (NA)
  mutate(diam_cm = ifelse((!is.na(spc_latin) & diam_cm == 0), NA, diam_cm)) %>%
  # Group by species name
  group_by(spc_latin) %>%
  # Replace diam_cm with mean value of species category where diam_cm is NA
  mutate(diam_cm = replace(diam_cm, is.na(diam_cm), mean(diam_cm, na.rm = TRUE))) %>%
  # Remove grouping to get back to full dataframe
  ungroup()

# Impute zero values in diam_cm for trees with no species with global mean
# Set remaining zero values to missing (NA)
missing_diam <- imputed_diam_live %>% 
  mutate(diam_cm = ifelse((diam_cm == 0), NA, diam_cm))
# Use mlr's impute() function to impute mean of diam_cm to NAs
imputation <- mlr::impute(missing_diam, cols = list(diam_cm = imputeMean()))
# The impute() function returns a list with a description, so just grab data
imputed_data <- imputation$data

# Check there are no longer zero values or NAs in diam_cm
print("Summary of 'diam_cm' after imputation of zeros")
## [1] "Summary of 'diam_cm' after imputation of zeros"
imputed_data$diam_cm %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.54   12.70   25.40   29.76   40.64 1143.00

Bounding Box for Latitude & Longitude

It is also useful to check that the latitude and longitude values in the data are not outside the geographic extents of New York City. One quick way to check is to compare the minimum and maximum values for latitude and longitude to the bounding box of the city, which can be found in the metadata for the spatial file for the New York City boroughs on the NYC Open Data Portal. It gives the bounding box values as:

# Bounding box from metadata for NYC Borough Boundary on NYC Open Data Portal
# https://www1.nyc.gov/assets/planning/download/pdf/data-maps/open-data/nybb_metadata.pdf?ver=18c
bbox_N <-  40.915568 # latitude must be less than north extent
bbox_S <-  40.495992 # latitude must be greater than south extent
bbox_E <- -73.699215 # longitude must be less than east extent
bbox_W <- -74.257159 # longitude must be greater than west extent
bbox <- data.frame(BoundingBox = c("Latitude","Longitude"),
                   Extents  = c(paste(bbox_N, "degrees North to", bbox_S, "degrees North"),
                                paste(bbox_W, "degrees West to", bbox_E, "degrees West")))
bbox %>% knitr::kable(format = "latex", booktabs = TRUE, 
                      caption = "Bounding Box for New York City")

A rule can be set using the editset() function of the editrules package that states the following conditions:

  • latitude values must fall between the northern and southern extents of the bounding box

  • longitude value must fall between the western and eastern extents of the bounding box

That rule can then be checked using the violatedEdits() function of the editrules package. A summary of that check confirms that none of the rows in the final dataset have latitude or longitude values that fall outside the bounding box of the city. This does not represent a full spatial check, because the actual city boundaries are not rectangular. A point that satisfies the bounding box rule could still fall in Hoboken, New Jersey, for example, since that city is both east and north of Staten Island. It could also fall in the middle of the East River, since the bounding box contains no information about which areas are land or water. But it is still a useful quick check for obvious errors at this point in the preprocessing stage.

bounds_rule <- editset(c("latitude  >= bbox_S", "latitude  <= bbox_N",
                         "longitude >= bbox_W", "longitude <= bbox_E"))

out_of_bounds <- violatedEdits(bounds_rule, imputed_data)
summary(out_of_bounds)
## NULL

Scan II: Outliers

There are two main numeric columns in the dataset: tree diameter and estimated median household income of the census tract that each tree falls within. The following table provides basic summary statistics for each.

# Summary statistics for the main numeric variables
stat_labs <- c("Minimum","1st Quartile","Median","Mean","3rd Quartile","Maximum")
stat_diam <- round(as.vector(summary(imputed_data$diam_cm)[1:6]),2)
stat_inc  <- as.vector(summary(imputed_data$Median_HH_Income)[1:6])
stat_lat  <- as.vector(summary(imputed_data$latitude)[1:6])
stat_long <- as.vector(summary(imputed_data$longitude)[1:6])
num_stats <- data.frame(stat_labs, stat_diam, stat_inc)#, stat_lat, stat_long)
colnames(num_stats) = c("Statistic","diam_cm","Median_HH_Income")#,"Latitude","Longitude")
num_stats %>% knitr::kable(format = "latex", booktabs = TRUE, 
                           caption = "Summary Statistics for Numeric Variables")

There appear to be many significant outliers at the high end of the ‘diam_cm’ column. The maximum diameter recorded is 1143 cm, which is almost the same diameter as the widest tree in the world - a Montezuma cypress in Oaxaca, Mexico called the ‘Arbol del Tule’. Many of the large values do not seem plausible for street trees in New York City. It is possible this is due to one or more user errors, such as measuring in centimetres but recording the results as inches, or mistakes converting from circumference to diameter. Committing both errors together could lead to an eightfold overstatement. There do not appear to be systematic errors that could be solved with a global correction, nor patterns in the outliers that could lead to suitable corrections. Typical trunk diameters of different species could be used to cap tree diameters at a maximum for each. But there are 132 different species represented in the dataset, making the research required to set these maximums quite a time-consuming task.
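The eightfold figure follows from compounding the two hypothesised mistakes:

# Centimetres recorded as inches overstates by a factor of 2.54;
# circumference recorded as diameter overstates by a factor of pi;
# committed together the error is roughly eightfold
2.54 * pi  # 7.979645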

# Based on template from the R Graph Gallery
# https://www.r-graph-gallery.com/82-boxplot-on-top-of-histogram.html

# Create data 
my_variable = imputed_data$diam_cm
 
# Layout to split the screen
layout(mat = matrix(c(1,2),2,1, byrow=TRUE),  height = c(1,8))
 
# Draw the boxplot and the histogram 
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(my_variable , horizontal=TRUE , xaxt="n" , col='palegreen' , frame=F)
par(mar=c(4, 3.1, 1.1, 2.1))
hist(my_variable , breaks=40 , col='light grey' , border=T , main="", 
     xlab="Tree Diameter (cm)", xlim=c(-10,1200))

There are outliers in the median household income column, but they are not necessarily problematic. Income levels are quite stratified across the United States, particularly in a city as large and diverse as New York. But the frequency with which they appear in the data is also related to the number of trees in each census tract. Those tracts with many trees will have their median household income repeated many times in the dataset.
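The extent of that repetition can be checked by counting trees per census tract (a quick sketch using the count() function from dplyr):

# Each tree is one row, so a tract's income value repeats once per tree
imputed_data %>% count(boro_ct) %>% summary()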

# ggplot versions of the distribution plots (using pre-imputation full_data)
db     <- full_data %>% ggplot(aes(y=diam_cm)) + geom_boxplot()
dh     <- full_data %>% ggplot(aes(x=diam_cm)) + geom_histogram()
plot_grid(dh, db, labels = "AUTO")
full_data %>% ggplot(aes(x=Median_HH_Income)) + geom_histogram()
full_data %>% ggplot(aes(y=Median_HH_Income)) + geom_boxplot()
# Based on template from the R Graph Gallery
# https://www.r-graph-gallery.com/82-boxplot-on-top-of-histogram.html

# Create data 
my_variable = imputed_data$Median_HH_Income
 
# Layout to split the screen
layout(mat = matrix(c(1,2),2,1, byrow=TRUE),  height = c(1,8))
 
# Draw the boxplot and the histogram 
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(my_variable , horizontal=TRUE , xaxt="n" , col='palegreen' , frame=F)
par(mar=c(4, 3.1, 1.1, 2.1))
hist(my_variable , breaks=40 , col='light grey' , border=T , main="" ,
     xlab="Median Household Income (USD)")

Transform

The main numeric variables in the dataset (diam_cm and Median_HH_Income) do not follow a normal distribution. Many statistical modelling techniques rely on assumptions of normality and homogeneity of variances. It can therefore be useful to transform variables that do not meet these criteria so that they do.

The diam_cm variable is highly skewed to the right, since most of the trees have small diameters and very few have large ones. The BoxCox.lambda() function of the forecast package performs an automatic search for the best lambda value for a Box-Cox transformation. In this case that value is very close to zero. A lambda of 0 in a Box-Cox transformation is equivalent to a log transformation, so that transformation is used for the ‘diam_cm’ variable and stored in a new variable called ‘log_diam_cm’.
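That search can be reproduced with BoxCox.lambda() using its default Guerrero method (a result near zero supports the log choice):

# Automatic search for the best Box-Cox lambda for tree diameter
BoxCox.lambda(imputed_data$diam_cm)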

# Compare candidate transformations for diam_cm
dh      <- imputed_data %>% ggplot(aes(x=diam_cm)) + geom_histogram() + theme_grey()
dh_log  <- ggplot(imputed_data, aes(x=log(diam_cm))) + geom_histogram() + theme_grey()
dh_sqrt <- ggplot(imputed_data, aes(x=sqrt(diam_cm))) + geom_histogram() + theme_grey()
dh_2    <- ggplot(imputed_data, aes(x=(diam_cm^2))) + geom_histogram() + theme_grey()

#lambda_guerrero_diam_cm <- BoxCox.lambda(full_data$diam_cm)
#bc_lambda_loglik_diam_cm   <- BoxCox.lambda(full_data$diam_cm+1, method = "loglik")

plot_grid(dh + ggtitle("Before Transformation"), dh_log + ggtitle("log Transformation"),
          dh_sqrt + ggtitle("Square Root Transformation"), dh_2 + ggtitle("Squared Transformation"))

The median household income variable is also skewed to the right, with more trees in lower-income census tracts than higher-income ones. The plots below show a number of different possible transformations of the data, with the square root transformation giving the most symmetric distribution. This is the transformation that will be used, in a new column called ‘sqrt_Median_HH_Income’.

ih      <- imputed_data %>% ggplot(aes(x=Median_HH_Income)) + geom_histogram() + theme_grey()
ih_log  <- ggplot(imputed_data, aes(x=log(Median_HH_Income))) + geom_histogram() + theme_grey()
ih_sqrt <- ggplot(imputed_data, aes(x=sqrt(Median_HH_Income))) + geom_histogram() + theme_grey()
ih_2    <- ggplot(imputed_data, aes(x=(Median_HH_Income^2))) + geom_histogram() + theme_grey()
plot_grid(ih + ggtitle("Before Transformation"), ih_log + ggtitle("log Transformation"),
          ih_sqrt + ggtitle("Square Root Transformation"), ih_2 + ggtitle("Square Transformation"))

transformed_data <- imputed_data %>%
  mutate(log_diam_cm = log(diam_cm),
         sqrt_Median_HH_Income = sqrt(Median_HH_Income))
transformed_data %>% head() %>% select(tree_id, diam_cm, log_diam_cm, 
                                       Median_HH_Income, sqrt_Median_HH_Income)

Final Data

The final dataset is created from a selected set of variables after all of the preprocessing operations have been completed: data type changes, renaming columns, combining columns, imputing missing values, correcting obvious errors, accounting for outliers, creating new features, and transforming skewed numeric variables. The following is a feature summary of the final table for analysis.

final_data <- transformed_data %>%
  select(tree_id, created_at, diam_cm, log_diam_cm, spc_common, curb_loc, 
         guards, sidewalk, condition, steward, Stones, MetalGrates, 
         RootOther, WiresRope, TrunkLights, TrunkOther, BranchLights, 
         Sneakers, BranchOther, latitude, longitude, Income_Lower, 
         Median_HH_Income, Income_Upper, sqrt_Median_HH_Income)
final_summary <- final_data %>% summarizeColumns() 
final_summary %>% select(-mad, -disp) %>%
  knitr::kable(format = "latex", booktabs = TRUE, 
               caption = "Final Feature Summary After Preprocessing")

Data Sources

NYC Street Trees Data:

City of New York 2020, 2015 Street Tree Census - Tree Data, City of New York, accessed 1 June 2020, https://data.cityofnewyork.us/Environment/2015-Street-Tree-Census-Tree-Data/uvpi-gqnh.

ACS Median Household Income Data:

United States Census Bureau 2020, B19013: MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2018 INFLATION-ADJUSTED DOLLARS), United States Census Bureau, accessed 1 June 2020, https://data.census.gov/cedsci/table?q=B19013%3A%20MEDIAN%20HOUSEHOLD%20INCOME%20IN%20THE%20PAST%2012%20MONTHS%20%28IN%202018%20INFLATION-ADJUSTED%20DOLLARS%29&hidePreview=true&tid=ACSDT1Y2018.B19013.

New York City Spatial Data:

New York City Department of Planning 2020, 2010 Census Tracts, City of New York, accessed 1 June 2020, https://data.cityofnewyork.us/City-Government/2010-Census-Tracts/fxpq-c8ku.

References

City of New York 2020, 2015 Street Tree Census - Data Dictionary, City of New York, accessed 1 June 2020, https://data.cityofnewyork.us/api/views/uvpi-gqnh/files/8705bfd6-993c-40c5-8620-0c81191c7e25?download=true&filename=StreetTreeCensus2015TreesDataDictionary20161102.pdf.

Holtz, Y 2018, Boxplot on top of histogram, The R Graph Gallery, viewed 1 June 2020, https://www.r-graph-gallery.com/82-boxplot-on-top-of-histogram.html.

Lallanilla, M 2019, The Greatest Trees of the World, the spruce, accessed 1 June 2020, https://www.thespruce.com/great-trees-of-the-world-1709081#:~:text=The%20Widest%20Tree%20in%20the%20World&text=Measured%20at%20the%20standard%20breast,can%20match%20Arbol%20del%20Tule.

NYC Parks 2015, TreesCount 2015: Training Manual, City of New York and TreeKIT, viewed 1 June 2020, https://www.nycgovparks.org/pagefiles/116/trees-count-2015-training__592dcbad8488f.pdf.

NYC Parks 2017, 2015 STREET TREE CENSUS REPORT, City of New York, viewed 1 June 2020, http://media.nycgovparks.org/images/web/TreesCount/Index.html.

NYC Parks 2017, TreesCount! 2015-2016 Street Tree Census, City of New York, viewed 1 June 2020, https://www.nycgovparks.org/trees/treescount.

United States Census Bureau, American Community Survey (ACS), United States Census Bureau, viewed 1 June 2020, https://www.census.gov/programs-surveys/acs.

United States Census Bureau 2018, DATA GEMS: What is a Census Tract? Making Sense of Census Geography, viewed 1 June 2020, https://www.census.gov/data/academy/data-gems/2018/tract.html.

United States Census Bureau, Explore Census Data, United States Census Bureau, viewed 1 June 2020, https://data.census.gov/cedsci/.

United States Census Bureau 2020, Understanding Geographic Identifiers (GEOIDs), viewed 1 June 2020, https://www.census.gov/programs-surveys/geography/guidance/geo-identifiers.html.

Wickham, H 2014, ‘Tidy Data’, Journal of Statistical Software, Vol. 59, Issue 10.