Data Visualization - Python: NYC Tree Census

Introduction

This project explores the 2015 New York City Street Tree Census using Python visualizations. In a densely urban environment, I wanted to better understand the relationships between tree species, tree diameter, stewardship, health, and geographic location. New York City is known for its varying neighborhood identities, so I was curious whether that variety extends to its plant life as well. Five chart types (a bar chart, a strip plot, a histogram, a boxplot, and a map) are used for this exploration.

Dataset

The dataset used for this exploration comes from the 2015 New York City Street Tree Census. It includes information on tree species, diameter, health, stewardship, and geographic coordinates. The dataset is particularly large, so a random sample of 5,000 trees is used throughout (and a further sample of 2,000 for the map) to keep the charts readable and easier to interpret. The main variables used in this project are spc_common, tree_dbh, steward, health, latitude, and longitude.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

sns.set_theme(style="whitegrid")

# Load the dataset
df = pd.read_csv("/Users/loganvarra/Downloads/2015_Street_Tree_Census_-_Tree_Data_20260322.csv")

# Sample once for consistency
df = df.sample(5000, random_state=1)
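
As a small illustration of why a fixed random_state matters for the sampling step, the sketch below draws the same sample twice from a made-up frame and checks that the rows match. The `toy` frame, its column values, and its size are placeholders, not the census file.

```python
import pandas as pd

# Hypothetical stand-in for the census table; only the shape matters here.
toy = pd.DataFrame({
    "tree_id": range(100),
    "tree_dbh": [i % 30 for i in range(100)],
})

# Two draws with the same seed select the same rows in the same order.
first = toy.sample(10, random_state=1)
second = toy.sample(10, random_state=1)
same_rows = first.index.equals(second.index)

# A different seed is a separate draw and will generally pick other rows.
other = toy.sample(10, random_state=2)

print("Same rows with the same seed:", same_rows)
print("Rows drawn:", len(first), len(other))
```

Sampling once and reusing the result, as done above, means every chart describes the same 5,000 trees.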

Descriptive Statistics

The descriptive statistics summarize tree diameter and show the number of observations across the health and stewardship categories. These summaries offer some context for the distribution of tree size, along with frequency of observations across categories.

## count    5000.00000
## mean       11.16600
## std         8.36744
## min         0.00000
## 25%         4.00000
## 50%         9.00000
## 75%        16.00000
## max        53.00000
## Name: tree_dbh, dtype: float64

## health
## Good    3856
## Fair     718
## Poor     182
## Name: count, dtype: int64

## steward
## 1or2       1050
## 3or4        122
## 4orMore      14
## Name: count, dtype: int64
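
The category counts above add up to fewer than 5,000 rows (4,756 for health and 1,186 for steward). That is the pattern value_counts() produces when some rows have no value, because it skips missing entries by default. Below is a minimal sketch of calls that yield summaries of this shape, run on a made-up frame (the `toy` values are placeholders), with dropna=False used to make the missing rows visible.

```python
import pandas as pd

# Hypothetical stand-in for the sampled census data.
toy = pd.DataFrame({
    "tree_dbh": [4, 9, 16, 0, 25, 11],
    "health": ["Good", "Good", "Fair", None, "Poor", "Good"],
    "steward": ["1or2", None, None, None, "3or4", None],
})

# Numeric summary: count, mean, std, min, quartiles, max.
dbh_summary = toy["tree_dbh"].describe()

# Default behavior: rows with a missing category are left out of the counts.
health_counts = toy["health"].value_counts()

# dropna=False adds a row for the missing values, so the counts sum to len(toy).
steward_counts = toy["steward"].value_counts(dropna=False)

print(dbh_summary)
print(health_counts)
print(steward_counts)
```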

Visualizations

This section presents five visualizations, each examining a different aspect of the dataset.

Top Tree Species

The most common tree species are visualized using a bar chart, with the ten most common species shown in descending order. The most prevalent is the London Planetree, with a count of 654. The tenth most common is the Sophora, with a count of 154. The counts are concentrated in a handful of species, which suggests that tree diversity within NYC may be limited. This graph does not explain why diversity would be limited, but urban factors and stewardship may play a role. Additionally, because a sample was used for analysis (due to the size of the dataset), less common species may exist in populations too small to appear in the top ten.

# Count the top 10 most common tree species
common_species = df["spc_common"].value_counts().head(10).reset_index()
# Renaming the columns for consistency
common_species.columns = ["Species", "Count"]

# Create horizontal bar chart
plt.figure(figsize=(10, 6))
ax = sns.barplot(data=common_species, y="Species", x="Count", hue="Species", dodge=False, palette="Greens_r", legend=False)

# Label each bar with its count: i is the bar's position on the y-axis, v is the count.
# Labels are drawn in black and offset 5 units to the right of the bar end for legibility.
for i, v in enumerate(common_species["Count"]):
    ax.text(v + 5, i, f"{v:,}", va="center", fontsize=10, color="black")

plt.title("Top 10 Most Common Tree Species in the NYC Sample")
plt.xlabel("Count")
plt.ylabel("Tree Species")
plt.tight_layout()
plt.show()
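
To put a number on how concentrated the species counts are, one option is to compute the share of all trees that belong to the top species. The sketch below does this on a made-up species list (the names and counts are placeholders, and `top_n` is set to 3 only because the toy data is small); on the real sample the same calls would be applied to df["spc_common"] with head(10).

```python
import pandas as pd

# Hypothetical species column standing in for df["spc_common"].
species = pd.Series(
    ["London planetree"] * 5
    + ["honeylocust"] * 3
    + ["pin oak"] * 2
    + ["ginkgo", "red maple"],
    name="spc_common",
)

# normalize=True returns each species' fraction of all non-missing rows,
# sorted from most to least common.
shares = species.value_counts(normalize=True)

# Sum the fractions of the top_n species to get their combined share.
top_n = 3
top_share = shares.head(top_n).sum()

print(f"Top {top_n} species cover {top_share:.0%} of the toy sample")
```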

Tree Diameter by Stewardship Level

This strip plot compares tree diameter across stewardship levels, which are recorded as “None,” “1 or 2,” “3 or 4,” or “4 or more.” A higher level indicates more signs of human care and interaction with a tree. Diameters vary widely within every level. The plot offers little evidence about how trees with no stewardship fare, while trees with some level of stewardship are seen reaching thicker diameters. However, diameter may also depend on species, age, and environmental factors, so the plot alone does not show that stewardship leads to larger growth.

import matplotlib.pyplot as plt
import seaborn as sns

# Keep only rows with stewardship and tree diameter values
df_stew = df.dropna(subset=["steward", "tree_dbh"]).copy()

# Clean stewardship labels for readability
df_stew["steward"] = df_stew["steward"].replace({
    "None": "None",
    "1or2": "1 or 2",
    "3or4": "3 or 4",
    "4orMore": "4 or More"
})
# Define the category order for the x-axis, from least to most stewardship.
# Many points share similar diameters, so they would stack on top of each other (overplotting);
# jitter (set below) spreads them horizontally, and alpha makes dense regions visible.
order = ["None", "1 or 2", "3 or 4", "4 or More"]

# Create strip plot
plt.figure(figsize=(10, 6))
sns.stripplot(
    data=df_stew,
    x="steward",
    y="tree_dbh",
    order=order,
    hue="steward",
    dodge=False,
    palette="Greens",
    alpha=0.4,
    jitter=True,
    legend=False
)

plt.title("Tree Diameter by Stewardship Level")
plt.xlabel("Stewardship Level")
plt.ylabel("Tree Diameter (DBH, inches)")
plt.tight_layout()
plt.show()
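
The steward counts in the descriptive statistics show no "None" row, and the dropna call above would remove such rows if they were read in as missing. One possible cause, assuming the raw file stores that level as the literal text None, is that recent pandas versions treat that string as a missing value in read_csv by default. The sketch below uses a tiny in-memory CSV (made-up rows, not the census file) to show one way to keep the label: turn off the default list and treat only empty fields as missing.

```python
import io

import pandas as pd

# Hypothetical rows: two trees labeled "None" and one with a blank steward field.
csv_text = (
    "tree_id,steward,tree_dbh\n"
    "1,None,4\n"
    "2,1or2,9\n"
    "3,,0\n"
    "4,None,16\n"
)

# Default parsing: which strings count as missing depends on the pandas version.
default_read = pd.read_csv(io.StringIO(csv_text))

# Stricter parsing: only empty fields become NaN, so "None" survives as a label.
strict_read = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,
    na_values=[""],
)

print("Missing steward values (default):", default_read["steward"].isna().sum())
print("Missing steward values (strict): ", strict_read["steward"].isna().sum())
print(strict_read["steward"].value_counts())
```

Whether this applies to the census file depends on how the file encodes the category, so it is worth checking the steward counts with dropna=False before relying on the "None" group in the plot.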

Distribution of Tree Diameter

This visualization is a histogram of tree diameter across the sampled observations. There is a clear right-skewed pattern: a majority of trees have smaller diameters, while a small subset has much larger diameters, creating a long tail to the right of the distribution. This suggests that most of the trees in the dataset are either young or belong to smaller-growing species. In a densely urban environment this is plausible, as space and resources for large trees are limited.

import matplotlib.pyplot as plt
import seaborn as sns

# Remove rows with missing diameter values
df_diam = df.dropna(subset=["tree_dbh"])

# Create histogram with density curve
plt.figure(figsize=(10, 6))
sns.histplot(
    data=df_diam,
    x="tree_dbh",
    bins=30,
    # Overlay a kernel density estimate (KDE) curve on the histogram
    # to highlight where diameters are most concentrated
    kde=True,
    # Soft green fill for the bars
    color="#66BB6A",
    # White edges separate adjacent bars
    edgecolor="white",
    linewidth=1
)

plt.title("Distribution of Tree Diameter")
plt.xlabel("Tree Diameter (DBH, inches)")
plt.ylabel("Frequency")
# Anchor both axes at zero so the chart starts at the (0, 0) corner
plt.xlim(left=0)
plt.ylim(bottom=0)

plt.tight_layout()
plt.show()
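
The right skew seen in the histogram can also be checked numerically. In the descriptive statistics, the mean diameter (11.17) sits above the median (9), which is the usual signature of a long right tail. A minimal sketch of that check, using made-up diameters rather than the census sample:

```python
import pandas as pd

# Hypothetical diameters: mostly small trees plus one very large one.
dbh = pd.Series([2, 3, 3, 4, 4, 5, 6, 8, 12, 30], name="tree_dbh")

mean_dbh = dbh.mean()
median_dbh = dbh.median()

# Series.skew() returns the sample skewness; positive values indicate
# a longer tail on the right side of the distribution.
skewness = dbh.skew()

print(f"mean={mean_dbh:.2f}, median={median_dbh:.2f}, skew={skewness:.2f}")
```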

Tree Diameter by Health Condition

This boxplot compares tree diameter across health categories, showing both central tendency and variability. Trees in the “Good” category have a higher median diameter and a wider range, which may mean that healthier trees tend to be larger and more varied in size. Trees in “Poor” health are generally smaller, and trees that are struggling to grow may make up part of this group. Every group shows many outliers, which suggests that health and size may be related but are far from perfectly correlated.

import matplotlib.pyplot as plt
import seaborn as sns

# Keep only rows with health and diameter values
df_health = df.dropna(subset=["health", "tree_dbh"]).copy()

order = ["Poor", "Fair", "Good"]

# Create boxplot
plt.figure(figsize=(10, 6))
sns.boxplot(
    data=df_health,
    x="health",
    y="tree_dbh",
    order=order,
    hue="health",
    dodge=False,
    palette="Greens",
    legend=False
)

plt.title("Tree Diameter by Health Condition")
plt.xlabel("Health Condition")
plt.ylabel("Tree Diameter (DBH, inches)")
plt.tight_layout()
plt.show()
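
The quantities a boxplot draws can be tabulated directly, which makes the comparison between health groups easier to state precisely. The sketch below computes the median, the interquartile range, and the number of points beyond 1.5 times the interquartile range (the whisker rule seaborn uses by default) for each group. The `toy` frame and the `box_stats` helper are illustrative names, not part of the original analysis.

```python
import pandas as pd

# Hypothetical stand-in for the health and diameter columns.
toy = pd.DataFrame({
    "health": ["Good"] * 6 + ["Poor"] * 5,
    "tree_dbh": [5, 8, 10, 12, 14, 40, 2, 3, 4, 5, 6],
})


def box_stats(values: pd.Series) -> pd.Series:
    """Return the median, IQR, and count of points outside the 1.5 * IQR fences."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return pd.Series({"median": median, "iqr": iqr, "outliers": outside.sum()})


# One row of statistics per health category.
rows = {name: box_stats(group) for name, group in toy.groupby("health")["tree_dbh"]}
summary = pd.DataFrame(rows).T

print(summary)
```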

Geographic Distribution of Trees

This interactive map shows the geographic distribution of sampled trees across New York City. Marker size and color both represent tree diameter, making it easier to identify where larger trees appear within the sampled observations, and the map can be panned and zoomed to explore different areas. A variety of tree types and sizes appears across all the boroughs, but certain areas seem to have a higher concentration of larger, more established trees than others. For example, lower Manhattan generally has smaller and sparser trees than parts of Brooklyn. This is consistent with the built environment of the two areas: lower Manhattan is very densely developed, with large buildings and infrastructure.

# This chunk is set to eval=FALSE so the code is shown but not executed; an earlier chunk was run to render the visualization.
import plotly.express as px

# Keep rows with valid coordinates and diameter values
df_map = df.dropna(subset=["latitude", "longitude", "tree_dbh"]).sample(2000, random_state=1)

# Create an interactive scatter map of tree locations across New York City
fig = px.scatter_map(
    df_map,
    lat="latitude",
    lon="longitude",
    color="tree_dbh",
    size="tree_dbh",
    size_max=15,
    color_continuous_scale="YlGn",
    hover_name="spc_common",
    hover_data={
        "tree_dbh": True,
        "health": True,
        "steward": True,
        "latitude": False,
        "longitude": False
    },
    zoom=10,
    height=700,
    title="Geographic Distribution of Trees in New York City"
)

# Improve map appearance and label the color scale more clearly
fig.update_layout(
    map_style="open-street-map",
    coloraxis_colorbar=dict(
        title="Tree Diameter (inches)"
    ),
    title_x=0.5
)

# Make markers slightly transparent to reduce overlap
fig.update_traces(
    marker=dict(opacity=0.6)
)

# Convert the Plotly figure to HTML so it can be rendered once in R Markdown
map_html = fig.to_html(full_html=False, include_plotlyjs="cdn")
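
The visual comparison between areas on the map could be backed by a simple per-borough summary. The sketch below assumes the table carries a borough-name column, called `borough` here; that column name and the `toy` values are assumptions for illustration and are not among the variables used above.

```python
import pandas as pd

# Hypothetical stand-in: borough names paired with tree diameters.
toy = pd.DataFrame({
    "borough": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Brooklyn"],
    "tree_dbh": [4, 6, 10, 14, 18],
})

# Number of trees plus median and mean diameter per borough,
# sorted so the borough with the largest typical tree comes first.
by_borough = (
    toy.groupby("borough")["tree_dbh"]
    .agg(["count", "median", "mean"])
    .sort_values("median", ascending=False)
)

print(by_borough)
```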

Conclusion

As a whole, this analysis highlights several important patterns within the New York City tree population. The data suggests that a relatively small number of species dominate NYC, reflecting the difficult balance between urban development and maintaining natural biodiversity. Tree diameter is not evenly distributed, with most trees being smaller and only a limited number reaching larger sizes. This leads to the interpretation that NYC is home to more young or small-growing trees than large ones.

Additionally, stewardship and health appear to have some relationship with tree size, although neither is perfectly correlated with it and neither fully explains the variation. This suggests that multiple influences, including species, location, and environmental conditions, affect tree growth and size. The geographic visualization further shows that trees are widespread across the city but that their characteristics vary spatially, which sheds some light on the differences between NYC neighborhoods and boroughs.

Overall, these visualizations demonstrate how data can be used to better understand urban forestry and highlight the ongoing work needed to maintain a diverse and healthy tree population in a dense metropolitan environment.