library(ggplot2)
library(dplyr)
library(readr)
R has two main plotting systems: base R
(plot(), hist(), barplot()) and
ggplot2. We use ggplot2 because it follows a consistent
grammar — once you learn the grammar, you can build any
visualization by combining a small set of building blocks.
The “gg” in ggplot2 stands for Grammar of Graphics, a framework where every plot is described by the same set of components.
In this tutorial, we learn the grammar using real FIA (Forest Inventory and Analysis) data — the same dataset you will use throughout this course to investigate tree species distributions, community composition, and biogeographic patterns across the eastern United States.
Every ggplot2 plot is built from these components:
| Component | What it does | Required? |
|---|---|---|
| Data | The data frame to plot | Yes |
| Aesthetics (aes()) | Maps columns to visual properties (x, y, color, size, …) | Yes |
| Geom | The type of mark to draw (points, bars, lines, …) | Yes |
| Facets | Split into sub-panels | No |
| Stats | Statistical transformations (counts, means, smoothers) | No (many geoms have defaults) |
| Scales | Control how data values map to visual values (axis limits, color palettes) | No (sensible defaults) |
| Theme | Non-data visual elements (fonts, gridlines, background) | No (sensible defaults) |
You build a plot by adding layers with the
+ operator:
ggplot(data, aes(x = ..., y = ...)) +
geom_...() +
labs() +
theme_...()
We will use the eastern US FIA dataset containing ~2.8 million tree records across 31 states. Let’s load and prepare the data.
# Load tree data and species reference table
tree <- read_csv("fia_eastern31_recent.csv", show_col_types = FALSE)
ref_species <- read_csv("REF_SPECIES_trimmed.csv", show_col_types = FALSE)
# Join to get common names
tree <- tree %>% left_join(ref_species, by = "SPCD")
# Filter to live trees only
live <- tree %>% filter(STATUSCD == 1)
# For quick examples, we'll also create a Tennessee subset
tn <- live %>% filter(STATE_ABBR == "TN")
# Look at the data
head(tn, 5)
## # A tibble: 5 × 48
## TREE_CN PLT_CN INVYR STATECD COUNTYCD PLOT SUBP SPCD SPGRPCD DIA HT
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.34e14 5.04e14 2017 47 1 1 1 611 34 13.3 101
## 2 7.34e14 5.04e14 2017 47 1 1 1 611 34 9.2 62
## 3 7.34e14 5.04e14 2017 47 1 1 1 621 39 11.8 87
## 4 7.34e14 5.04e14 2017 47 1 1 1 611 34 10.3 74
## 5 7.34e14 5.04e14 2017 47 1 1 1 611 34 5.9 62
## # ℹ 37 more variables: ACTUALHT <dbl>, STATUSCD <dbl>, CR <dbl>, CCLCD <dbl>,
## # TPA_UNADJ <dbl>, DRYBIO_AG <dbl>, CARBON_AG <dbl>, COND_STATUS_CD <dbl>,
## # FORTYPCD <dbl>, SITECLCD <dbl>, STDAGE <dbl>, OWNCD <dbl>, OWNGRPCD <dbl>,
## # SLOPE <dbl>, ASPECT <dbl>, DSTRBCD1 <dbl>, BALIVE <dbl>, LAT <dbl>,
## # LON <dbl>, ELEV <dbl>, ECOSUBCD <chr>, MEASYEAR <dbl>, TPA <dbl>,
## # EXPALL <dbl>, EXPCURR <dbl>, GRIDID <dbl>, centroidLat <dbl>,
## # centroidLon <dbl>, STATE_ABBR <chr>, COMMON_NAME <chr>, GENUS <chr>, …
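Before plotting, it can be worth confirming that the join behaved as intended. The sketch below is not part of the original workflow: it uses two tiny made-up tables (toy_tree and toy_ref are hypothetical stand-ins, not the FIA files) to show two checks. The first confirms that left_join() did not change the row count; the second lists species codes that found no match in the reference table.

```r
library(dplyr)

# Hypothetical stand-ins for the tree table and the species reference table
toy_tree <- tibble(
  SPCD = c(611, 611, 621, 999),   # 999 is a made-up code with no reference entry
  DIA  = c(13.3, 9.2, 11.8, 7.0)
)
toy_ref <- tibble(
  SPCD        = c(611, 621),
  COMMON_NAME = c("sweetgum", "yellow-poplar")
)

joined <- toy_tree %>% left_join(toy_ref, by = "SPCD")

# Check 1: with one reference row per SPCD, a left join keeps the row count
same_rows <- nrow(joined) == nrow(toy_tree)

# Check 2: species codes in the tree table that have no match in the reference
unmatched <- toy_tree %>%
  anti_join(toy_ref, by = "SPCD") %>%
  distinct(SPCD)

same_rows
unmatched
```

Unmatched codes end up with NA in COMMON_NAME, which matters later whenever species are counted or filtered by name.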
Key columns we will use:
| Column | Description |
|---|---|
| DIA | Diameter at breast height (inches) |
| HT | Total tree height (feet) |
| COMMON_NAME | Species common name (from REF_SPECIES) |
| SFTWD_HRDWD | Softwood (“S”) or Hardwood (“H”) |
| STATE_ABBR | State abbreviation |
| centroidLat, centroidLon | Grid cell center coordinates |
| ELEV | Elevation (feet) |
| BALIVE | Live-tree basal area per acre (sq ft) |
ggplot(data = tn, aes(x = DIA, y = HT)) +
geom_point()
What happened here:
- ggplot(data = tn, ...): told ggplot which data frame to use (Tennessee trees)
- aes(x = DIA, y = HT): mapped diameter to the x-axis and tree height to the y-axis
- geom_point(): drew each tree as a point
You can already see an ecological pattern: diameter and height are positively correlated, but the relationship isn’t linear. Height plateaus at larger diameters. This is a fundamental growth pattern in trees.
The aes() function maps columns in your
data to visual properties of the plot. Common
aesthetics include:
| Aesthetic | Controls |
|---|---|
| x | Position on x-axis |
| y | Position on y-axis |
| color | Color of points, lines, outlines |
| fill | Fill color of bars, areas |
| size | Size of points |
| shape | Shape of points |
| alpha | Transparency (0 = invisible, 1 = opaque) |
| linetype | Line style (solid, dashed, dotted) |
ggplot(tn, aes(x = DIA, y = HT, color = SFTWD_HRDWD)) +
geom_point(size = 1.5, alpha = 0.3) +
labs(color = "Wood Type")
Each wood type gets a different color automatically. ggplot2 added a legend because the color carries information. Notice that softwoods (conifers like pines) tend to be taller for a given diameter than hardwoods — they invest more in vertical growth.
There is an important difference:
- Mapping (inside aes()): links a visual property to a column, so it varies with the data
- Setting (outside aes()): applies a fixed value to all points
# MAPPING: color varies by wood type
ggplot(tn, aes(x = DIA, y = HT, color = SFTWD_HRDWD)) +
geom_point(alpha = 0.3)
# SETTING: all points are steelblue
ggplot(tn, aes(x = DIA, y = HT)) +
geom_point(color = "steelblue", alpha = 0.3, size = 1.5)
Common mistake: Putting a fixed color inside
aes():
# WRONG — ggplot treats "blue" as a data value, not a color
ggplot(tn, aes(x = DIA, y = HT, color = "blue")) +
geom_point()
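To see why this is a mistake, compare the colours ggplot2 actually computes in each case. The sketch below uses a small made-up data frame (toy is hypothetical) and layer_data(), which returns the data a layer draws after scales have been applied.

```r
library(ggplot2)

# Made-up values, only for illustration
toy <- data.frame(DIA = c(5, 10, 15), HT = c(40, 60, 75))

# Inside aes(): "blue" becomes a one-level data column, so the default
# discrete palette is used (not blue) and a legend is added
p_wrong <- ggplot(toy, aes(x = DIA, y = HT, color = "blue")) +
  geom_point()

# Outside aes(): the colour is set directly on the geom
p_right <- ggplot(toy, aes(x = DIA, y = HT)) +
  geom_point(color = "blue")

# Colours each layer will actually draw
unique(layer_data(p_wrong)$colour)
unique(layer_data(p_right)$colour)
```

Only the second plot reports "blue"; the first reports whatever colour the default palette assigns to its single category.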
Each geom_*() function draws a different type of mark.
Here are the ones you will use most often.
ggplot(tn, aes(x = DIA)) +
geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
labs(title = "Diameter Distribution of Trees in Tennessee",
x = "DBH (inches)", y = "Count")
Key arguments: binwidth controls the width of each bar.
Smaller = more detail, larger = smoother.
Notice the shape: a strong right skew, with many small trees and few large ones. This is the classic reverse-J diameter distribution — a hallmark of an uneven-aged, self-replacing forest. Most trees are young and small; only a few survive to become large.
# Compare diameter distributions: softwoods vs. hardwoods
ggplot(tn, aes(x = DIA, fill = SFTWD_HRDWD)) +
geom_density(alpha = 0.4) +
labs(title = "Diameter Distribution: Softwoods vs. Hardwoods (TN)",
x = "DBH (inches)", fill = "Wood Type")
alpha controls transparency — essential when overlapping
multiple curves.
ggplot(tn, aes(x = DIA, y = HT)) +
geom_point(alpha = 0.2, color = "darkgreen", size = 0.8) +
labs(title = "Tree Height vs. Diameter in Tennessee",
x = "DBH (inches)", y = "Height (ft)")
With ~200,000 points, alpha (transparency) is critical.
Without it, the plot would be a solid blob. Low alpha values let you see
density — where are the MOST trees concentrated?
ggplot(tn, aes(x = DIA, y = HT)) +
geom_point(alpha = 0.1, size = 0.5) +
geom_smooth(method = "loess", color = "red", se = TRUE) +
labs(title = "Height-Diameter Relationship (TN)",
x = "DBH (inches)", y = "Height (ft)")
- method = "loess" fits a local regression (flexible curve). Other options: "lm" for a straight line.
- se = TRUE shows the confidence band around the line.
Use geom_col() when you have already calculated the values (e.g., means, counts). Use geom_bar() when you want ggplot to count for you.
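As a quick check of that distinction, the sketch below draws the same bars both ways from a made-up species vector (toy is hypothetical, not FIA data) and compares the bar heights each layer computes.

```r
library(ggplot2)
library(dplyr)

# Made-up raw data: one row per tree
toy <- tibble(COMMON_NAME = c("red maple", "red maple", "white oak",
                              "red maple", "white oak", "loblolly pine"))

# geom_bar(): ggplot2 counts the rows in each category
p_bar <- ggplot(toy, aes(x = COMMON_NAME)) +
  geom_bar()

# geom_col(): we count first, then plot the pre-computed values
toy_counts <- toy %>% count(COMMON_NAME)
p_col <- ggplot(toy_counts, aes(x = COMMON_NAME, y = n)) +
  geom_col()

# Bar heights computed by each layer (sorted so the two are comparable)
bar_heights <- sort(layer_data(p_bar)$count)
col_heights <- sort(layer_data(p_col)$y)
bar_heights
col_heights
```

Both vectors hold the same counts, so the two plots look identical; the difference is only in who does the counting.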
# Count stems per species and get top 10
top10_tn <- tn %>%
count(COMMON_NAME, sort = TRUE) %>%
slice_head(n = 10)
ggplot(top10_tn, aes(x = reorder(COMMON_NAME, n), y = n)) +
geom_col(fill = "forestgreen") +
coord_flip() +
labs(title = "Top 10 Most Abundant Tree Species in Tennessee",
x = NULL, y = "Number of Stems")
- reorder(COMMON_NAME, n) orders the bars by count instead of alphabetically
- coord_flip() rotates the chart to make species names readable
# Compare diameter distributions of 5 common TN species
common5 <- c("yellow-poplar", "red maple", "white oak", "Virginia pine", "chestnut oak")
tn %>%
filter(COMMON_NAME %in% common5) %>%
ggplot(aes(x = COMMON_NAME, y = DIA, fill = COMMON_NAME)) +
geom_boxplot(show.legend = FALSE) +
labs(title = "Diameter Distribution of 5 Common TN Species",
x = NULL, y = "DBH (inches)") +
theme_minimal()
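The boxes above appear in alphabetical order. The reorder() call used earlier for the bar chart can sort them as well, and it helps to see what it returns on its own. A minimal sketch on made-up values (the two vectors are hypothetical): reorder() gives back a factor whose levels are sorted by a summary of a second variable, the mean by default or another function passed as FUN.

```r
# Made-up species labels and diameters, only for illustration
species <- c("red maple", "white oak", "red maple", "white oak", "chestnut oak")
dia     <- c(6, 14, 8, 18, 11)

# Default factor levels are alphabetical
levels(factor(species))

# Levels sorted by the median diameter of each species, smallest first
by_median <- reorder(species, dia, FUN = median)
levels(by_median)
```

Inside a plot, the same expression goes straight into the mapping, for example aes(x = reorder(COMMON_NAME, DIA, FUN = median), y = DIA).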
# Calculate mean diameter by latitude (1-degree bins)
lat_dia <- live %>%
mutate(lat_bin = round(centroidLat)) %>%
group_by(lat_bin) %>%
summarise(mean_DIA = mean(DIA, na.rm = TRUE), .groups = "drop")
ggplot(lat_dia, aes(x = lat_bin, y = mean_DIA)) +
geom_line(color = "darkblue", linewidth = 1) +
geom_point(color = "darkblue", size = 2) +
labs(title = "Mean Tree Diameter Along the Latitudinal Gradient",
x = "Latitude (°N)", y = "Mean DBH (inches)")
This reveals a biogeographic pattern: mean tree diameter varies with latitude. Think about why — does it reflect species composition, growing conditions, or forest management practices?
ggplot2’s power comes from stacking layers. Each
+ adds a new layer on top:
tn %>%
filter(COMMON_NAME %in% common5) %>%
ggplot(aes(x = DIA, y = HT, color = COMMON_NAME)) +
geom_point(alpha = 0.2, size = 1) + # layer 1: points
geom_smooth(method = "lm", se = FALSE, linewidth = 1) + # layer 2: trend lines
labs(
title = "Height-Diameter Relationships by Species",
x = "DBH (inches)",
y = "Height (ft)",
color = "Species"
) +
theme_minimal(base_size = 13)
Layers are drawn in order — later layers appear on top. This is why
we put geom_smooth() after geom_point().
Notice how the height-diameter relationship differs among species: Virginia pine (a softwood) grows taller for a given diameter, while oaks are shorter and wider. This reflects fundamental differences in growth strategy — pines invest in height to compete for light, while oaks invest in crown width and structural stability.
ggplot(top10_tn, aes(x = reorder(COMMON_NAME, n), y = n)) +
geom_col(fill = "steelblue") +
geom_text(aes(label = format(n, big.mark = ",")), hjust = -0.1, size = 3.2) +
coord_flip() +
labs(title = "Top 10 Tree Species in Tennessee",
x = NULL, y = "Number of Stems")
- aes(label = ...) tells geom_text() what to write
- hjust = -0.1 nudges the label slightly to the right of the bar end
- format(n, big.mark = ",") adds comma separators for readability
ggplot(tn, aes(x = DIA, y = HT)) +
geom_point(alpha = 0.1) +
labs(
title = "Height-Diameter Relationship",
subtitle = "Live trees in Tennessee from FIA data",
x = "Diameter at Breast Height (inches)",
y = "Total Height (feet)",
caption = "Source: USDA Forest Service FIA Database"
)
Faceting splits your data into panels based on a categorical variable. This is extremely useful for comparing groups.
tn %>%
filter(COMMON_NAME %in% common5) %>%
ggplot(aes(x = DIA)) +
geom_histogram(binwidth = 2, fill = "forestgreen", color = "white") +
facet_wrap(~ COMMON_NAME) +
labs(title = "Diameter Distributions by Species (TN)",
x = "DBH (inches)", y = "Count") +
theme_minimal()
facet_wrap(~ COMMON_NAME) creates one panel per species.
You can control the layout with ncol = 2 or
nrow = 1.
Compare the shapes: some species (like red maple) have many small stems — they are prolific seedlings and understory trees. Others (like white oak) have a broader distribution with more large individuals — they are long-lived canopy dominants.
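When species differ a lot in abundance, a shared y-axis can flatten the panels of the scarcer ones. One option, not used in the plot above, is the scales argument of facet_wrap(). The sketch below uses simulated diameters (toy is made up, with deliberately unequal group sizes) and combines ncol = 1 with scales = "free_y".

```r
library(ggplot2)

set.seed(1)
# Simulated diameters: one abundant and two scarcer "species" (not FIA data)
toy <- data.frame(
  COMMON_NAME = rep(c("red maple", "white oak", "chestnut oak"),
                    times = c(600, 150, 50)),
  DIA = 5 + c(rexp(600, rate = 1 / 4),
              rexp(150, rate = 1 / 8),
              rexp(50,  rate = 1 / 8))
)

p_free <- ggplot(toy, aes(x = DIA)) +
  geom_histogram(binwidth = 2, fill = "forestgreen", color = "white") +
  # One column of panels; each panel gets its own y-axis range
  facet_wrap(~ COMMON_NAME, ncol = 1, scales = "free_y") +
  labs(x = "DBH (inches)", y = "Count")

p_free
```

Free scales make the shape of each distribution easier to read, at the cost that bar heights are no longer comparable across panels.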
tn %>%
filter(COMMON_NAME %in% c("red maple", "white oak", "Virginia pine")) %>%
ggplot(aes(x = DIA, y = HT)) +
geom_point(alpha = 0.2, size = 0.5) +
facet_grid(SFTWD_HRDWD ~ COMMON_NAME) +
labs(title = "Height vs. Diameter by Species and Wood Type") +
theme_minimal()
facet_grid(rows ~ columns) creates a matrix of panels.
Here we split by wood type (rows) and species (columns). Notice that
Virginia pine only appears in the “S” (softwood) row, while red maple
and white oak appear only in “H” (hardwood) — each species belongs to
one wood type.
ggplot(tn, aes(x = DIA, y = HT, color = ELEV)) +
geom_point(alpha = 0.3, size = 1) +
scale_color_viridis_c(option = "viridis", name = "Elevation (ft)") +
labs(title = "Tree Size Colored by Elevation (TN)",
x = "DBH (inches)", y = "Height (ft)") +
theme_minimal()
Common color scales:
- scale_color_viridis_c(): great for continuous data, colorblind-friendly
- scale_color_brewer(palette = "Set2"): good for categorical data
- scale_color_manual(values = c("red", "blue")): pick your own
ggplot(tn, aes(x = DIA, y = HT)) +
geom_point(alpha = 0.1) +
scale_x_continuous(breaks = seq(0, 50, by = 10), limits = c(0, 50)) +
scale_y_continuous(breaks = seq(0, 120, by = 20)) +
labs(title = "Custom Axis Breaks and Limits",
x = "DBH (inches)", y = "Height (ft)")
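One caveat with limits: values outside a scale's limits are treated as missing and are not drawn (ggplot2 warns about removed rows), which also affects any statistics computed from them. coord_cartesian() zooms the view without discarding data. A minimal sketch on made-up values (toy is hypothetical):

```r
library(ggplot2)

# Made-up values; two trees have DIA above 50
toy <- data.frame(DIA = c(5, 20, 35, 55, 60),
                  HT  = c(40, 70, 90, 110, 115))

# Scale limits: the two trees with DIA > 50 are turned into NA
p_limits <- ggplot(toy, aes(x = DIA, y = HT)) +
  geom_point() +
  scale_x_continuous(limits = c(0, 50))

# Coordinate limits: same view, but every row is kept
p_zoom <- ggplot(toy, aes(x = DIA, y = HT)) +
  geom_point() +
  coord_cartesian(xlim = c(0, 50))

# Number of x values lost in each version
sum(is.na(layer_data(p_limits)$x))
sum(is.na(layer_data(p_zoom)$x))
```

For a plain scatter plot the two look the same; the difference matters once a smoother, boxplot, or summary is computed from the data.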
Themes control non-data elements: background, gridlines, fonts, legend position.
p <- ggplot(tn, aes(x = DIA, y = HT)) + geom_point(alpha = 0.1, size = 0.5)
library(gridExtra)
grid.arrange(
p + ggtitle("Default (theme_grey)"),
p + theme_minimal() + ggtitle("theme_minimal()"),
p + theme_bw() + ggtitle("theme_bw()"),
p + theme_classic() + ggtitle("theme_classic()"),
ncol = 2
)
You can also customize individual elements with
theme():
tn %>%
filter(COMMON_NAME %in% common5) %>%
ggplot(aes(x = DIA, y = HT, color = COMMON_NAME)) +
geom_point(alpha = 0.3, size = 1) +
labs(title = "Custom Theme Example", color = "Species") +
theme_minimal(base_size = 14) +
theme(
legend.position = "bottom",
plot.title = element_text(face = "bold", hjust = 0.5),
panel.grid.minor = element_blank()
)
Here is the general pattern. Every ggplot2 call follows this structure:
ggplot(data = <DATA>, aes(<MAPPINGS>)) +
<GEOM_FUNCTION>(aes(<ADDITIONAL MAPPINGS>), <FIXED SETTINGS>) +
facet_...() + # optional
scale_...() + # optional
labs(...) + # optional but recommended
theme_...() # optional
Once you internalize this pattern, building any visualization becomes a matter of choosing the right geom and aesthetics.
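The template distinguishes mappings given in ggplot() from additional mappings given inside a geom. Mappings in ggplot() are inherited by every layer; mappings inside a geom apply to that layer only. The sketch below shows the effect on simulated data (toy and its columns are made up): the same color mapping produces either one trend line per group or a single overall line, depending on where it is placed.

```r
library(ggplot2)

set.seed(42)
# Simulated height-diameter data for two wood types (not FIA data)
toy <- data.frame(
  DIA  = rep(seq(5, 25, by = 2), times = 2),
  type = rep(c("H", "S"), each = 11)
)
toy$HT <- 20 + 3 * toy$DIA + ifelse(toy$type == "S", 10, 0) + rnorm(22, sd = 3)

# color in ggplot(): inherited by both layers, so one trend line per type
p_global <- ggplot(toy, aes(x = DIA, y = HT, color = type)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# color only in geom_point(): the smoother sees no grouping, so one line overall
p_local <- ggplot(toy, aes(x = DIA, y = HT)) +
  geom_point(aes(color = type)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Number of separate trend lines in the smoother layer (layer 2)
length(unique(layer_data(p_global, 2)$group))
length(unique(layer_data(p_local, 2)$group))
```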
Let’s build a plot step by step using the full eastern US dataset. We will map species richness — the number of unique tree species — across all 31 states.
richness <- live %>%
group_by(GRIDID, centroidLon, centroidLat) %>%
summarise(n_species = n_distinct(COMMON_NAME), .groups = "drop")
head(richness)
## # A tibble: 6 × 4
## GRIDID centroidLon centroidLat n_species
## <dbl> <dbl> <dbl> <int>
## 1 190 -81.8 24.7 3
## 2 191 -81.6 24.6 3
## 3 192 -81.4 24.6 4
## 4 423 -81.4 24.8 1
## 5 425 -81.0 24.7 2
## 6 889 -80.6 25.0 5
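One detail of n_distinct() is worth knowing when reading these numbers: by default it counts NA as a distinct value. If any tree's species code had no match in the reference table, its COMMON_NAME is NA and would add one "species" to that grid cell. Whether this occurs in the FIA data is not checked here; the sketch below only demonstrates the behaviour on a made-up table (toy is hypothetical).

```r
library(dplyr)

# Made-up records: grid cell 2 contains one tree with an unknown species name
toy <- tibble(
  GRIDID      = c(1, 1, 1, 2, 2, 2),
  COMMON_NAME = c("red maple", "red maple", "white oak",
                  "loblolly pine", "loblolly pine", NA)
)

toy_richness <- toy %>%
  group_by(GRIDID) %>%
  summarise(
    n_species       = n_distinct(COMMON_NAME),                # NA counts as a value
    n_species_known = n_distinct(COMMON_NAME, na.rm = TRUE),  # NA is ignored
    .groups = "drop"
  )

toy_richness
```

The two columns agree for grid cell 1 and differ by one for grid cell 2, where the NA is present.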
ggplot(richness, aes(x = centroidLon, y = centroidLat)) +
geom_point(aes(color = n_species), size = 0.8)
The shape of the eastern US is already visible — but the default color scale doesn’t highlight the pattern well.
ggplot(richness, aes(x = centroidLon, y = centroidLat)) +
geom_point(aes(color = n_species), size = 0.8) +
scale_color_viridis_c(option = "C") +
coord_quickmap()
coord_quickmap() adjusts the aspect ratio so the map
doesn’t look squashed. viridis option "C"
(“plasma”) makes the richness gradient pop.
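The reason a correction is needed: away from the equator, one degree of longitude covers less ground than one degree of latitude, by a factor of roughly cos(latitude). The arithmetic below is a back-of-the-envelope illustration of that idea at an assumed latitude of 35°N; it is not the internal code of coord_quickmap().

```r
# Assumed latitude for the illustration (chosen by hand, not taken from the data)
lat_deg <- 35
lat_rad <- lat_deg * pi / 180

# Ground distance of 1 degree of longitude relative to 1 degree of latitude
lon_shrink <- cos(lat_rad)

# y/x aspect ratio that keeps distances roughly undistorted at this latitude
aspect <- 1 / lon_shrink

round(c(lon_shrink = lon_shrink, aspect = aspect), 3)
```

Passing this value to coord_fixed(ratio = aspect) would be a manual version of the same adjustment; coord_quickmap() works the ratio out from the plotted region for you.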
ggplot(richness, aes(x = centroidLon, y = centroidLat)) +
geom_point(aes(color = n_species), size = 0.8, alpha = 0.8) +
scale_color_viridis_c(option = "C", name = "# Species") +
coord_quickmap() +
labs(
title = "Tree Species Richness Across the Eastern United States",
subtitle = "Each point is a FIA grid cell; color = number of unique species",
x = "Longitude",
y = "Latitude",
caption = "Source: USDA FIA Database"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "right")
Each step added one new idea. This is how you should build your plots: start simple, then layer on complexity.
The richness map reveals the latitudinal diversity gradient — one of the most fundamental patterns in biogeography. Species richness peaks in the southern Appalachians (~33–36°N) due to glacial refugia, topographic heterogeneity, and favorable moisture conditions.
Use the FIA data you loaded above. The live data frame
has the full eastern US; tn has Tennessee only.
Create a histogram of tree heights (HT) for Tennessee. Use a bin width of 5 feet and fill the bars with a color of your choice. Add an appropriate title and axis labels.
Use geom_histogram(binwidth = 5, fill = "your_color", color = "white").
ggplot(tn, aes(x = HT)) +
geom_histogram(binwidth = 5, fill = "coral", color = "white") +
labs(title = "Distribution of Tree Heights in Tennessee",
x = "Height (ft)", y = "Count") +
theme_minimal()
Create boxplots comparing tree diameter (DIA) across the top 5 most common species in Tennessee. Flip the coordinates for readability.
First find the top 5 species with count(). Then filter
and pipe into ggplot. Use coord_flip().
top5_sp <- tn %>% count(COMMON_NAME, sort = TRUE) %>% slice_head(n = 5) %>% pull(COMMON_NAME)
tn %>%
filter(COMMON_NAME %in% top5_sp) %>%
ggplot(aes(x = reorder(COMMON_NAME, DIA, FUN = median), y = DIA, fill = COMMON_NAME)) +
geom_boxplot(show.legend = FALSE) +
coord_flip() +
labs(title = "Diameter Distribution of Top 5 Species (TN)",
x = NULL, y = "DBH (inches)") +
theme_minimal()
Create a scatter plot of DIA vs. HT, faceted by SFTWD_HRDWD (softwood vs. hardwood). Add a loess trend line to each panel. Use the Tennessee data.
Use facet_wrap(~ SFTWD_HRDWD) and
geom_smooth(method = "loess").
ggplot(tn, aes(x = DIA, y = HT)) +
geom_point(alpha = 0.1, size = 0.5, color = "darkblue") +
geom_smooth(method = "loess", color = "red", se = FALSE) +
facet_wrap(~ SFTWD_HRDWD) +
labs(title = "Height vs. Diameter: Softwoods vs. Hardwoods",
x = "DBH (inches)", y = "Height (ft)") +
theme_minimal()
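A practical note on loess with data this size: the ggplot2 documentation warns that loess scales poorly in memory on large datasets, so the fit can be slow. One workaround, offered here as a sketch rather than as part of the tutorial's solution, is to fit the smoother on a random subsample while still drawing every point. The data below are simulated (big, n_trees, and the growth curve are all made up), and the subsample size of 2,000 is an arbitrary choice.

```r
library(ggplot2)
library(dplyr)

set.seed(123)
# Simulated stand-in for a large tree table (not FIA data)
n_trees <- 50000
big <- tibble(DIA = runif(n_trees, min = 5, max = 40)) %>%
  mutate(HT = 120 * (1 - exp(-0.06 * DIA)) + rnorm(n_trees, sd = 8))

p_sub <- ggplot(big, aes(x = DIA, y = HT)) +
  # Points layer: uses all rows
  geom_point(alpha = 0.05, size = 0.5) +
  # Smoother layer: a formula passed as data is applied to the plot data,
  # so loess is fit on a random subsample of 2,000 rows only
  geom_smooth(
    data = ~ slice_sample(.x, n = 2000),
    method = "loess", formula = y ~ x,
    color = "red", se = FALSE
  ) +
  labs(x = "DBH (inches)", y = "Height (ft)")

p_sub
```

Another option is to leave method unset: for 1,000 or more observations geom_smooth() then defaults to a GAM smoother, which is designed to cope with larger data.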
Calculate the mean diameter by species for the top 10 species in Tennessee. Create a horizontal bar chart ordered by value, with the mean diameter displayed next to each bar.
Combine group_by(), summarise(),
reorder(), geom_col(),
geom_text(), and coord_flip().
top10_sp <- tn %>% count(COMMON_NAME, sort = TRUE) %>% slice_head(n = 10) %>% pull(COMMON_NAME)
sp_dia <- tn %>%
filter(COMMON_NAME %in% top10_sp) %>%
group_by(COMMON_NAME) %>%
summarise(mean_DIA = mean(DIA, na.rm = TRUE))
ggplot(sp_dia, aes(x = reorder(COMMON_NAME, mean_DIA), y = mean_DIA)) +
geom_col(fill = "darkgreen") +
geom_text(aes(label = round(mean_DIA, 1)), hjust = -0.2, size = 3.5) +
coord_flip() +
labs(title = "Mean Diameter of Top 10 Species (TN)",
x = NULL, y = "Mean DBH (inches)") +
theme_minimal()
Create overlapping density plots of diameter (DIA) for three species that represent different ecological strategies: red maple (shade-tolerant generalist), yellow-poplar (fast-growing gap specialist), and Virginia pine (early-successional conifer). Set transparency so all curves are visible. Which species tends to have the largest trees?
Use geom_density() with fill = COMMON_NAME
and alpha = 0.4.
three_sp <- c("red maple", "yellow-poplar", "Virginia pine")
tn %>%
filter(COMMON_NAME %in% three_sp) %>%
ggplot(aes(x = DIA, fill = COMMON_NAME)) +
geom_density(alpha = 0.4) +
labs(title = "Diameter Distribution by Species (TN)",
x = "DBH (inches)", y = "Density", fill = "Species") +
theme_minimal(base_size = 13) +
theme(legend.position = "bottom")
# Yellow-poplar tends to have the largest diameters — it is a fast-growing
# canopy tree that can reach impressive sizes in rich cove forests.
Using the full eastern US data (live), create a map showing the geographic distribution of loblolly pine. For each grid cell where loblolly pine occurs, plot a point with color mapped to the number of loblolly pine stems in that cell. Use coord_quickmap() for proper aspect ratio and scale_color_viridis_c() for color.
Filter live to loblolly pine, then
group_by(GRIDID, centroidLon, centroidLat) and count. Use
geom_point() with aes(color = n).
loblolly_grid <- live %>%
filter(COMMON_NAME == "loblolly pine") %>%
group_by(GRIDID, centroidLon, centroidLat) %>%
summarise(n = n(), .groups = "drop")
ggplot(loblolly_grid, aes(x = centroidLon, y = centroidLat, color = n)) +
geom_point(size = 1.2, alpha = 0.7) +
scale_color_viridis_c(option = "D", name = "# Stems") +
coord_quickmap() +
labs(title = "Geographic Distribution of Loblolly Pine",
subtitle = "Color intensity = abundance (stem count per grid cell)",
x = "Longitude", y = "Latitude") +
theme_minimal(base_size = 13)
| Geom | Use For | Key Arguments |
|---|---|---|
| geom_histogram() | Distribution of one variable | binwidth, fill, color |
| geom_density() | Smooth distribution curves | alpha, fill |
| geom_point() | Scatter plots, maps | alpha, size, color, shape |
| geom_smooth() | Trend lines | method, se, linewidth |
| geom_col() | Bar charts (pre-computed values) | fill |
| geom_bar() | Bar charts (ggplot counts for you) | fill |
| geom_boxplot() | Compare distributions across groups | fill |
| geom_line() | Connect points in order | linewidth, linetype |
| geom_text() | Add text labels to a plot | label, hjust, vjust, size |
| Helper | What It Does |
|---|---|
| aes() | Map data columns to visual properties |
| labs() | Set title, subtitle, axis labels, caption |
| coord_flip() | Swap x and y axes |
| coord_quickmap() | Aspect ratio for geographic maps |
| facet_wrap() | Split into panels by one variable |
| facet_grid() | Split into panels by two variables |
| scale_color_viridis_c() | Colorblind-friendly continuous color scale |
| scale_color_brewer() | Discrete ColorBrewer palettes (some, such as "Set2", are colorblind-friendly) |
| theme_minimal() | Clean theme with minimal gridlines |
| reorder(x, y) | Reorder a factor by values of another variable |
Next step: In Lab 2, you will apply these ggplot2 skills to explore the full FIA dataset — visualizing diameter distributions, species abundance patterns, and geographic gradients across the eastern US.
End of Tutorial — ggplot2 Grammar of Graphics with FIA Data