This answer key shows the expected solutions for each question. Multiple correct approaches exist — if your code produces the right result, it receives full credit regardless of the specific functions used.
Before answering any questions, you need to load packages and read in the two data files:
library(readr)
library(dplyr)
tree <- read_csv("fia_eastern31_recent.csv", show_col_types = FALSE)
ref <- read_csv("REF_SPECIES_trimmed.csv", show_col_types = FALSE)
Important: tree should have
2,866,808 rows and 40 columns. If you
see ~1,048,576 rows, the file was opened in Excel (which has a row
limit) rather than loaded in R.
The function c() creates a vector. Then we use
mean(), max(), and length() to
compute the requested summaries.
diameters <- c(5.2, 12.0, 8.7, 3.1, 15.4, 9.9, 22.6, 7.3)
mean(diameters) # 10.525
max(diameters) # 22.6
length(diameters) # 8
Why these functions?
- mean() calculates the arithmetic average of all values in the vector
- max() returns the single largest value
- length() counts how many elements the vector contains (8 values)
Common mistake: Passing individual numbers instead of the vector. mean(5.2, 12.0, 8.7, ...) does NOT work correctly because mean() only uses the first argument. You must first store values in a vector with c(), then pass the vector name.
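To see the pitfall concretely, the sketch below runs both calls on the same eight diameters. The names `wrong` and `right` are illustrative only and are not part of the assignment.

```r
diameters <- c(5.2, 12.0, 8.7, 3.1, 15.4, 9.9, 22.6, 7.3)

# Only 5.2 is treated as the data; the remaining numbers are matched to
# other arguments of mean() or ignored
wrong <- mean(5.2, 12.0, 8.7, 3.1, 15.4, 9.9, 22.6, 7.3)

# The whole vector is averaged
right <- mean(diameters)

wrong  # 5.2
right  # 10.525
```

No error or warning is raised for the first call, which is why this mistake is easy to miss.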
Square bracket notation [ ] combined with a logical
condition lets us extract specific elements from a vector.
big <- diameters[diameters > 10]
big # 12.0, 15.4, 22.6
length(big) # 3
How this works step-by-step:
1. diameters > 10 creates a logical vector: FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE
2. diameters[...] uses that logical vector to select only the TRUE positions
Alternative: sum(diameters > 10) also counts the matches (TRUE counts as 1, FALSE as 0).
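The intermediate logical vector can be stored and inspected on its own. This is a minimal sketch; `is_big` is a name chosen here for illustration.

```r
diameters <- c(5.2, 12.0, 8.7, 3.1, 15.4, 9.9, 22.6, 7.3)

is_big <- diameters > 10   # one TRUE/FALSE per element

diameters[is_big]          # 12.0 15.4 22.6
sum(is_big)                # 3
which(is_big)              # 2 5 7 (positions of the TRUE values)
```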
states <- c("TN", "NC", "FL", "GA", "VA")
states[3] # "FL" — the 3rd element
states[4] <- "AL" # replaces "GA" with "AL"
states # "TN" "NC" "FL" "AL" "VA"
Key concepts:
- R indexing starts at [1], not [0]
- states[3] extracts the 3rd element
- states[4] <- "AL" replaces the value at position 4 in-place
- Printing states after replacement shows the updated vector
nrow(tree) # 2,866,808 rows
ncol(tree) # 40 columns
names(tree) # all 40 column names
class(tree$STATE_ABBR) # "character"
class(tree$DIA) # "numeric"
Explanation:
- nrow() and ncol() (or dim()) give the dimensions of a data frame
- names() (or colnames()) lists all column names
- class() tells you the data type of a column, accessed with $ notation
- STATE_ABBR is character (text) because it stores state abbreviations like “TN”
- DIA is numeric because it stores measurements (diameter in inches)
Also accepted: str(tree) gives all of this information in one command; typeof() instead of class().
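The same inspection functions can be tried on a small data frame before running them on the full data. The sketch below uses a made-up three-row data frame named `toy`, not the real tree table.

```r
# Hypothetical miniature of the tree table
toy <- data.frame(
  STATE_ABBR = c("TN", "NC", "FL"),
  DIA = c(5.2, 12.0, 8.7),
  stringsAsFactors = FALSE
)

dim(toy)               # 3 2 (rows, then columns)
class(toy$STATE_ABBR)  # "character"
class(toy$DIA)         # "numeric"
typeof(toy$DIA)        # "double"
```

Note that typeof() reports the storage type, so a numeric column shows up as "double" rather than "numeric".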
summary(tree$DIA)
This produces a six-number summary plus NA count:
| Statistic | Value |
|---|---|
| Min | 1.00 |
| 1st Quartile | 4.10 |
| Median | 7.00 |
| Mean | 8.01 |
| 3rd Quartile | 10.50 |
| Max | 67.00 |
| NA’s | 391,409 |
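The NA count in the last row can also be checked directly. A minimal sketch on a made-up vector `x` (not the real DIA column):

```r
x <- c(1.0, 4.1, NA, 7.0, NA)

sum(is.na(x))          # 2 missing values
mean(x)                # NA, because missing values propagate
mean(x, na.rm = TRUE)  # 4.033333, computed from the 3 non-missing values
```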
Interpretation:
- summary() output automatically reports NAs at the end, which is how you can tell they exist
# Option 1: Base R bracket notation
head(tree[, c("STATE_ABBR", "SPCD", "DIA", "HT")], 10)
# Option 2: dplyr
tree %>% select(STATE_ABBR, SPCD, DIA, HT) %>% head(10)
Both approaches work. The key is selecting exactly those 4
columns and limiting to 10 rows (not the
default 6 from head()).
First, we join the species reference table to the tree data so each tree gets species information (like common name). Then we filter to only live trees in Tennessee.
tree <- tree %>% left_join(ref, by = "SPCD")
tn_live <- tree %>%
  filter(STATE_ABBR == "TN", STATUSCD == 1)
nrow(tn_live) # 77,994
Why left_join? It keeps ALL rows from
tree and adds matching columns from ref. Trees
whose SPCD doesn’t appear in ref will get NA
for the new columns (like COMMON_NAME). We join by SPCD
because that’s the shared column (species code) between the two
tables.
Why two filter conditions?
STATE_ABBR == "TN" keeps only Tennessee trees, and
STATUSCD == 1 keeps only live trees (2 = standing dead).
Using a comma between conditions is equivalent to &
(AND).
tn_live <- tn_live %>%
  mutate(
    DIA_cm = DIA * 2.54,              # inches to centimeters
    BA_sqft = pi / 4 * (DIA / 12)^2   # basal area in square feet
  )
# Preview the new columns
tn_live %>%
  select(COMMON_NAME, DIA, DIA_cm, BA_sqft) %>%
  head(6)
Understanding the basal area formula:
\[BA = \frac{\pi}{4} \times \left(\frac{DIA}{12}\right)^2\]
- DIA / 12 converts inches to feet: DIA is in inches, but basal area is in square feet
- pi / 4 * (DIA/12)^2 computes the cross-sectional area of the tree trunk, since a circle of diameter d has area π(d/2)² = (π/4)d²
- pi is a built-in constant (3.14159…)
Common mistake: Using pi * 4 instead of pi / 4. This gives a value 16 times too large and is not the area of a circle.
Important: The mutate result must be
saved (assigned back with <-) if you
want to use the new columns later. If you just write
tn_live %>% mutate(...) without assigning, the result
prints to console but isn’t stored.
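One way to sanity-check the formula is to wrap it in a small function and try diameters whose answer is easy to verify by hand. The function name `basal_area_sqft` is an assumption made for this sketch; it is not part of the assignment.

```r
# Basal area in square feet from a diameter in inches
basal_area_sqft <- function(dia_in) {
  pi / 4 * (dia_in / 12)^2
}

# A 12-inch tree is a circle 1 foot across, so its area is pi / 4
basal_area_sqft(12)        # 0.7853982

# Works on whole vectors, which is why it can be used inside mutate()
basal_area_sqft(c(6, 24))  # 0.1963495 3.1415927
```

Doubling the diameter quadruples the basal area, which matches the squared term in the formula.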
tn_live %>%
  filter(DIA > 20) %>%
  select(COMMON_NAME, DIA, HT) %>%
  arrange(desc(DIA)) %>%
  head(10)
Reading the pipe (%>%) step by step:
1. Start with tn_live (77,994 rows of TN live trees)
2. filter(DIA > 20) → keep only trees with diameter > 20 inches (large trees)
3. select(...) → keep only these 3 columns
4. arrange(desc(DIA)) → sort by diameter, largest first (desc() = descending)
5. head(10) → show only the top 10 rows
This is a “pipeline”: data flows from one operation to the next. Each %>% passes the result to the next function.
Also accepted: slice_head(n = 10) instead of head(10).
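The same pipeline shape can be traced by hand on a tiny table. The sketch below uses a made-up four-row tibble named `toy`; the values are illustrative, not taken from the FIA data.

```r
library(dplyr)

toy <- tibble(
  COMMON_NAME = c("white oak", "red maple", "loblolly pine", "sweetgum"),
  DIA = c(24.5, 8.1, 31.0, 21.2),
  HT = c(80, 45, 95, 70)
)

big_trees <- toy %>%
  filter(DIA > 20) %>%           # drops red maple (8.1)
  select(COMMON_NAME, DIA) %>%   # keeps 2 of the 3 columns
  arrange(desc(DIA)) %>%         # 31.0, 24.5, 21.2
  head(2)                        # top 2 rows only

big_trees$COMMON_NAME  # "loblolly pine" "white oak"
```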
# Option 1
n_distinct(tn_live$COMMON_NAME) # 125
# Option 2
tn_live %>% distinct(COMMON_NAME) %>% nrow()
# Option 3
length(unique(tn_live$COMMON_NAME))
All three approaches give 125 unique species among live trees in Tennessee. Any of these is fully correct.
state_summary <- tree %>%
  filter(STATUSCD == 1) %>%
  group_by(STATE_ABBR) %>%
  summarise(
    n_trees = n(),
    n_species = n_distinct(SPCD),
    mean_DIA = mean(DIA, na.rm = TRUE)
  ) %>%
  arrange(desc(n_species))
state_summary
Georgia (GA) has the most species with 142 unique species codes.
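A toy version of this grouped summary makes the mechanics easy to check by hand. The five-row tibble `toy` and the extra column `mean_with_na` are made up for illustration; the species codes are example values only.

```r
library(dplyr)

toy <- tibble(
  STATE_ABBR = c("TN", "TN", "TN", "GA", "GA"),
  SPCD = c(802, 802, 316, 131, 611),
  DIA = c(10.0, NA, 6.0, 8.0, 12.0)
)

out <- toy %>%
  group_by(STATE_ABBR) %>%
  summarise(
    n_trees = n(),                      # rows per state
    n_species = n_distinct(SPCD),       # unique species codes per state
    mean_with_na = mean(DIA),           # NA for TN, which has a missing DIA
    mean_DIA = mean(DIA, na.rm = TRUE)  # missing DIA ignored
  )

out
# GA: n_trees 2, n_species 2, mean_with_na 10, mean_DIA 10
# TN: n_trees 3, n_species 2, mean_with_na NA, mean_DIA 8
```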
How group_by + summarise works:
- group_by(STATE_ABBR) splits the data into 31 groups (one per state)
- summarise() collapses each group into one summary row
- n() counts rows in each group, n_distinct() counts unique values
- na.rm = TRUE tells mean() to ignore NA values; without this, any group with NAs would return NA for the mean
- arrange(desc(n_species)) sorts the result so the state with the most species appears first
tree %>%
  filter(STATUSCD == 1, STATE_ABBR == "TN") %>%
  group_by(COMMON_NAME) %>%
  summarise(
    n = n(),
    mean_DIA = mean(DIA, na.rm = TRUE),
    mean_HT = mean(HT, na.rm = TRUE)
  ) %>%
  arrange(desc(n)) %>%
  slice_head(n = 5)
Expected top 5: red maple, yellow-poplar, loblolly pine, white oak, chestnut oak (these are Tennessee’s most abundant tree species by stem count).
Note: slice_head(n = 5) or
head(5) limits the output to the top 5 rows after
sorting.
tree %>%
  filter(STATUSCD == 1) %>%
  group_by(MAJOR_SPGRPCD) %>%
  summarise(
    n_trees = n(),
    mean_DIA = mean(DIA, na.rm = TRUE)
  )
| MAJOR_SPGRPCD | Group | n_trees | mean_DIA |
|---|---|---|---|
| 1 | Pine | ~439,406 | ~8.84 |
| 2 | Other softwood | ~308,074 | ~6.75 |
| 3 | Soft hardwood | ~724,188 | ~7.38 |
| 4 | Hard hardwood | ~696,799 | ~8.27 |
Bonus: Adding labels with
mutate(group_name = case_when(...)) makes the output more
readable, but was not required.
# left_join keeps ALL rows from the left table (tree).
# If a row in tree has no match in ref, the new columns are filled with NA.
#
# inner_join keeps ONLY rows that have matches in BOTH tables.
# Unmatched rows are dropped entirely.
#
# With 100 tree rows and 3 unmatched SPCD:
# left_join(tree, ref) → 100 rows (all kept; 3 have NA for COMMON_NAME)
# inner_join(tree, ref) → 97 rows (3 unmatched rows are dropped)
#
# In left_join, the COMMON_NAME column for those 3 unmatched rows = NA
The key insight: left_join preserves all your original
data (filling gaps with NA), while inner_join drops
unmatched rows.
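The row counts are easy to confirm on a tiny example. In this sketch `tree_toy` and `ref_toy` are made-up stand-ins for the real tables, and 999 is a hypothetical species code with no match in the reference table.

```r
library(dplyr)

tree_toy <- tibble(SPCD = c(131, 316, 999), DIA = c(8.6, 7.4, 5.0))
ref_toy  <- tibble(SPCD = c(131, 316),
                   COMMON_NAME = c("loblolly pine", "red maple"))

left  <- left_join(tree_toy, ref_toy, by = "SPCD")
inner <- inner_join(tree_toy, ref_toy, by = "SPCD")

nrow(left)        # 3 (every tree row kept)
nrow(inner)       # 2 (the unmatched row is dropped)
left$COMMON_NAME  # "loblolly pine" "red maple" NA
```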
unused <- ref %>%
anti_join(tree, by = "SPCD")
nrow(unused) # 2,375
2,375 species in the reference table have zero trees in our eastern US dataset.
Why? The reference table (ref) is a
national species catalog containing western US species (e.g.,
Douglas-fir, giant sequoia), tropical species, and rare species that
simply don’t occur in the 31 eastern states in our dataset.
Critical: Direction matters!
ref %>% anti_join(tree) finds species in
ref NOT in tree. If you reverse it —
tree %>% anti_join(ref) — you’d find trees whose SPCD is
NOT in ref, which gives a different (likely 0) result.
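The effect of direction shows up even with three rows. The sketch below uses made-up tables `tree_toy` and `ref_toy`; the codes and names are illustrative values, not a lookup against the real reference file.

```r
library(dplyr)

tree_toy <- tibble(SPCD = c(131, 316, 316))
ref_toy  <- tibble(
  SPCD = c(131, 316, 202),
  COMMON_NAME = c("loblolly pine", "red maple", "Douglas-fir")
)

# Species in the reference table with no trees
unused_toy <- ref_toy %>% anti_join(tree_toy, by = "SPCD")

# Trees whose species code is missing from the reference table
orphans_toy <- tree_toy %>% anti_join(ref_toy, by = "SPCD")

unused_toy$COMMON_NAME  # "Douglas-fir"
nrow(orphans_toy)       # 0
```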
fl_trees <- tree %>% filter(STATE_ABBR == "FL", STATUSCD == 1)
ga_trees <- tree %>% filter(STATE_ABBR == "GA", STATUSCD == 1)
fl_ga <- bind_rows(fl_trees, ga_trees)
# Verify:
nrow(fl_ga) # 208,819
nrow(fl_trees) + nrow(ga_trees) # 208,819 (should match)
nrow(fl_ga) == nrow(fl_trees) + nrow(ga_trees) # TRUE
bind_rows() stacks data frames vertically (appending rows). It’s like stacking spreadsheets on top of each other. The total row count must equal the sum of the row counts of the individual data frames.
Also accepted: rbind() works the same
way for data frames with identical column structures.
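The two functions differ when the columns do not line up. The sketch below uses made-up data frames `fl_toy` and `ga_toy`, where only the second has an HT column, to show one such case.

```r
library(dplyr)

fl_toy <- data.frame(STATE_ABBR = c("FL", "FL"), DIA = c(7.1, 9.3))
ga_toy <- data.frame(STATE_ABBR = "GA", DIA = 11.0, HT = 60)

# bind_rows() matches columns by name and fills the gaps with NA
stacked <- bind_rows(fl_toy, ga_toy)
nrow(stacked)  # 3
stacked$HT     # NA NA 60

# rbind() refuses data frames with different numbers of columns
rbind_failed <- inherits(try(rbind(fl_toy, ga_toy), silent = TRUE), "try-error")
rbind_failed   # TRUE
```

For the FL and GA subsets in this question both approaches agree, because the two data frames come from the same table and share all columns.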
state_info <- data.frame(
  STATE_ABBR = c("TN", "NC", "FL", "GA", "VA"),
  region = c("Mid-South", "Southeast", "Southeast", "Southeast", "Mid-Atlantic")
)
# Step 1: One row per state with live tree count
state_counts <- tree %>%
  filter(STATUSCD == 1) %>%
  count(STATE_ABBR, name = "n_trees")
# Alternative: group_by(STATE_ABBR) %>% summarise(n_trees = n())
# Step 2: Join with left_join to keep all 31 states
state_counts %>%
  left_join(state_info, by = "STATE_ABBR")
Why left_join? Because
state_info only has 5 states, but our summary has 31. Using
left_join keeps all 31 states in the result. States not in
state_info (like “ME”, “WI”, etc.) will show
NA in the region column.
If we used inner_join instead, we’d lose 26 states —
only the 5 matching states would remain.
This challenge question combines filtering, counting, subsetting, group summaries, and interpretation into one multi-step pipeline.
# Step 1: Identify the 10 most common species by tree count
top10 <- tree %>%
  filter(STATUSCD == 1) %>%
  count(COMMON_NAME, sort = TRUE) %>%
  slice_head(n = 10) %>%
  pull(COMMON_NAME)
# Step 2: Filter to those species, summarise, and arrange
result <- tree %>%
  filter(STATUSCD == 1, COMMON_NAME %in% top10) %>%
  group_by(COMMON_NAME) %>%
  summarise(
    n_trees = n(),
    mean_lat = mean(LAT, na.rm = TRUE),
    mean_DIA = mean(DIA, na.rm = TRUE),
    wood_type = first(SFTWD_HRDWD)
  ) %>%
  arrange(desc(mean_lat))
result
Step 1: Finding the top 10 species
- filter(STATUSCD == 1): only live trees
- count(COMMON_NAME, sort = TRUE): counts trees per species and sorts largest first
- slice_head(n = 10): keeps only the top 10
- pull(COMMON_NAME): extracts the names as a character vector (not a data frame)
The result top10 is a vector like: "loblolly pine", "red maple", "balsam fir", ...
Step 2: Summarising those species
- COMMON_NAME %in% top10: keeps only trees belonging to the top-10 species
- group_by(COMMON_NAME) + summarise(...): calculates stats for each species
- first(SFTWD_HRDWD): takes the wood type (each species has only one)
- arrange(desc(mean_lat)): highest latitude species first
| COMMON_NAME | n_trees | mean_lat | mean_DIA | wood_type |
|---|---|---|---|---|
| quaking aspen | ~109,704 | ~46.5 | ~6.4 | H |
| northern white-cedar | ~55,836 | ~46.2 | ~8.0 | S |
| balsam fir | ~125,252 | ~45.9 | ~5.1 | S |
| sugar maple | ~111,852 | ~43.0 | ~8.4 | H |
| American beech | ~61,277 | ~41.7 | ~6.7 | H |
| red maple | ~240,552 | ~40.6 | ~7.4 | H |
| white oak | ~71,926 | ~37.4 | ~10.3 | H |
| yellow-poplar | ~63,357 | ~36.3 | ~9.8 | H |
| sweetgum | ~111,050 | ~33.8 | ~6.6 | H |
| loblolly pine | ~363,677 | ~33.5 | ~8.6 | S |
Also accepted:
- semi_join() instead of pull() + %in%
- inner_join() with the top-10 data frame to filter
- head(10) + inner_join(tree) as a creative filtering technique
- slice(1:10) instead of slice_head(n = 10)
The key was demonstrating the multi-step pipeline: identify top species → filter to those species → summarise → sort → interpret.
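For readers curious how the semi_join() variant compares with pull() + %in%, here is a minimal sketch on a made-up six-row tibble. It keeps the top 2 species rather than the top 10, and the names `toy`, `top2`, `via_semi`, and `via_in` are illustrative only.

```r
library(dplyr)

toy <- tibble(
  COMMON_NAME = c("red maple", "red maple", "red maple",
                  "white oak", "white oak", "sweetgum"),
  DIA = c(7.4, 6.9, 8.0, 10.3, 9.8, 6.6)
)

# Top 2 species, kept as a data frame instead of a character vector
top2 <- toy %>%
  count(COMMON_NAME, sort = TRUE) %>%
  slice_head(n = 2)

# Filtering join: keeps rows of toy that have a match in top2
via_semi <- toy %>% semi_join(top2, by = "COMMON_NAME")

# Vector-based filter, as in the expected solution
via_in <- toy %>% filter(COMMON_NAME %in% top2$COMMON_NAME)

nrow(via_semi)                         # 5
identical(via_semi$DIA, via_in$DIA)    # TRUE
```

Unlike inner_join(), semi_join() adds no columns from the second table, so the count column `n` from `top2` does not appear in the result.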
If you have questions about the grading of your specific submission, please come to office hours.