This answer key shows the expected solutions for each question. Multiple correct approaches exist — if your code produces the right result, it receives full credit regardless of the specific functions used.


Getting Started

Before answering any questions, you need to load packages and read in the two data files:

library(readr)
library(dplyr)

tree <- read_csv("fia_eastern31_recent.csv", show_col_types = FALSE)
ref  <- read_csv("REF_SPECIES_trimmed.csv", show_col_types = FALSE)

Important: tree should have 2,866,808 rows and 40 columns. If you see about 1,048,576 rows instead, the file was most likely opened and re-saved in Excel, which cuts off everything past its 1,048,576-row sheet limit. Read the original CSV directly in R.


Question 1: Vectors and Basic R (10 points)

(a) Create a numeric vector and compute summaries (3 pts)

The function c() creates a vector. Then we use mean(), max(), and length() to compute the requested summaries.

diameters <- c(5.2, 12.0, 8.7, 3.1, 15.4, 9.9, 22.6, 7.3)

mean(diameters)    # 10.525
max(diameters)     # 22.6
length(diameters)  # 8

Why these functions?

  • mean() calculates the arithmetic average of all values in the vector
  • max() returns the single largest value
  • length() counts how many elements the vector contains (8 values)

Common mistake: Passing individual numbers instead of the vector. mean(5.2, 12.0, 8.7, ...) runs without an error but returns 5.2, because mean() treats only its first argument as the data. Store the values in a vector with c() first, then pass the vector name.
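
A minimal sketch of that mistake, using the same eight values as the solution. The names wrong and right are illustrative only.

```r
diameters <- c(5.2, 12.0, 8.7, 3.1, 15.4, 9.9, 22.6, 7.3)

# Wrong: only the first number is treated as data; the others are
# matched to mean()'s remaining arguments and never averaged
wrong <- mean(5.2, 12.0, 8.7, 3.1, 15.4, 9.9, 22.6, 7.3)

# Right: a single vector argument
right <- mean(diameters)

wrong  # 5.2
right  # 10.525
```

No warning is raised in the wrong version, so the only way to catch it is to sanity-check the number against the data.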


(b) Logical subsetting (3 pts)

Square bracket notation [ ] combined with a logical condition lets us extract specific elements from a vector.

big <- diameters[diameters > 10]
big            # 12.0, 15.4, 22.6
length(big)    # 3

How this works step-by-step:

  1. diameters > 10 creates a logical vector: FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE
  2. diameters[...] uses that logical vector to select only the TRUE positions
  3. The result is a new vector containing only values greater than 10

Alternative: sum(diameters > 10) also counts the matches (TRUE counts as 1, FALSE as 0).
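
The steps above can be traced one at a time. This sketch stores the intermediate logical vector under the illustrative name mask so each stage can be printed.

```r
diameters <- c(5.2, 12.0, 8.7, 3.1, 15.4, 9.9, 22.6, 7.3)

mask <- diameters > 10   # logical vector, one TRUE/FALSE per element
mask                     # FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE

which(mask)              # positions of the TRUE values: 2 5 7
diameters[mask]          # values at those positions: 12.0 15.4 22.6

sum(mask)                # 3     (count of values above 10)
mean(mask)               # 0.375 (proportion of values above 10)
```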


(c) Character vectors and indexing (4 pts)

states <- c("TN", "NC", "FL", "GA", "VA")

states[3]          # "FL"  — the 3rd element

states[4] <- "AL"  # replaces "GA" with "AL"

states             # "TN" "NC" "FL" "AL" "VA"

Key concepts:

  • R uses 1-based indexing — the first element is [1], not [0]
  • states[3] extracts the 3rd element
  • states[4] <- "AL" replaces the value at position 4 in-place
  • Printing states after replacement shows the updated vector
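
A short sketch of a few indexing behaviors related to the concepts above. It was not required for credit, and the names first_state, zero_index and without_first are illustrative.

```r
states <- c("TN", "NC", "FL", "GA", "VA")

first_state   <- states[1]    # "TN": indexing starts at 1
zero_index    <- states[0]    # character(0): index 0 selects nothing
without_first <- states[-1]   # a negative index drops that position

states[length(states)]        # "VA": the last element

# Replacement also works on several positions at once
states[c(3, 4)] <- c("SC", "AL")
states                        # "TN" "NC" "SC" "AL" "VA"
```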

Question 2: Data Frame Inspection (15 points)

(a) Dimensions, column names, and data types (5 pts)

nrow(tree)   # 2,866,808 rows
ncol(tree)   # 40 columns
names(tree)  # all 40 column names

class(tree$STATE_ABBR)  # "character"
class(tree$DIA)         # "numeric"

Explanation:

  • nrow() and ncol() (or dim()) give the dimensions of a data frame
  • names() (or colnames()) lists all column names
  • class() tells you the data type of a column — accessed with $ notation
  • STATE_ABBR is character (text) because it stores state abbreviations like “TN”
  • DIA is numeric because it stores measurements (diameter in inches)

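Also accepted: str(tree), which gives all of this information in one command, and typeof() instead of class(). Note that typeof(tree$DIA) returns "double" rather than "numeric"; both describe the same column.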
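
To see how class() and typeof() differ, here is a sketch on a two-row stand-in data frame. toy is a made-up object, not the real tree data.

```r
# Toy stand-in for tree (hypothetical values)
toy <- data.frame(
  STATE_ABBR = c("TN", "NC"),
  DIA        = c(5.2, 12.0),
  stringsAsFactors = FALSE
)

dim(toy)                # 2 2  (rows, columns)

class(toy$DIA)          # "numeric"
typeof(toy$DIA)         # "double"    (the storage type behind "numeric")

class(toy$STATE_ABBR)   # "character"
typeof(toy$STATE_ABBR)  # "character"

sapply(toy, class)      # class of every column at once
```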


(b) Summary statistics on DIA (5 pts)

summary(tree$DIA)

This produces a six-number summary plus NA count:

Statistic       Value
Min             1.00
1st Quartile    4.10
Median          7.00
Mean            8.01
3rd Quartile    10.50
Max             67.00
NA’s            391,409

Interpretation:

  • The median diameter is 7.0 inches — half of all trees are smaller, half are larger
  • The minimum is 1.0 inch (smallest measured tree) and maximum is 67.0 inches (a very large tree)
  • There are 391,409 NA values — these are rows where diameter was not measured. The summary() output automatically reports NAs at the end, which is how you can tell they exist
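
The NA count can also be checked directly. This sketch uses a made-up six-value vector x in place of tree$DIA.

```r
# Toy diameters with two missing values (hypothetical data)
x <- c(1.0, 4.1, NA, 7.0, 10.5, NA)

n_missing <- sum(is.na(x))        # 2: is.na() gives TRUE/FALSE, sum() counts TRUEs
from_summary <- summary(x)[["NA's"]]  # the same count, taken from summary()

mean(x)                           # NA: one missing value poisons the mean
mean_ok <- mean(x, na.rm = TRUE)  # 5.65: mean of the four measured values
```

The statistics that summary() reports are computed on the non-missing values only, which is why it can show a mean alongside an NA count.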

(c) Display specific columns (5 pts)

# Option 1: Base R bracket notation
head(tree[, c("STATE_ABBR", "SPCD", "DIA", "HT")], 10)

# Option 2: dplyr
tree %>% select(STATE_ABBR, SPCD, DIA, HT) %>% head(10)

Both approaches work. The key is selecting exactly those 4 columns and limiting to 10 rows (not the default 6 from head()).


Question 3: Modifying Data with dplyr (20 points)

(a) Join and filter (5 pts)

First, we join the species reference table to the tree data so each tree gets species information (like common name). Then we filter to only live trees in Tennessee.

tree <- tree %>% left_join(ref, by = "SPCD")

tn_live <- tree %>%
  filter(STATE_ABBR == "TN", STATUSCD == 1)

nrow(tn_live)  # 77,994

Why left_join? It keeps ALL rows from tree and adds matching columns from ref. Trees whose SPCD doesn’t appear in ref will get NA for the new columns (like COMMON_NAME). We join by SPCD because that’s the shared column (species code) between the two tables.

Why two filter conditions? STATE_ABBR == "TN" keeps only Tennessee trees, and STATUSCD == 1 keeps only live trees (2 = standing dead). Using a comma between conditions is equivalent to & (AND).
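
A small sketch of the join and the two filter styles on toy tables. toy_tree and toy_ref are made-up stand-ins, and species code 999 is deliberately absent from the reference table.

```r
library(dplyr)

# Toy stand-ins for tree and ref (hypothetical rows)
toy_tree <- tibble(
  SPCD       = c(131, 316, 999, 131),
  STATE_ABBR = c("TN", "TN", "TN", "NC"),
  STATUSCD   = c(1, 2, 1, 1)
)
toy_ref <- tibble(
  SPCD        = c(131, 316),
  COMMON_NAME = c("loblolly pine", "red maple")
)

joined <- toy_tree %>% left_join(toy_ref, by = "SPCD")
nrow(joined)   # 4: left_join never drops rows from the left table

# A comma and & express the same AND condition
with_comma <- joined %>% filter(STATE_ABBR == "TN", STATUSCD == 1)
with_and   <- joined %>% filter(STATE_ABBR == "TN" & STATUSCD == 1)

with_comma$COMMON_NAME   # "loblolly pine" NA  (SPCD 999 had no match)
```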


(b) Mutate — adding calculated columns (5 pts)

tn_live <- tn_live %>%
  mutate(
    DIA_cm  = DIA * 2.54,
    BA_sqft = pi / 4 * (DIA / 12)^2
  )

tn_live %>%
  select(COMMON_NAME, DIA, DIA_cm, BA_sqft) %>%
  head(6)

Understanding the basal area formula:

\[BA = \frac{\pi}{4} \times \left(\frac{DIA}{12}\right)^2\]

  • DIA is in inches, but basal area is in square feet
  • Dividing DIA by 12 converts inches to feet
  • pi / 4 * (DIA/12)^2 computes the cross-sectional area of the tree trunk
  • In R, pi is a built-in constant (3.14159…)

Common mistake: Using pi * 4 instead of pi / 4. This gives a value 16 times too large, because the area of a circle written in terms of diameter is pi × d² / 4, not pi × 4 × d².

Important: The mutate result must be saved (assigned back with <-) if you want to use the new columns later. If you just write tn_live %>% mutate(...) without assigning, the result prints to console but isn’t stored.
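
A worked check of the formula for a single tree, assuming a 12-inch diameter so the arithmetic is easy to follow. The variable names are illustrative.

```r
dia_in <- 12   # diameter in inches (12 in = 1 ft)

# Formula from the solution: area of a circle with diameter DIA/12 feet
ba_sqft <- pi / 4 * (dia_in / 12)^2
ba_sqft    # 0.7853982 sq ft

# Equivalent shortcut: pi / (4 * 144) is about 0.005454
ba_short <- 0.005454 * dia_in^2
ba_short   # 0.785376 sq ft

# The pi * 4 mistake is 16 times too large
ba_wrong <- pi * 4 * (dia_in / 12)^2
ba_wrong / ba_sqft   # 16
```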


(c) Chained pipe command (5 pts)

tn_live %>%
  filter(DIA > 20) %>%
  select(COMMON_NAME, DIA, HT) %>%
  arrange(desc(DIA)) %>%
  head(10)

Reading the pipe (%>%) step by step:

  1. Start with tn_live (77,994 rows of TN live trees)
  2. filter(DIA > 20) → keep only trees with diameter > 20 inches (large trees)
  3. select(...) → keep only these 3 columns
  4. arrange(desc(DIA)) → sort by diameter, largest first (desc() = descending)
  5. head(10) → show only the top 10 rows

This is a “pipeline” — data flows from one operation to the next. Each %>% passes the result to the next function. Also accepted: slice_head(n = 10) instead of head(10).
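
To make the data flow concrete, this sketch runs the same steps twice on a made-up four-row table, once piped and once as nested function calls. toy and its values are hypothetical.

```r
library(dplyr)

toy <- tibble(
  COMMON_NAME = c("white oak", "red maple", "yellow-poplar", "sweetgum"),
  DIA         = c(24.1, 8.3, 30.5, 21.0),
  HT          = c(80, 45, 110, 75)
)

# Piped: read top to bottom
piped <- toy %>%
  filter(DIA > 20) %>%
  select(COMMON_NAME, DIA, HT) %>%
  arrange(desc(DIA)) %>%
  head(2)

# Nested: the same operations, read from the inside out
nested <- head(
  arrange(select(filter(toy, DIA > 20), COMMON_NAME, DIA, HT), desc(DIA)),
  2
)

piped$COMMON_NAME   # "yellow-poplar" "white oak"
```

Both versions give the same table; the pipe only changes how the code reads.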


(d) Counting unique species (5 pts)

# Option 1
n_distinct(tn_live$COMMON_NAME)  # 125

# Option 2
tn_live %>% distinct(COMMON_NAME) %>% nrow()

# Option 3
length(unique(tn_live$COMMON_NAME))

All three approaches give 125 unique species among live trees in Tennessee. Any of these is fully correct.
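
One detail worth knowing: all three options count NA as its own value when the column contains missing names. The sketch below uses a made-up vector; whether tn_live$COMMON_NAME actually contains NAs is not stated here.

```r
library(dplyr)

# Toy species column with one missing name (hypothetical data)
species <- c("red maple", "white oak", NA, "red maple")

with_na    <- n_distinct(species)                 # 3: NA counted as a value
without_na <- n_distinct(species, na.rm = TRUE)   # 2: NA ignored

length(unique(species))            # 3, same as n_distinct(species)
length(unique(na.omit(species)))   # 2, base R equivalent of na.rm = TRUE
```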


Question 4: Collapsing Data — Group Summaries (20 points)

(a) State-level summary (7 pts)

state_summary <- tree %>%
  filter(STATUSCD == 1) %>%
  group_by(STATE_ABBR) %>%
  summarise(
    n_trees   = n(),
    n_species = n_distinct(SPCD),
    mean_DIA  = mean(DIA, na.rm = TRUE)
  ) %>%
  arrange(desc(n_species))

state_summary

Georgia (GA) has the most species with 142 unique species codes.

How group_by + summarise works:

  1. group_by(STATE_ABBR) splits the data into 31 groups (one per state)
  2. summarise() collapses each group into one summary row
  3. n() counts rows in each group, n_distinct() counts unique values
  4. na.rm = TRUE tells mean() to ignore NA values — without this, any group with NAs would return NA for the mean
  5. arrange(desc(n_species)) sorts the result so the state with the most species appears first
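
The steps above can be sketched on a made-up five-row table. The extra column mean_raw is added only to show what happens without na.rm = TRUE.

```r
library(dplyr)

# Toy stand-in for tree (hypothetical rows; one DIA is missing)
toy <- tibble(
  STATE_ABBR = c("TN", "TN", "TN", "GA", "GA"),
  SPCD       = c(131, 316, 316, 131, 131),
  DIA        = c(10, NA, 6, 8, 12)
)

out <- toy %>%
  group_by(STATE_ABBR) %>%
  summarise(
    n_trees   = n(),                       # rows per state
    n_species = n_distinct(SPCD),          # unique species codes per state
    mean_raw  = mean(DIA),                 # NA for any state with a missing DIA
    mean_DIA  = mean(DIA, na.rm = TRUE)    # missing values skipped
  ) %>%
  arrange(desc(n_species))

out
# TN: n_trees 3, n_species 2, mean_raw NA, mean_DIA 8
# GA: n_trees 2, n_species 1, mean_raw 10, mean_DIA 10
```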

(b) Top 5 species in Tennessee (7 pts)

tree %>%
  filter(STATUSCD == 1, STATE_ABBR == "TN") %>%
  group_by(COMMON_NAME) %>%
  summarise(
    n        = n(),
    mean_DIA = mean(DIA, na.rm = TRUE),
    mean_HT  = mean(HT, na.rm = TRUE)
  ) %>%
  arrange(desc(n)) %>%
  slice_head(n = 5)

Expected top 5: red maple, yellow-poplar, loblolly pine, white oak, chestnut oak (these are Tennessee’s most abundant tree species by stem count).

Note: slice_head(n = 5) or head(5) limits the output to the top 5 rows after sorting.


(c) Summary by major species group (6 pts)

tree %>%
  filter(STATUSCD == 1) %>%
  group_by(MAJOR_SPGRPCD) %>%
  summarise(
    n_trees  = n(),
    mean_DIA = mean(DIA, na.rm = TRUE)
  )
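
Expected output (approximate values; the Group column is added here for readability and is not produced by the code above):

MAJOR_SPGRPCD  Group            n_trees    mean_DIA
1              Pine             ~439,406   ~8.84
2              Other softwood   ~308,074   ~6.75
3              Soft hardwood    ~724,188   ~7.38
4              Hard hardwood    ~696,799   ~8.27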

Bonus: Adding labels with mutate(group_name = case_when(...)) makes the output more readable, but was not required.
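
One possible way to write that bonus step, shown on a four-row toy table. The code-to-name mapping follows the group names listed for this question; the column name group_name is just a suggestion.

```r
library(dplyr)

# Toy table holding the four group codes
toy_groups <- tibble(MAJOR_SPGRPCD = 1:4)

labelled <- toy_groups %>%
  mutate(
    group_name = case_when(
      MAJOR_SPGRPCD == 1 ~ "Pine",
      MAJOR_SPGRPCD == 2 ~ "Other softwood",
      MAJOR_SPGRPCD == 3 ~ "Soft hardwood",
      MAJOR_SPGRPCD == 4 ~ "Hard hardwood"
    )
  )

labelled$group_name
# "Pine" "Other softwood" "Soft hardwood" "Hard hardwood"
```

Any code that matches none of the conditions gets NA, which makes unexpected group codes easy to spot.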


Question 5: Merging and Appending (20 points)

(a) Concept: left_join vs. inner_join (5 pts)

# left_join keeps ALL rows from the left table (tree).
# If a row in tree has no match in ref, the new columns are filled with NA.
#
# inner_join keeps ONLY rows that have matches in BOTH tables.
# Unmatched rows are dropped entirely.
#
# With 100 tree rows, 3 of which have an SPCD that is missing from ref
# (and assuming each SPCD appears only once in ref):
#   left_join(tree, ref)  → 100 rows (all kept; the 3 unmatched rows have NA for COMMON_NAME)
#   inner_join(tree, ref) →  97 rows (the 3 unmatched rows are dropped)

Visual explanation:

  • left_join = “Keep everything from the left, add what matches from the right”
  • inner_join = “Only keep what matches in both”

The key insight: left_join preserves all your original data (filling gaps with NA), while inner_join drops unmatched rows.
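
The 100-row scenario can be reproduced with toy tables. toy_tree and toy_ref are made up for illustration, and species code 999 stands for the three unmatched rows.

```r
library(dplyr)

# 97 rows with codes that exist in the reference table, 3 rows with a code that does not
toy_tree <- tibble(
  tree_id = 1:100,
  SPCD    = c(rep(c(131, 316), length.out = 97), rep(999, 3))
)
toy_ref <- tibble(
  SPCD        = c(131, 316),
  COMMON_NAME = c("loblolly pine", "red maple")
)

left  <- left_join(toy_tree, toy_ref, by = "SPCD")
inner <- inner_join(toy_tree, toy_ref, by = "SPCD")

nrow(left)                     # 100
nrow(inner)                    # 97
sum(is.na(left$COMMON_NAME))   # 3
```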


(b) anti_join to find unmatched species (5 pts)

unused <- ref %>%
  anti_join(tree, by = "SPCD")

nrow(unused)  # 2,375

2,375 species in the reference table have zero trees in our eastern US dataset.

Why? The reference table (ref) is a national species catalog containing western US species (e.g., Douglas-fir, giant sequoia), tropical species, and rare species that simply don’t occur in the 31 eastern states in our dataset.

Critical: Direction matters! ref %>% anti_join(tree) finds species in ref NOT in tree. If you reverse it — tree %>% anti_join(ref) — you’d find trees whose SPCD is NOT in ref, which gives a different (likely 0) result.
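
A sketch of both directions on toy tables, assuming a three-species reference table where one species never appears in the tree data. All values are hypothetical.

```r
library(dplyr)

toy_ref <- tibble(
  SPCD        = c(131, 202, 316),
  COMMON_NAME = c("loblolly pine", "Douglas-fir", "red maple")
)
toy_tree <- tibble(SPCD = c(131, 131, 316))

# Species in the reference table with no trees
unused <- toy_ref %>% anti_join(toy_tree, by = "SPCD")
unused$COMMON_NAME   # "Douglas-fir"

# Trees whose species code is missing from the reference table
orphans <- toy_tree %>% anti_join(toy_ref, by = "SPCD")
nrow(orphans)        # 0
```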


(c) bind_rows to combine data frames (5 pts)

fl_trees <- tree %>% filter(STATE_ABBR == "FL", STATUSCD == 1)
ga_trees <- tree %>% filter(STATE_ABBR == "GA", STATUSCD == 1)

fl_ga <- bind_rows(fl_trees, ga_trees)

# Verify:
nrow(fl_ga)                              # 208,819
nrow(fl_trees) + nrow(ga_trees)          # 208,819 (should match)
nrow(fl_ga) == nrow(fl_trees) + nrow(ga_trees)  # TRUE

bind_rows() stacks data frames vertically (appending rows). It’s like stacking spreadsheets on top of each other. The row count of the result must equal the sum of the row counts of the individual data frames.

Also accepted: rbind() works the same way for data frames with identical column structures.
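
The two functions differ when the columns do not line up. The sketch below uses made-up data frames in which only one has an HT column; it does not apply to fl_trees and ga_trees, which share the same columns.

```r
library(dplyr)

# Toy data frames with different columns (hypothetical values)
fl <- data.frame(STATE_ABBR = "FL", DIA = c(5.1, 7.3))
ga <- data.frame(STATE_ABBR = "GA", DIA = 9.0, HT = 60)

# bind_rows matches columns by name and fills the gaps with NA
stacked <- bind_rows(fl, ga)
nrow(stacked)   # 3
stacked$HT      # NA NA 60

# rbind refuses to combine data frames with different numbers of columns
rbind_result <- tryCatch(rbind(fl, ga), error = function(e) "error")
rbind_result    # "error"
```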


(d) Joining custom data with state summary (5 pts)

state_info <- data.frame(
  STATE_ABBR = c("TN", "NC", "FL", "GA", "VA"),
  region = c("Mid-South", "Southeast", "Southeast", "Southeast", "Mid-Atlantic")
)

# Step 1: One row per state with live tree count
state_counts <- tree %>%
  filter(STATUSCD == 1) %>%
  count(STATE_ABBR, name = "n_trees")
  # Alternative: group_by(STATE_ABBR) %>% summarise(n_trees = n())

# Step 2: Join with left_join to keep all 31 states
state_counts %>%
  left_join(state_info, by = "STATE_ABBR")

Why left_join? Because state_info only has 5 states, but our summary has 31. Using left_join keeps all 31 states in the result. States not in state_info (like “ME”, “WI”, etc.) will show NA in the region column.

If we used inner_join instead, we’d lose 26 states — only the 5 matching states would remain.


Question 6: Putting It All Together (15 points)

This challenge question combines filtering, counting, subsetting, group summaries, and interpretation into one multi-step pipeline.

Complete solution

# Step 1: Identify the 10 most common species by tree count
top10 <- tree %>%
  filter(STATUSCD == 1) %>%
  count(COMMON_NAME, sort = TRUE) %>%
  slice_head(n = 10) %>%
  pull(COMMON_NAME)

# Step 2: Filter to those species, summarise, and arrange
result <- tree %>%
  filter(STATUSCD == 1, COMMON_NAME %in% top10) %>%
  group_by(COMMON_NAME) %>%
  summarise(
    n_trees   = n(),
    mean_lat  = mean(LAT, na.rm = TRUE),
    mean_DIA  = mean(DIA, na.rm = TRUE),
    wood_type = first(SFTWD_HRDWD)
  ) %>%
  arrange(desc(mean_lat))

result

Step-by-step breakdown:

Step 1: Finding the top 10 species

  • filter(STATUSCD == 1) — only live trees
  • count(COMMON_NAME, sort = TRUE) — counts trees per species and sorts largest first
  • slice_head(n = 10) — keeps only the top 10
  • pull(COMMON_NAME) — extracts the names as a character vector (not a data frame)

The result top10 is a vector like: "loblolly pine", "red maple", "balsam fir", ...

Step 2: Summarising those species

  • COMMON_NAME %in% top10 — keeps only trees belonging to the top-10 species
  • group_by(COMMON_NAME) + summarise(...) — calculates stats for each species
  • first(SFTWD_HRDWD) — takes the wood type (each species has only one)
  • arrange(desc(mean_lat)) — highest latitude species first

Expected output and interpretation:

COMMON_NAME            n_trees    mean_lat  mean_DIA  wood_type
quaking aspen          ~109,704   ~46.5     ~6.4      H
northern white-cedar   ~55,836    ~46.2     ~8.0      S
balsam fir             ~125,252   ~45.9     ~5.1      S
sugar maple            ~111,852   ~43.0     ~8.4      H
American beech         ~61,277    ~41.7     ~6.7      H
red maple              ~240,552   ~40.6     ~7.4      H
white oak              ~71,926    ~37.4     ~10.3     H
yellow-poplar          ~63,357    ~36.3     ~9.8      H
sweetgum               ~111,050   ~33.8     ~6.6      H
loblolly pine          ~363,677   ~33.5     ~8.6      S

  • Highest mean latitude: quaking aspen (~46.5°N) — a northern hardwood
  • Lowest mean latitude: loblolly pine (~33.5°N) — a southern softwood
  • Pattern: The three softwoods (S) sit near the extremes: northern white-cedar and balsam fir in the north (~46°N) and loblolly pine in the south (~33.5°N). The seven hardwoods (H) span the whole range, from quaking aspen at the highest mean latitude to sweetgum near the lowest, and account for every species in the middle latitudes. The two most common species overall are loblolly pine, a southern species, and red maple, which centers on the mid-latitudes (~40.6°N).

Alternative approaches that received full credit:

  • Using semi_join() instead of pull() + %in%
  • Using inner_join() with the top-10 data frame to filter
  • Using head(10) + inner_join(tree) as a creative filtering technique
  • Using slice(1:10) instead of slice_head(n = 10)

The key was demonstrating the multi-step pipeline: identify top species → filter to those species → summarise → sort → interpret.
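
As an illustration of the semi_join() alternative listed above, this sketch keeps the top-2 species of a made-up six-row table and compares the result with the pull() + %in% approach. toy and top2_df are hypothetical names.

```r
library(dplyr)

toy <- tibble(
  COMMON_NAME = c("red maple", "red maple", "white oak",
                  "sweetgum", "red maple", "white oak")
)

# Top species kept as a data frame (no pull())
top2_df <- toy %>%
  count(COMMON_NAME, sort = TRUE) %>%
  slice_head(n = 2)

# Approach from the solution: vector + %in%
via_in <- toy %>% filter(COMMON_NAME %in% pull(top2_df, COMMON_NAME))

# Alternative: semi_join keeps matching rows and adds no columns
via_semi <- toy %>% semi_join(top2_df, by = "COMMON_NAME")

nrow(via_in)      # 5
nrow(via_semi)    # 5
names(via_semi)   # "COMMON_NAME" (the n column from top2_df is not added)
```

An inner_join() with top2_df would keep the same five rows but also attach the n column.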


If you have questions about the grading of your specific submission, please come to office hours.