Biogeography R Midterm Exam

Name: ____________________

Time: 1 hour 25 minutes | Total: 100 points

Rules:

- RStudio only; no internet browser.
- You may use help() or ? in the console (e.g., ?left_join, help(filter)).
- Create a new R script (.R) or R Markdown (.Rmd) file for your answers.
- Comment your code. Partial credit will be given for correct logic even with minor syntax errors.
- Two data files are in your working directory:
  - fia_eastern31_recent.csv: tree-level FIA data (~2.87 million rows, 40 columns)
  - REF_SPECIES_trimmed.csv: species reference table (2,677 rows, 9 columns)

Quick Reference: Data Files

`fia_eastern31_recent.csv` — Tree Data (each row = one tree)

Variable	Type	Description
`TREE_CN`	numeric	Unique tree identifier
`PLT_CN`	numeric	Unique plot identifier
`STATECD`	integer	State FIPS code (e.g., 47 = Tennessee)
`STATE_ABBR`	character	Two-letter state abbreviation (e.g., “TN”, “FL”)
`SPCD`	integer	Species code — key to join with REF_SPECIES
`DIA`	numeric	Diameter at breast height (inches)
`HT`	numeric	Total tree height (feet)
`STATUSCD`	integer	1 = live tree, 2 = standing dead
`LAT`	numeric	Plot latitude (decimal degrees)
`LON`	numeric	Plot longitude (decimal degrees; negative because all plots lie west of the prime meridian)
`BALIVE`	numeric	Basal area of live trees on the plot (sq ft/acre)
`MAJOR_SPGRPCD`	integer	1=Pine, 2=Other softwood, 3=Soft hardwood, 4=Hard hardwood

`REF_SPECIES_trimmed.csv` — Species Reference (each row = one species)

Variable	Type	Description
`SPCD`	integer	Species code — key to join with tree data
`COMMON_NAME`	character	Common name (e.g., “red maple”, “loblolly pine”)
`GENUS`	character	Genus name (e.g., “Acer”, “Pinus”)
`SPECIES`	character	Species epithet (e.g., “rubrum”, “taeda”)
`SPECIES_SYMBOL`	character	USDA PLANTS symbol (e.g., “ACRU”)
`E_SPGRPCD`	integer	Eastern species group code
`MAJOR_SPGRPCD`	integer	1=Pine, 2=Other softwood, 3=Soft hardwood, 4=Hard hardwood
`SFTWD_HRDWD`	character	“S” = softwood, “H” = hardwood
`WOODLAND`	character	“Y” if woodland species

Getting started: Load your packages and read the data first.

library(readr)   # or library(data.table), in which case read the files with fread()
library(dplyr)

tree <- read_csv("fia_eastern31_recent.csv", show_col_types = FALSE)
ref  <- read_csv("REF_SPECIES_trimmed.csv", show_col_types = FALSE)

Question 1: Vectors and Basic R (10 points)

(a) (3 pts) Create a numeric vector called diameters containing these 8 values:

5.2, 12.0, 8.7, 3.1, 15.4, 9.9, 22.6, 7.3

Calculate the mean, maximum, and length of this vector.

Hint: use c() to create vectors; mean(), max(), length() for summaries.

(b) (3 pts) Using logical subsetting on diameters, extract only values greater than 10. How many values meet this condition?

Hint: vector_name[vector_name > value] extracts elements meeting a condition.
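
To illustrate the hint above, here is a minimal sketch of logical subsetting on a made-up vector. The name elev and its values are hypothetical and are not part of the exam data.

```r
# Hypothetical plot elevations in metres (not part of the exam data)
elev <- c(120, 455, 310, 890, 45)

above_300 <- elev > 300      # logical vector, one TRUE/FALSE per element
high <- elev[above_300]      # keeps only the elements where the test is TRUE
n_high <- sum(above_300)     # TRUE counts as 1, so the sum is the number of matches

print(high)
print(n_high)
```

Storing the logical vector first is optional; elev[elev > 300] does the same thing in one step.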

(c) (4 pts) Create a character vector called states containing: "TN", "NC", "FL", "GA", "VA".

What is the 3rd element?
Replace the 4th element with "AL".
Print the updated vector.

Hint: use c() with quoted strings; access elements with [ ] indexing.
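
As a small illustration of [ ] indexing, the sketch below reads and then overwrites one element of a made-up character vector. The name genera and its values are hypothetical.

```r
# Hypothetical genus names (not the states vector from the question)
genera <- c("Acer", "Pinus", "Quercus")

second <- genera[2]      # read the 2nd element (R indexing starts at 1)
genera[3] <- "Carya"     # overwrite the 3rd element in place

print(second)
print(genera)
```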

Question 2: Data Frame Inspection (15 points)

(a) (5 pts) After reading fia_eastern31_recent.csv into an object called tree, answer:

How many rows and how many columns does the tree data have?
List all column names.
What data type is the STATE_ABBR column? What data type is DIA?

Hint: useful functions include nrow(), ncol(), names(), str(), class().

(b) (5 pts) Use summary() on the DIA column (diameter at breast height, in inches):

What is the median diameter?
What are the minimum and maximum?
Are there any NA values? How can you tell from the summary output?

Hint: summary(tree$DIA) produces a six-number summary. If NAs exist, they appear as “NA’s: xxx” at the end.
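
The sketch below shows how missing values surface in summary() output, using a made-up vector x (hypothetical values, not the exam data). It also counts the NAs directly with is.na() as a cross-check.

```r
# Hypothetical measurements with two missing values
x <- c(4.1, NA, 7.5, 6.0, NA)

s <- summary(x)              # six statistics computed without the NAs, plus an NA count
print(s)

n_missing <- sum(is.na(x))   # direct count of missing values
print(n_missing)
```

If a vector has no missing values, the extra NA entry is simply absent from the summary.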

(c) (5 pts) Display the first 10 rows of just these four columns: STATE_ABBR, SPCD, DIA, HT.

Hint: you can use base R bracket notation data[rows, columns] with head(), or dplyr’s select() with head().
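
The two approaches named in the hint can be compared on mtcars, a data set that ships with R and stands in for the exam data here. This is only a sketch; the column choice is arbitrary.

```r
library(dplyr)

# Base R: data[rows, columns] with the rows position left blank, then head()
base_way <- head(mtcars[, c("mpg", "cyl", "hp")], 5)

# dplyr: pick columns by name with select(), then head()
dplyr_way <- mtcars %>%
  select(mpg, cyl, hp) %>%
  head(5)

print(base_way)
print(dplyr_way)
```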

Question 3: Modifying Data with dplyr (20 points)

(a) (5 pts) First, join the species reference table to the tree data so that each tree gets its species name. Use SPCD as the key:

tree <- tree %>% left_join(ref, by = "SPCD")

Now, using filter(), create a new data frame called tn_live that contains only live trees in Tennessee. How many rows does tn_live have?

Variables to use: STATE_ABBR (value: "TN" for Tennessee), STATUSCD (value: 1 for live trees).

(b) (5 pts) Using mutate(), add two new columns to tn_live:

DIA_cm: diameter converted from inches to centimeters (multiply DIA by 2.54)
BA_sqft: individual tree basal area in square feet, calculated as: \[BA = \frac{\pi}{4} \times \left(\frac{DIA}{12}\right)^2\] (Note: DIA is in inches; dividing by 12 converts to feet. Use pi for π in R.)

Show the first 6 rows of COMMON_NAME, DIA, DIA_cm, and BA_sqft.

Variables to use: DIA (diameter in inches, from tree data), COMMON_NAME (species name, from ref after join).
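
As a quick sanity check on the two conversions, consider a hypothetical tree with DIA = 12 inches (a value chosen only because it makes the arithmetic easy):

```latex
\[
DIA_{cm} = 12 \times 2.54 = 30.48
\qquad
BA = \frac{\pi}{4} \times \left(\frac{12}{12}\right)^2 = \frac{\pi}{4} \approx 0.785
\]
```

A 12-inch tree should therefore show about 30.48 cm and about 0.785 square feet in the new columns.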

(c) (5 pts) Write a single chained command using the pipe operator (%>%) that:

Starts with tn_live
Filters to trees with DIA > 20 inches
Selects only COMMON_NAME, DIA, and HT
Arranges by DIA in descending order
Shows the top 10 rows

Variables to use: DIA (diameter, inches), HT (height, feet), COMMON_NAME (species name).

Hint: arrange(desc(DIA)) sorts largest first; head(10) or slice_head(n = 10) limits output.
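
The sorting and truncating part of the hint can be sketched on a made-up table. The table plots and its columns plot_id and ba are hypothetical.

```r
library(dplyr)

# Hypothetical toy table (not the exam data)
plots <- tibble(
  plot_id = c("A", "B", "C", "D", "E"),
  ba      = c(80, 145, 60, 120, 95)
)

top3 <- plots %>%
  arrange(desc(ba)) %>%   # largest ba first
  slice_head(n = 3)       # keep the first 3 rows after sorting

print(top3)
```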

(d) (5 pts) How many unique species (unique COMMON_NAME values) are among live trees in Tennessee?

Hint: you can use distinct(), n_distinct(), or length(unique()).
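
The three options in the hint give the same count. The sketch below checks this on a made-up column; the table obs and its values are hypothetical.

```r
library(dplyr)

# Hypothetical observations with repeated values
obs <- tibble(genus = c("Acer", "Pinus", "Acer", "Quercus", "Pinus"))

n1 <- n_distinct(obs$genus)                # dplyr helper applied to a vector
n2 <- length(unique(obs$genus))            # base R equivalent
n3 <- obs %>% distinct(genus) %>% nrow()   # one row per unique value, then count rows

print(c(n1, n2, n3))
```

Note that all three count NA as a value of its own if the column contains missing entries.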

Question 4: Collapsing Data — Group Summaries (20 points)

(a) (7 pts) Using the full tree data (all 31 states, live trees only: STATUSCD == 1), calculate the following summary statistics for each state (STATE_ABBR):

n_trees: total number of live trees (hint: n())
n_species: number of unique species (hint: n_distinct(SPCD))
mean_DIA: mean diameter (hint: mean(DIA, na.rm = TRUE))

Sort the result by n_species in descending order. Which state has the most species?

Variables to use: STATUSCD (1 = live), STATE_ABBR, SPCD, DIA.
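
The helpers named in the hints can be seen side by side on a made-up table. This is a sketch only: toy, site, sp, and size are hypothetical names, and the point is to show why na.rm = TRUE matters inside summarise().

```r
library(dplyr)

# Hypothetical toy data with one missing size
toy <- tibble(
  site = c("X", "X", "X", "Y", "Y"),
  sp   = c(1, 1, 2, 3, 3),
  size = c(10, NA, 14, 8, 12)
)

by_site <- toy %>%
  group_by(site) %>%
  summarise(
    n_rows    = n(),                       # rows in each group
    n_sp      = n_distinct(sp),            # unique sp values in each group
    mean_raw  = mean(size),                # NA for any group that contains an NA
    mean_size = mean(size, na.rm = TRUE)   # NA dropped before averaging
  )

print(by_site)
```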

(b) (7 pts) For live trees in Tennessee only (STATE_ABBR == "TN", STATUSCD == 1), find the top 5 most common species by tree count. Your output should show:

COMMON_NAME: species name
n: number of trees (hint: n())
mean_DIA: mean diameter in inches (hint: mean(DIA, na.rm = TRUE))
mean_HT: mean height in feet (hint: mean(HT, na.rm = TRUE))

Sort by count (largest first) and display only the top 5.

Variables to use: STATE_ABBR, STATUSCD, COMMON_NAME, DIA, HT.

(c) (6 pts) Using live trees from all 31 states (STATUSCD == 1), calculate the number of live trees and mean diameter for each MAJOR_SPGRPCD group.

For reference, the MAJOR_SPGRPCD codes are:

Code	Group
1	Pine
2	Other softwood
3	Soft hardwood
4	Hard hardwood

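Variables to use: STATUSCD, MAJOR_SPGRPCD, DIA.

Note: if both files contain MAJOR_SPGRPCD, as listed in the Quick Reference, the left_join() in Question 3(a) keeps both copies and renames them MAJOR_SPGRPCD.x (from tree) and MAJOR_SPGRPCD.y (from ref). Check names(tree) and group by whichever copy is present.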

Question 5: Merging and Appending (20 points)

(a) (5 pts) Concept question. In your own words (as code comments), explain the difference between left_join() and inner_join().

Suppose the tree data has 100 rows and 3 of those rows have an SPCD value that does not exist in the reference table (ref):

How many rows would left_join(tree, ref, by = "SPCD") return?
How many rows would inner_join(tree, ref, by = "SPCD") return?
What happens to the COMMON_NAME column for those 3 unmatched rows in a left_join?

Write your answers as comments (#) in your script.

Variables involved: SPCD (the join key in both files), COMMON_NAME (from ref).

(b) (5 pts) Use anti_join() to find species codes (SPCD) in the reference table (ref) that are not present in the tree data (tree). How many species in the reference table have zero trees in our eastern US dataset? Why might this be? (Answer the “why” as a comment.)

Hint: anti_join(A, B, by = "key") returns rows in A that have no match in B. Direction matters!
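
To show why direction matters, the sketch below runs anti_join() both ways on two made-up tables. The tables a and b and the column names key and label are hypothetical.

```r
library(dplyr)

# Hypothetical tables that share a key column
a <- tibble(key = c(1, 2, 3, 4), label = c("w", "x", "y", "z"))
b <- tibble(key = c(3, 4, 5))

only_in_a <- anti_join(a, b, by = "key")   # rows of a whose key never appears in b
only_in_b <- anti_join(b, a, by = "key")   # rows of b whose key never appears in a

print(only_in_a)
print(only_in_b)
```

The result always has the columns of the first table only; nothing from the second table is added.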

Variables to use: SPCD (the join key in both files).

(c) (5 pts) Create two separate data frames:

fl_trees: live trees in Florida (STATE_ABBR == "FL", STATUSCD == 1)
ga_trees: live trees in Georgia (STATE_ABBR == "GA", STATUSCD == 1)

Use bind_rows() to combine (append) them into a single data frame called fl_ga. Verify that the number of rows in fl_ga equals nrow(fl_trees) + nrow(ga_trees).

Variables to use: STATE_ABBR, STATUSCD.

Hint: bind_rows() stacks data frames on top of each other (adds rows). This is “appending.”
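
A minimal sketch of appending, using two made-up tables (north, south, id, and zone are hypothetical names), with the same row-count check the question asks for:

```r
library(dplyr)

# Hypothetical tables with the same columns
north <- tibble(id = 1:3, zone = "N")
south <- tibble(id = 4:5, zone = "S")

stacked <- bind_rows(north, south)   # rows of south are placed under rows of north

print(nrow(stacked))
print(nrow(stacked) == nrow(north) + nrow(south))
```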

(d) (5 pts) Suppose you have the following small data frame of state-level information:

state_info <- data.frame(
  STATE_ABBR = c("TN", "NC", "FL", "GA", "VA"),
  region = c("Mid-South", "Southeast", "Southeast", "Southeast", "Mid-Atlantic")
)

Write code to:

First, create a summary with one row per state showing the total number of live trees (STATUSCD == 1) per state. (Hint: use count() or group_by() + summarise(n = n()))
Then, join state_info onto that summary using STATE_ABBR as the key.

Answer in a comment: Which type of join should you use so that all 31 states remain in the result, even those not listed in state_info? What will the region column show for states like “ME” or “WI” that are not in state_info?

Variables to use: STATE_ABBR (the join key), STATUSCD.

Question 6: Putting It All Together (15 points)

This is more challenging — take your time and build it step by step.

A colleague asks: “Among the 10 most common tree species in the eastern US, which species tends to grow at the highest latitudes and which at the lowest?”

Write a complete pipeline that:

- Starts with the full tree data (already joined with ref).
- Filters to live trees only (STATUSCD == 1).
- Identifies the 10 most common species by total tree count. Use COMMON_NAME to identify species. (Hint: count(COMMON_NAME, sort = TRUE) then slice_head(n = 10), then pull(COMMON_NAME) to extract as a vector.)
- Filters the tree data to include only trees belonging to those top-10 species. (Hint: COMMON_NAME %in% top10)
- For each of those 10 species (group_by(COMMON_NAME)), calculates:
  - n_trees: total number of trees, using n()
  - mean_lat: mean latitude, using mean(LAT, na.rm = TRUE)
  - mean_DIA: mean diameter, using mean(DIA, na.rm = TRUE)
  - wood_type: softwood or hardwood, using first(SFTWD_HRDWD)
- Arranges the result by mean_lat in descending order (highest latitude first).
- Displays the final table.

Then, answer in a comment: Which species has the highest mean latitude? Which has the lowest? Is there a pattern between softwood/hardwood and latitude?

Variables to use: STATUSCD (1 = live), COMMON_NAME (species name), LAT (latitude in decimal degrees), DIA (diameter in inches), SFTWD_HRDWD (“S” = softwood, “H” = hardwood).
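
The "find the most frequent values, then filter to them" pattern from the hints can be sketched on a made-up table. The table sightings and its columns animal and hour are hypothetical and unrelated to the exam data.

```r
library(dplyr)

# Hypothetical toy data
sightings <- tibble(
  animal = c("deer", "fox", "deer", "owl", "fox", "deer", "bat"),
  hour   = c(6, 22, 7, 23, 21, 5, 20)
)

top2 <- sightings %>%
  count(animal, sort = TRUE) %>%   # one row per animal, most frequent first
  slice_head(n = 2) %>%            # keep the two most frequent
  pull(animal)                     # extract the names as a character vector

kept <- sightings %>%
  filter(animal %in% top2)         # keep only rows whose animal is in that vector

print(top2)
print(kept)
```

Saving the vector as its own object makes it easy to print and check before it is used in the filter.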

— END OF EXAM —

Save your file and make sure your name is at the top. Good luck!

ESCI 4/6241 — Spring 2026

Tuesday, April 1, 2026
