Name: ____________________

Time: 1 hour 25 minutes | Total: 100 points

Rules:


Quick Reference: Data Files

fia_eastern31_recent.csv — Tree Data (each row = one tree)

Variable       Type       Description
TREE_CN        numeric    Unique tree identifier
PLT_CN         numeric    Unique plot identifier
STATECD        integer    State FIPS code (e.g., 47 = Tennessee)
STATE_ABBR     character  Two-letter state abbreviation (e.g., “TN”, “FL”)
SPCD           integer    Species code (key to join with REF_SPECIES)
DIA            numeric    Diameter at breast height (inches)
HT             numeric    Total tree height (feet)
STATUSCD       integer    1 = live tree, 2 = standing dead
LAT            numeric    Plot latitude (decimal degrees)
LON            numeric    Plot longitude (decimal degrees, negative)
BALIVE         numeric    Basal area of live trees on the plot (sq ft/acre)
MAJOR_SPGRPCD  integer    1 = Pine, 2 = Other softwood, 3 = Soft hardwood, 4 = Hard hardwood

REF_SPECIES_trimmed.csv — Species Reference (each row = one species)

Variable        Type       Description
SPCD            integer    Species code (key to join with tree data)
COMMON_NAME     character  Common name (e.g., “red maple”, “loblolly pine”)
GENUS           character  Genus name (e.g., “Acer”, “Pinus”)
SPECIES         character  Species epithet (e.g., “rubrum”, “taeda”)
SPECIES_SYMBOL  character  USDA PLANTS symbol (e.g., “ACRU”)
E_SPGRPCD       integer    Eastern species group code
MAJOR_SPGRPCD   integer    1 = Pine, 2 = Other softwood, 3 = Soft hardwood, 4 = Hard hardwood
SFTWD_HRDWD     character  “S” = softwood, “H” = hardwood
WOODLAND        character  “Y” if woodland species

Getting started: Load your packages and read the data first.

library(readr)   # provides read_csv(); data.table::fread() is an alternative reader
library(dplyr)

tree <- read_csv("fia_eastern31_recent.csv", show_col_types = FALSE)
ref  <- read_csv("REF_SPECIES_trimmed.csv", show_col_types = FALSE)

Question 1: Vectors and Basic R (10 points)

(a) (3 pts) Create a numeric vector called diameters containing these 8 values:

5.2, 12.0, 8.7, 3.1, 15.4, 9.9, 22.6, 7.3

Calculate the mean, maximum, and length of this vector.

Hint: use c() to create vectors; mean(), max(), length() for summaries.

(b) (3 pts) Using logical subsetting on diameters, extract only values greater than 10. How many values meet this condition?

Hint: vector_name[vector_name > value] extracts elements meeting a condition.
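Illustration of the pattern in the hint, as a minimal sketch on a made-up vector. The name heights_ft and its values are hypothetical and are not part of the exam data.

```r
# Hypothetical example vector (not the exam's diameters)
heights_ft <- c(42, 65, 18, 77, 51)

# The comparison returns one TRUE/FALSE value per element
is_tall <- heights_ft > 50

# Logical subsetting keeps only the elements where the condition is TRUE
tall <- heights_ft[is_tall]

# sum() on a logical vector counts the TRUE values
n_tall <- sum(is_tall)

print(tall)
print(n_tall)
```

Counting with sum() and counting with length() on the subset give the same number; either is acceptable.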

(c) (4 pts) Create a character vector called states containing: "TN", "NC", "FL", "GA", "VA".

Hint: use c() with quoted strings; access elements with [ ] indexing.


Question 2: Data Frame Inspection (15 points)

(a) (5 pts) After reading fia_eastern31_recent.csv into an object called tree, answer:

Hint: useful functions include nrow(), ncol(), names(), str(), class().

(b) (5 pts) Use summary() on the DIA column (diameter at breast height, in inches):

Hint: summary(tree$DIA) returns a six-number summary (Min, 1st Qu., Median, Mean, 3rd Qu., Max). If the column contains NAs, their count appears as an extra “NA’s” entry at the end.
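Illustration of how summary() reports missing values, as a small sketch on a made-up numeric vector. The vector x is hypothetical; the exam column may or may not contain NAs.

```r
# Hypothetical values with one missing entry (not the exam data)
x <- c(4.1, 7.8, NA, 12.3, 5.0)

# summary() drops the NA when computing the statistics
# and reports the number of NAs as a separate entry
s <- summary(x)
print(s)

# The same count, computed directly
n_missing <- sum(is.na(x))
print(n_missing)
```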

(c) (5 pts) Display the first 10 rows of just these four columns: STATE_ABBR, SPCD, DIA, HT.

Hint: you can use base R bracket notation data[rows, columns] with head(), or dplyr’s select() with head().
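Illustration of the two approaches in the hint, sketched on mtcars (a data set that ships with R) rather than the exam data. The column names mpg, cyl, and hp belong to mtcars only.

```r
library(dplyr)

# Base R: data[rows, columns], with the columns given by name
base_view <- mtcars[1:5, c("mpg", "cyl", "hp")]

# dplyr: select() picks the columns, head() limits the rows
dplyr_view <- mtcars %>%
  select(mpg, cyl, hp) %>%
  head(5)

print(base_view)
print(dplyr_view)
```

Both versions should show the same five rows and three columns.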


Question 3: Modifying Data with dplyr (20 points)

(a) (5 pts) First, join the species reference table to the tree data so that each tree gets its species name. Use SPCD as the key:

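# Drop MAJOR_SPGRPCD from ref first: tree already has it, and a duplicate would come back as MAJOR_SPGRPCD.x / MAJOR_SPGRPCD.y
tree <- tree %>% left_join(select(ref, -MAJOR_SPGRPCD), by = "SPCD")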

Now, using filter(), create a new data frame called tn_live that contains only live trees in Tennessee. How many rows does tn_live have?

Variables to use: STATE_ABBR (value: "TN" for Tennessee), STATUSCD (value: 1 for live trees).

(b) (5 pts) Using mutate(), add two new columns to tn_live: DIA_cm (diameter converted to centimeters) and BA_sqft (tree basal area in square feet).

Show the first 6 rows of COMMON_NAME, DIA, DIA_cm, and BA_sqft.

Variables to use: DIA (diameter in inches, from tree data), COMMON_NAME (species name, from ref after join).

(c) (5 pts) Write a single chained command using the pipe operator (%>%) that:

  1. Starts with tn_live
  2. Filters to trees with DIA > 20 inches
  3. Selects only COMMON_NAME, DIA, and HT
  4. Arranges by DIA in descending order
  5. Shows the top 10 rows

Variables to use: DIA (diameter, inches), HT (height, feet), COMMON_NAME (species name).

Hint: arrange(desc(DIA)) sorts largest first; head(10) or slice_head(n = 10) limits output.

(d) (5 pts) How many unique species (unique COMMON_NAME values) are among live trees in Tennessee?

Hint: you can use distinct(), n_distinct(), or length(unique()).
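Illustration of the three options in the hint, sketched on a made-up data frame. The names plots and species are hypothetical and are not part of the exam data.

```r
library(dplyr)

# Hypothetical toy data (not the exam data)
plots <- data.frame(
  species = c("oak", "pine", "oak", "maple", "pine", "oak")
)

# Option 1: keep one row per distinct value, then count the rows
n1 <- plots %>% distinct(species) %>% nrow()

# Option 2: n_distinct() inside summarise()
n2 <- plots %>% summarise(n = n_distinct(species)) %>% pull(n)

# Option 3: base R
n3 <- length(unique(plots$species))

print(c(n1, n2, n3))
```

All three approaches should agree on the same count.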


Question 4: Collapsing Data — Group Summaries (20 points)

(a) (7 pts) Using the full tree data (all 31 states, live trees only: STATUSCD == 1), calculate the following summary statistics for each state (STATE_ABBR):

Sort the result by n_species in descending order. Which state has the most species?

Variables to use: STATUSCD (1 = live), STATE_ABBR, SPCD, DIA.

(b) (7 pts) For live trees in Tennessee only (STATE_ABBR == "TN", STATUSCD == 1), find the top 5 most common species by tree count. Your output should show:

Sort by count (largest first) and display only the top 5.

Variables to use: STATE_ABBR, STATUSCD, COMMON_NAME, DIA, HT.

(c) (6 pts) Using live trees from all 31 states (STATUSCD == 1), calculate the number of live trees and the mean diameter for each MAJOR_SPGRPCD group.

For reference, the MAJOR_SPGRPCD codes are:

Code  Group
1     Pine
2     Other softwood
3     Soft hardwood
4     Hard hardwood

Variables to use: STATUSCD, MAJOR_SPGRPCD, DIA.


Question 5: Merging and Appending (20 points)

(a) (5 pts) Concept question. In your own words (as code comments), explain the difference between left_join() and inner_join().

Suppose the tree data has 100 rows and 3 of those rows have an SPCD value that does not exist in the reference table (ref):

Write your answers as comments (#) in your script.

Variables involved: SPCD (the join key in both files), COMMON_NAME (from ref).

(b) (5 pts) Use anti_join() to find species codes (SPCD) in the reference table (ref) that are not present in the tree data (tree). How many species in the reference table have zero trees in our eastern US dataset? Why might this be? (Answer the “why” as a comment.)

Hint: anti_join(A, B, by = "key") returns rows in A that have no match in B. Direction matters!
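Illustration of why direction matters, as a minimal sketch on two made-up tables. The names codes, obs, id, and label are hypothetical and are not the exam files.

```r
library(dplyr)

# Hypothetical lookup table and observation table (not the exam data)
codes <- data.frame(id = c(1, 2, 3, 4), label = c("a", "b", "c", "d"))
obs   <- data.frame(id = c(2, 2, 4, 9))

# Rows of codes that have no match in obs
unused_codes <- anti_join(codes, obs, by = "id")

# Swapping the arguments asks the opposite question:
# rows of obs that have no match in codes
unknown_obs <- anti_join(obs, codes, by = "id")

print(unused_codes)
print(unknown_obs)
```

The first call keeps the columns of codes, the second keeps the columns of obs, so the two results differ in both rows and columns.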

Variables to use: SPCD (the join key in both files).

(c) (5 pts) Create two separate data frames: fl_trees (trees in Florida) and ga_trees (trees in Georgia).

Use bind_rows() to combine (append) them into a single data frame called fl_ga. Verify that the number of rows in fl_ga equals nrow(fl_trees) + nrow(ga_trees).

Variables to use: STATE_ABBR, STATUSCD.

Hint: bind_rows() stacks data frames on top of each other (adds rows). This is “appending.”
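Illustration of appending and of the row-count check, sketched on two made-up data frames. The names a, b, site, and dbh are hypothetical and are not part of the exam data.

```r
library(dplyr)

# Two hypothetical data frames with the same columns (not the exam data)
a <- data.frame(site = c("A", "A"), dbh = c(10.2, 14.8))
b <- data.frame(site = c("B", "B", "B"), dbh = c(8.1, 9.9, 21.0))

# bind_rows() stacks b underneath a, matching columns by name
stacked <- bind_rows(a, b)

# Verification: the appended result has exactly as many rows as its parts
rows_match <- nrow(stacked) == nrow(a) + nrow(b)

print(stacked)
print(rows_match)
```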

(d) (5 pts) Suppose you have the following small data frame of state-level information:

state_info <- data.frame(
  STATE_ABBR = c("TN", "NC", "FL", "GA", "VA"),
  region = c("Mid-South", "Southeast", "Southeast", "Southeast", "Mid-Atlantic")
)

Write code to:

  1. First, create a summary with one row per state showing the total number of live trees (STATUSCD == 1) per state. (Hint: use count() or group_by() + summarise(n = n()))
  2. Then, join state_info onto that summary using STATE_ABBR as the key.

Answer in a comment: Which type of join should you use so that all 31 states remain in the result, even those not listed in state_info? What will the region column show for states like “ME” or “WI” that are not in state_info?

Variables to use: STATE_ABBR (the join key), STATUSCD.


Question 6: Putting It All Together (15 points)

This is more challenging — take your time and build it step by step.

A colleague asks: “Among the 10 most common tree species in the eastern US, which species tends to grow at the highest latitudes and which at the lowest?”

Write a complete pipeline that:

  1. Starts with the full tree data (already joined with ref).
  2. Filters to live trees only (STATUSCD == 1).
  3. Identifies the 10 most common species by total tree count. Use COMMON_NAME to identify species. (Hint: count(COMMON_NAME, sort = TRUE) then slice_head(n = 10), then pull(COMMON_NAME) to extract as a vector.)
  4. Filters the tree data to include only trees belonging to those top-10 species. (Hint: COMMON_NAME %in% top10)
  5. For each of those 10 species (group_by(COMMON_NAME)), calculates:
    • n_trees: total number of trees — n()
    • mean_lat: mean latitude — mean(LAT, na.rm = TRUE)
    • mean_DIA: mean diameter — mean(DIA, na.rm = TRUE)
    • wood_type: softwood or hardwood — first(SFTWD_HRDWD)
  6. Arranges the result by mean_lat in descending order (highest latitude first).
  7. Displays the final table.

Then, answer in a comment: Which species has the highest mean latitude? Which has the lowest? Is there a pattern between softwood/hardwood and latitude?

Variables to use: STATUSCD (1 = live), COMMON_NAME (species name), LAT (latitude in decimal degrees), DIA (diameter in inches), SFTWD_HRDWD (“S” = softwood, “H” = hardwood).


— END OF EXAM —

Save your file and make sure your name is at the top. Good luck!