1 Exam Format

  • Duration: 1 hour 25 minutes
  • Environment: RStudio only. No internet browser will be available.
  • Allowed help: You may use RStudio’s built-in help() function (e.g., help(left_join) or ?filter) to check syntax or debug code.
  • Data provided: The following two files will be available on the exam computer:
    • fia_eastern31_recent.csv (~2.87 million tree records, 40 columns)
    • REF_SPECIES_trimmed.csv (2,677 species, 9 columns)
  • What to submit: An R script (.R) or R Markdown (.Rmd) file with your answers.

2 What the Exam Covers

The exam tests your ability to write R code using skills from two sources:

2.1 Source 1: Online R Tutorials (Basics through Merging)

From https://simonejdemyr.com/r-tutorials/basics/, you are responsible for the following six tutorials. Practice all scripts and exercises within each tutorial.

2.1.1 Tutorial 1 — Introduction

  • Executing code in R and RStudio (console vs. script)
  • Installing and loading packages with install.packages() and library()
  • R style conventions (variable naming, spacing, commenting)
  • Basic troubleshooting (reading error messages, checking for typos, mismatched parentheses)

2.1.2 Tutorial 2 — Vectors

  • Creating vectors with c(), seq(), rep()
  • Vector types: numeric, character, logical, integer
  • Subsetting vectors by index (x[3]), by condition (x[x > 5]), and by name
  • Modifying vector elements
  • Summary functions: length(), sum(), mean(), min(), max(), range(), table()
  • Handling NA values (na.rm = TRUE)
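As a quick refresher, the vector skills listed above can be sketched on a small made-up vector. The values and the names t1 to t5 are hypothetical and are not taken from the FIA data.

```r
# Hypothetical diameters (inches) for five trees; t3 is missing
dia <- c(t1 = 5.2, t2 = 12.8, t3 = NA, t4 = 21.4, t5 = 8.0)

# Creating vectors with seq() and rep()
evens  <- seq(2, 10, by = 2)             # 2 4 6 8 10
labels <- rep(c("S", "H"), times = 2)    # "S" "H" "S" "H"

# Subsetting by index, by condition, and by name
dia[2]
dia[dia > 10]          # note: the NA element is carried along as NA
dia[which(dia > 10)]   # which() drops the NA
dia["t4"]

# Modifying one element
dia["t5"] <- 8.5

# Summary functions and NA handling
length(dia)
mean(dia)                  # NA, because one value is missing
mean(dia, na.rm = TRUE)    # mean of the four non-missing values
range(dia, na.rm = TRUE)
table(labels)
```

The point to notice is that a logical condition evaluated on an NA yields NA, so conditional subsetting can return extra NA entries unless you wrap the condition in which() or remove missing values first.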

2.1.3 Tutorial 3 — Dataset Basics

  • What a data frame is (a collection of vectors as columns)
  • Creating data frames with data.frame()
  • Reading data from CSV files: read.csv() (base R), read_csv() (readr package), fread() (data.table package)
  • Inspecting data: head(), tail(), str(), summary(), nrow(), ncol(), names()
  • Accessing columns with $ and [[ ]]
  • Tidy data principles (rows = observations, columns = variables)

2.1.4 Tutorial 4 — Modifying Data

  • Subsetting rows with filter()
  • Sorting with arrange() (ascending and descending with desc())
  • Extracting unique rows with distinct()
  • Renaming variables with rename()
  • Dropping or selecting variables with select()
  • Creating new variables with mutate()
  • The pipe operator %>% for chaining operations

2.1.5 Tutorial 5 — Collapsing Data

  • Grouping data with group_by()
  • Summarizing with summarise() / summarize()
  • Common summary functions: n(), n_distinct(), mean(), median(), sd(), min(), max(), sum()
  • Combining group_by() + summarise() for category-level statistics
  • Counting with count() (shortcut for group_by() + summarise(n = n()))
  • Ungrouping with ungroup()
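The relationship between count(), group_by() + summarise(), and ungroup() can be checked on a toy table. This is only a sketch: the six rows below are invented, and toy, by_hand, shortcut, and per_species are illustrative names.

```r
library(dplyr)

# Made-up records: state and species code for six hypothetical trees
toy <- tibble(
  STATE_ABBR = c("TN", "TN", "NC", "TN", "NC", "TN"),
  SPCD       = c(131, 131, 802, 802, 131, 131)
)

# Two ways to count rows per state
by_hand  <- toy %>% group_by(STATE_ABBR) %>% summarise(n = n())
shortcut <- toy %>% count(STATE_ABBR)
identical(by_hand$n, shortcut$n)   # both approaches give the same counts

# Grouping by two variables: summarise() removes only the last grouping level
per_species <- toy %>%
  group_by(STATE_ABBR, SPCD) %>%
  summarise(n = n(), .groups = "drop_last")
group_vars(per_species)            # still grouped by STATE_ABBR

# ungroup() clears the remaining grouping before further steps
per_species <- per_species %>% ungroup()
group_vars(per_species)            # no grouping variables left
```

Leftover grouping matters because later verbs such as mutate() or arrange() would otherwise keep operating within each state rather than on the whole table.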

2.1.6 Tutorial 6 — Merging and Appending

  • Merging (adding columns): left_join(), inner_join(), right_join(), full_join()
  • Understanding the by argument for specifying key columns
  • Appending (adding rows): bind_rows()
  • Diagnosing join problems: anti_join() to find non-matching rows
  • Base R alternative: merge() with all.x, all.y, all arguments
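For the base R alternative, the all.x, all.y, and all arguments of merge() map onto the dplyr join types. The sketch below uses two tiny invented tables (trees and species are illustrative stand-ins, not the exam files) so the row counts are easy to verify by hand.

```r
# Toy stand-ins for the tree and species tables (all values are illustrative)
trees   <- data.frame(TREE_ID = 1:4, SPCD = c(131, 802, 131, 999))
species <- data.frame(SPCD = c(131, 802, 316),
                      COMMON_NAME = c("loblolly pine", "white oak", "red maple"))

inner <- merge(trees, species, by = "SPCD")                # like inner_join(): matches only
left  <- merge(trees, species, by = "SPCD", all.x = TRUE)  # like left_join(): all trees kept
right <- merge(trees, species, by = "SPCD", all.y = TRUE)  # like right_join(): all species kept
full  <- merge(trees, species, by = "SPCD", all = TRUE)    # like full_join(): everything kept

nrow(inner)   # 3: the tree with SPCD 999 has no match
nrow(left)    # 4: unmatched tree kept with NA for COMMON_NAME
nrow(right)   # 4: three matched trees plus the unused species 316
nrow(full)    # 5: all trees plus the unused species
```

One practical difference: merge() sorts the result by the key column by default, whereas the dplyr joins keep the row order of the first table.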

2.2 Source 2: Class Lab Materials

You should be familiar with the content of the following two documents:

2.2.1 FIA Data Dictionary

  • Know the key variables and what they represent (e.g., DIA, HT, SPCD, STATUSCD, STATECD, STATE_ABBR, PLT_CN, TREE_CN, GRIDID, etc.)
  • Understand the difference between PLOT (not unique across counties) and PLT_CN (unique plot identifier)
  • Know that STATUSCD == 1 means live tree and COND_STATUS_CD == 1 means forested
  • Know the species group codes: MAJOR_SPGRPCD (1 = Pine, 2 = Other softwood, 3 = Soft hardwood, 4 = Hard hardwood) and SFTWD_HRDWD (“S” vs “H”)

2.2.2 Lab 0 — Loading and Joining

  • Reading large CSV files with read_csv() or fread() instead of read.csv()
  • Joining the tree data with REF_SPECIES using left_join(tree, ref_species, by = "SPCD")
  • Using anti_join() to find unmatched records
  • Subsetting to a single state with filter(STATE_ABBR == "TN")
  • Verifying a join (checking for NA values in joined columns)
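One way to verify a join is to compare row counts and then look for NA values in a column that came from the second table. The following is a minimal sketch on invented data; tree_toy and ref_toy are hypothetical stand-ins for the real tree and REF_SPECIES tables.

```r
library(dplyr)

# Made-up tables: four trees, two known species
tree_toy <- tibble(TREE_CN = 1:4, SPCD = c(131, 802, 131, 999))
ref_toy  <- tibble(SPCD = c(131, 802),
                   COMMON_NAME = c("loblolly pine", "white oak"))

# Check 1: the key should be unique in the lookup table,
# otherwise the join would duplicate tree rows
ref_toy %>% count(SPCD) %>% filter(n > 1)     # zero rows means the key is unique

joined <- tree_toy %>% left_join(ref_toy, by = "SPCD")

# Check 2: a left join on a unique key keeps the original row count
nrow(joined) == nrow(tree_toy)

# Check 3: NA in a joined column marks trees whose SPCD found no match
sum(is.na(joined$COMMON_NAME))

# Check 4: list the unmatched codes directly
tree_toy %>% anti_join(ref_toy, by = "SPCD") %>% distinct(SPCD)
```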

3 How to Prepare

The best way to prepare is to practice the tutorial scripts using our FIA datasets. Below are example ways to apply each tutorial’s concepts to the FIA data.
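The examples in this section assume two data frames named tree and ref_species are already in memory. A minimal sketch of loading them with readr follows. It assumes both CSV files sit in the current working directory, which may not match the setup on the exam computer, and tree_file and species_file are illustrative names.

```r
library(readr)

# File names as listed under Exam Format; the location is an assumption
tree_file    <- "fia_eastern31_recent.csv"
species_file <- "REF_SPECIES_trimmed.csv"

if (file.exists(tree_file) && file.exists(species_file)) {
  tree        <- read_csv(tree_file)
  ref_species <- read_csv(species_file)
  print(dim(tree))          # rows and columns of the tree data
  print(dim(ref_species))   # rows and columns of the species table
} else {
  message("Data files not found in ", getwd(), "; check the working directory.")
}
```

If the files are not found, getwd() shows where R is looking.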

3.1 Vectors (Tutorial 2) — Applied to FIA

# Extract a vector of diameters for Tennessee live trees
tn_dia <- tree$DIA[tree$STATE_ABBR == "TN" & tree$STATUSCD == 1]

# Summarize it
length(tn_dia)
mean(tn_dia, na.rm = TRUE)
max(tn_dia, na.rm = TRUE)

# Logical subsetting: how many trees have DIA > 20 inches?
sum(tn_dia > 20, na.rm = TRUE)

# What proportion of trees are large (DIA > 20)?
mean(tn_dia > 20, na.rm = TRUE)

3.2 Dataset Basics (Tutorial 3) — Applied to FIA

# Inspect the data structure
str(tree)
summary(tree$DIA)
nrow(tree)
ncol(tree)
names(tree)

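# Access specific columns
# (COMMON_NAME comes from REF_SPECIES; if your tree object does not have it yet,
#  run the left_join() from Lab 0 first)
head(tree$COMMON_NAME)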
head(tree[["DIA"]])

# Quick look at a subset of columns
head(tree[, c("STATE_ABBR", "SPCD", "DIA", "HT")])

3.3 Modifying Data (Tutorial 4) — Applied to FIA

library(dplyr)

# filter: live trees in Tennessee
tn_live <- tree %>% filter(STATE_ABBR == "TN", STATUSCD == 1)

# arrange: sort by diameter, largest first
tn_live %>% arrange(desc(DIA)) %>% head(10)

# select: keep only key columns (COMMON_NAME is available only after joining with REF_SPECIES)
tn_slim <- tn_live %>% select(TREE_CN, SPCD, COMMON_NAME, DIA, HT, LAT, LON)

# mutate: create new variables
tn_slim <- tn_slim %>%
  mutate(
    DIA_cm = DIA * 2.54,
    HT_m   = HT * 0.3048,
    BA_sqft = pi / 4 * (DIA / 12)^2   # basal area in square feet
  )

# distinct: unique species in Tennessee
tn_live %>% distinct(SPCD, COMMON_NAME) %>% nrow()

# rename
tn_slim <- tn_slim %>% rename(diameter_in = DIA, height_ft = HT)

3.4 Collapsing Data (Tutorial 5) — Applied to FIA

# Count trees per state
tree %>%
  filter(STATUSCD == 1) %>%
  count(STATE_ABBR, sort = TRUE)

# Mean diameter by species: the 10 species with the largest mean DIA,
# restricted to species with more than 1,000 live trees
tree %>%
  filter(STATUSCD == 1) %>%
  group_by(COMMON_NAME) %>%
  summarise(
    n_trees  = n(),
    mean_DIA = mean(DIA, na.rm = TRUE),
    max_DIA  = max(DIA, na.rm = TRUE)
  ) %>%
  filter(n_trees > 1000) %>%
  arrange(desc(mean_DIA)) %>%
  head(10)

# Number of species per state
tree %>%
  filter(STATUSCD == 1) %>%
  group_by(STATE_ABBR) %>%
  summarise(n_species = n_distinct(SPCD)) %>%
  arrange(desc(n_species))

3.5 Merging and Appending (Tutorial 6) — Applied to FIA

# The core join you already know
tree_joined <- tree %>% left_join(ref_species, by = "SPCD")

# Find trees whose SPCD is NOT in the reference table
tree %>% anti_join(ref_species, by = "SPCD") %>% distinct(SPCD)

# Find reference species NOT in our tree data (western/tropical species)
ref_species %>% anti_join(tree, by = "SPCD") %>% nrow()

# Appending: combine Tennessee and North Carolina subsets
tn <- tree %>% filter(STATE_ABBR == "TN")
nc <- tree %>% filter(STATE_ABBR == "NC")
tn_nc <- bind_rows(tn, nc)
nrow(tn) + nrow(nc) == nrow(tn_nc)  # should be TRUE

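# inner_join vs left_join: compare the row counts.
# Assuming SPCD is unique in ref_species, left_join keeps every tree row, while
# inner_join drops trees whose SPCD has no match in ref_species.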
nrow(tree %>% left_join(ref_species, by = "SPCD"))
nrow(tree %>% inner_join(ref_species, by = "SPCD"))

4 Sample Exam-Style Questions

Below are examples of the types of questions you may see. These are practice only — the actual exam will have different questions.

Q1 (Vectors): Extract the heights (HT) of all live loblolly pine trees (SPCD == 131) in the dataset. How many trees are there? What is the mean height? What is the tallest loblolly pine in the dataset?

Q2 (Data frames): How many columns does fia_eastern31_recent.csv have? How many rows? List the column names that relate to tree measurements (not plot or location info).

Q3 (Modifying data): Using dplyr, create a new data frame called big_oaks that contains only live trees from the genus “Quercus” with a diameter greater than 24 inches. Include only the columns: TREE_CN, COMMON_NAME, DIA, HT, STATE_ABBR. Sort the result by DIA in descending order.

Q4 (Collapsing data): For each state (STATE_ABBR), calculate the total number of live trees, the number of unique species, and the mean diameter. Which state has the highest species richness?

Q5 (Merging): Suppose someone gives you a separate CSV with state-level climate data (STATE_ABBR, mean_temp, annual_precip). Write the code to join this to a state-level summary of the tree data. What type of join would you use and why?

Q6 (Appending): Write code to split the tree data into softwoods and hardwoods (using SFTWD_HRDWD), then re-combine them with bind_rows(). Verify the total row count matches the original.

Q7 (Anti-join): A colleague gives you a list of 50 species codes they are interested in (as a data frame called target_species with one column SPCD). Write code to find which of their 50 species are NOT present in our FIA data.


5 Reminders

  • Load packages first. You will likely need: library(dplyr) and library(readr) (or library(data.table)).
  • Use read_csv() or fread() to read the large tree data. Do NOT use read.csv() — it will be extremely slow.
  • Filter to live trees (STATUSCD == 1) for most questions unless told otherwise.
  • Join before answering species-name questions. If a question asks about a species by common name, you need to join with REF_SPECIES first (or filter by SPCD directly if the code is given).
  • Use help() in RStudio if you forget a function’s arguments. For example, help(left_join) or ?arrange.
  • Handle missing values. Many summary functions (mean(), sum(), max(), etc.) return NA if any value is missing, so include na.rm = TRUE.
  • Comment your code. Partial credit may be given for correct logic even if the code has a small syntax error.
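The reminder about missing values can be illustrated with a short made-up vector. The heights below are hypothetical; the sketch also shows that length() has no na.rm argument, which matters when a question asks how many trees were measured.

```r
# Made-up heights in feet, with one missing value
ht <- c(62, NA, 48, 75)

mean(ht)                  # NA, because one value is missing
mean(ht, na.rm = TRUE)    # mean of the three non-missing heights
max(ht, na.rm = TRUE)

length(ht)                # counts every element, including the NA
sum(is.na(ht))            # number of missing values
sum(!is.na(ht))           # number of non-missing values
```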

Good luck! The best preparation is hands-on practice with R and the FIA data.