- help() function (e.g., help(left_join) or ?filter) to check syntax or debug code.
- fia_eastern31_recent.csv (~2.87 million tree records, 40 columns)
- REF_SPECIES_trimmed.csv (2,677 species, 9 columns)
- (.R) or R Markdown (.Rmd) file with your answers.

The exam tests your ability to write R code using skills from two sources:
From https://simonejdemyr.com/r-tutorials/basics/, you are responsible for the following six tutorials. Practice all scripts and exercises within each tutorial.
- install.packages() and library()
- c(), seq(), rep()
- (x[3]), by condition (x[x > 5]), and by name
- length(), sum(), mean(), min(), max(), range(), table()
- NA values (na.rm = TRUE)
- data.frame()
- (read.csv(), read_csv(), fread())
- head(), tail(), str(), summary(), nrow(), ncol(), names()
- $ and [[ ]]
- filter()
- arrange() (ascending and descending with desc())
- distinct()
- rename()
- select()
- mutate()
- %>% for chaining operations
- group_by()
- summarise() / summarize()
- n(), n_distinct(), mean(), median(), sd(), min(), max(), sum()
- group_by() + summarise() for category-level statistics
- count() (shortcut for group_by() + summarise(n = n()))
- ungroup()
- left_join(), inner_join(), right_join(), full_join()
- by argument for specifying key columns
- bind_rows()
- anti_join() to find non-matching rows
- merge() with all.x, all.y, all arguments

You should be familiar with the content of the following two documents:
- (DIA, HT, SPCD, STATUSCD, STATECD, STATE_ABBR, PLT_CN, TREE_CN, GRIDID, etc.)
- PLOT (not unique across counties) and PLT_CN (unique plot identifier)
- STATUSCD == 1 means live tree and COND_STATUS_CD == 1 means forested
- MAJOR_SPGRPCD (1 = Pine, 2 = Other softwood, 3 = Soft hardwood, 4 = Hard hardwood) and SFTWD_HRDWD ("S" vs "H")
- read_csv() or fread() instead of read.csv()
- left_join(tree, ref_species, by = "SPCD")
- anti_join() to find unmatched records
- filter(STATE_ABBR == "TN")

The best way to prepare is to practice the tutorial scripts using our FIA datasets. Below are example ways to apply each tutorial's concepts to the FIA data.
# Extract a vector of diameters for Tennessee live trees
tn_dia <- tree$DIA[tree$STATE_ABBR == "TN" & tree$STATUSCD == 1]
# Summarize it
length(tn_dia)
mean(tn_dia, na.rm = TRUE)
max(tn_dia, na.rm = TRUE)
# Logical subsetting: how many trees have DIA > 20 inches?
sum(tn_dia > 20, na.rm = TRUE)
# What proportion of trees are large (DIA > 20)?
mean(tn_dia > 20, na.rm = TRUE)
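The vector topics not exercised in the Tennessee example (seq(), rep(), indexing by name, range(), table(), and the effect of na.rm) can be tried on a small vector first. This is only a sketch: the vector `dia`, its names, and its values are invented for illustration and do not come from the FIA data.

```r
# Toy diameters (inches); names and values are made up for illustration
dia <- c(oak = 12.4, pine = 8.1, maple = NA, hickory = 21.7, ash = 5.0)

# Sequences and repetition
seq(5, 25, by = 5)            # 5 10 15 20 25
rep(c("S", "H"), times = 2)   # "S" "H" "S" "H"

# Indexing by position, by condition, and by name
dia[3]
dia[!is.na(dia) & dia > 10]   # drop the NA explicitly, then apply the condition
dia["pine"]

# Summary functions return NA unless missing values are dropped
mean(dia)                     # NA
mean(dia, na.rm = TRUE)       # 11.8
range(dia, na.rm = TRUE)      # 5.0 21.7

# table() counts how often each value occurs (including NA if requested)
size_class <- ifelse(dia > 10, "large", "small")
table(size_class, useNA = "ifany")
```

Note that `dia[dia > 10]` without the `!is.na()` guard would also return an NA element, which is why the condition is written out in full here.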
# Inspect the data structure
str(tree)
summary(tree$DIA)
nrow(tree)
ncol(tree)
names(tree)
# Access specific columns
head(tree$COMMON_NAME)
head(tree[["DIA"]])
# Quick look at a subset of columns
head(tree[, c("STATE_ABBR", "SPCD", "DIA", "HT")])
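The same inspection functions can be practiced on a data frame built by hand with data.frame(), which also makes it easy to see what tail() and [[ ]] return. The table `toy` is hypothetical: its column names mirror the FIA ones, but the rows are made up.

```r
# Hypothetical mini tree table; column names mirror the FIA ones, values are made up
toy <- data.frame(
  STATE_ABBR = c("TN", "TN", "NC", "GA"),
  SPCD       = c(131, 802, 131, 316),
  DIA        = c(10.2, 18.5, 7.9, NA),
  HT         = c(55, 72, 40, 61)
)

str(toy)        # structure: 4 observations of 4 variables
tail(toy, 2)    # last two rows
nrow(toy)
ncol(toy)
names(toy)

# $ and [[ ]] return the same column as a vector
identical(toy$DIA, toy[["DIA"]])   # TRUE

# [[ ]] also accepts a column name stored in a variable, which $ does not
col <- "HT"
mean(toy[[col]])                   # 57
```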
library(dplyr)
# filter: live trees in Tennessee
tn_live <- tree %>% filter(STATE_ABBR == "TN", STATUSCD == 1)
# arrange: sort by diameter, largest first
tn_live %>% arrange(desc(DIA)) %>% head(10)
# select: keep only key columns
tn_slim <- tn_live %>% select(TREE_CN, SPCD, COMMON_NAME, DIA, HT, LAT, LON)
# mutate: create new variables
tn_slim <- tn_slim %>%
  mutate(
    DIA_cm = DIA * 2.54,
    HT_m = HT * 0.3048,
    BA_sqft = pi / 4 * (DIA / 12)^2 # basal area in square feet
  )
# distinct: unique species in Tennessee
tn_live %>% distinct(SPCD, COMMON_NAME) %>% nrow()
# rename
tn_slim <- tn_slim %>% rename(diameter_in = DIA, height_ft = HT)
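The %>% operator passes the result on its left as the first argument of the function on its right, so a pipeline is the same computation as a set of nested calls, written in reading order. A minimal sketch on a made-up table (the tibble `toy` and its values are assumptions for illustration):

```r
library(dplyr)

# Hypothetical mini table; values are made up
toy <- tibble(
  SPCD     = c(131, 131, 802, 316),
  STATUSCD = c(1, 2, 1, 1),
  DIA      = c(10.2, 6.0, 18.5, 7.9)
)

# Nested calls: read from the inside out
nested <- head(arrange(filter(toy, STATUSCD == 1), DIA), 2)

# Piped version: same steps, read top to bottom
piped <- toy %>%
  filter(STATUSCD == 1) %>%
  arrange(DIA) %>%      # ascending is the default; wrap in desc() to reverse
  head(2)

all.equal(nested, piped)   # should be TRUE
```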
# Count trees per state
tree %>%
  filter(STATUSCD == 1) %>%
  count(STATE_ABBR, sort = TRUE)
# Mean diameter by species (top 10 largest)
tree %>%
  filter(STATUSCD == 1) %>%
  group_by(COMMON_NAME) %>%
  summarise(
    n_trees = n(),
    mean_DIA = mean(DIA, na.rm = TRUE),
    max_DIA = max(DIA, na.rm = TRUE)
  ) %>%
  filter(n_trees > 1000) %>%
  arrange(desc(mean_DIA)) %>%
  head(10)
# Number of species per state
tree %>%
  filter(STATUSCD == 1) %>%
  group_by(STATE_ABBR) %>%
  summarise(n_species = n_distinct(SPCD)) %>%
  arrange(desc(n_species))
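Three collapsing topics from the list are not shown in the FIA examples: count() as a shortcut for group_by() + summarise(n = n()), the median() and sd() summaries, and ungroup(). The sketch below tries them on a made-up table; `toy` and its values are assumptions, not FIA records.

```r
library(dplyr)

# Hypothetical mini table; values are made up
toy <- tibble(
  STATE_ABBR = c("TN", "TN", "TN", "NC", "NC"),
  SPCD       = c(131, 131, 802, 131, 316),
  DIA        = c(10, 14, 18, 8, 12)
)

# count() gives the same result as group_by() + summarise(n = n())
by_count   <- toy %>% count(STATE_ABBR)
by_summary <- toy %>% group_by(STATE_ABBR) %>% summarise(n = n())
all.equal(by_count, by_summary)   # should be TRUE

# median() and sd() work inside summarise() just like mean()
state_stats <- toy %>%
  group_by(STATE_ABBR) %>%
  summarise(
    median_DIA = median(DIA, na.rm = TRUE),
    sd_DIA     = sd(DIA, na.rm = TRUE)
  )
state_stats

# group_by() + mutate() keeps every row; ungroup() afterwards so that
# later steps operate on the whole table again
with_means <- toy %>%
  group_by(STATE_ABBR) %>%
  mutate(state_mean_DIA = mean(DIA)) %>%
  ungroup() %>%
  mutate(overall_mean_DIA = mean(DIA))
with_means
```

Without the ungroup() call, the second mutate() would still be computed within each state, and `overall_mean_DIA` would simply repeat the state means.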
# The core join you already know
tree_joined <- tree %>% left_join(ref_species, by = "SPCD")
# Find trees whose SPCD is NOT in the reference table
tree %>% anti_join(ref_species, by = "SPCD") %>% distinct(SPCD)
# Find reference species NOT in our tree data (western/tropical species)
ref_species %>% anti_join(tree, by = "SPCD") %>% nrow()
# Appending: combine Tennessee and North Carolina subsets
tn <- tree %>% filter(STATE_ABBR == "TN")
nc <- tree %>% filter(STATE_ABBR == "NC")
tn_nc <- bind_rows(tn, nc)
nrow(tn) + nrow(nc) == nrow(tn_nc) # should be TRUE
# inner_join vs left_join: what's the difference in row count?
nrow(tree %>% left_join(ref_species, by = "SPCD"))
nrow(tree %>% inner_join(ref_species, by = "SPCD"))
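How the row counts differ between join types is easiest to see on tables small enough to check by hand. The sketch below uses two made-up tables (`trees` and `ref`, with invented codes and placeholder species names) in which one tree has no matching species and one species has no matching tree, and it lists the base R merge() equivalents alongside the dplyr joins.

```r
library(dplyr)

# Hypothetical mini tables; codes and names are made up for illustration
trees <- tibble(TREE_CN = 1:4, SPCD = c(131, 131, 802, 999))
ref   <- tibble(SPCD = c(131, 802, 316),
                COMMON_NAME = c("species A", "species B", "species C"))

nrow(left_join(trees, ref, by = "SPCD"))    # 4: every tree kept, unmatched tree gets NA
nrow(inner_join(trees, ref, by = "SPCD"))   # 3: only trees with a match
nrow(right_join(trees, ref, by = "SPCD"))   # 4: every ref row kept (131 matches twice)
nrow(full_join(trees, ref, by = "SPCD"))    # 5: everything from both sides
nrow(anti_join(trees, ref, by = "SPCD"))    # 1: the tree with SPCD 999

# Base R merge() equivalents
nrow(merge(trees, ref, by = "SPCD"))                 # inner join: 3
nrow(merge(trees, ref, by = "SPCD", all.x = TRUE))   # left join: 4
nrow(merge(trees, ref, by = "SPCD", all.y = TRUE))   # right join: 4
nrow(merge(trees, ref, by = "SPCD", all = TRUE))     # full join: 5
```

On the real data, the left_join() and inner_join() row counts differ only if some tree has an SPCD that is missing from the reference table, which is exactly what the anti_join() check reports.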
Below are examples of the types of questions you may see. These are practice only; the actual exam will have different questions.
Q1 (Vectors): Extract the heights (HT) of all live loblolly pine trees (SPCD == 131) in the dataset. How many trees are there? What is the mean height? How tall is the tallest loblolly pine in the dataset?
Q2 (Data frames): How many columns does fia_eastern31_recent.csv have? How many rows? List the column names that relate to tree measurements (not plot or location info).
Q3 (Modifying data): Using dplyr, create a new data frame called big_oaks that contains only live trees from the genus "Quercus" with a diameter greater than 24 inches. Include only the columns: TREE_CN, COMMON_NAME, DIA, HT, STATE_ABBR. Sort the result by DIA in descending order.
Q4 (Collapsing data): For each state (STATE_ABBR), calculate the total number of live trees, the number of unique species, and the mean diameter. Which state has the highest species richness?
Q5 (Merging): Suppose someone gives you a separate CSV with state-level climate data (STATE_ABBR, mean_temp, annual_precip). Write the code to join this to a state-level summary of the tree data. What type of join would you use and why?
Q6 (Appending): Write code to split the tree data into softwoods and hardwoods (using SFTWD_HRDWD), then re-combine them with bind_rows(). Verify the total row count matches the original.
Q7 (Anti-join): A colleague gives you a list of 50 species codes they are interested in (as a data frame called target_species with one column SPCD). Write code to find which of their 50 species are NOT present in our FIA data.
- library(dplyr) and library(readr) (or library(data.table)).
- read_csv() or fread() to read the large tree data. Do NOT use read.csv(); it will be extremely slow.
- (STATUSCD == 1) for most questions unless told otherwise.
- help() in RStudio if you forget a function's arguments. For example, help(left_join) or ?arrange.
- na.rm = TRUE: many summary functions (mean, sum, max, etc.) will return NA if any values are missing. Always include na.rm = TRUE.

Good luck! The best preparation is hands-on practice with R and the FIA data.