Name: ____________________
Time: 1 hour 25 minutes | Total: 100 points
Rules:
- help() or ? in the console (e.g., ?left_join, help(filter)).

Data files:
- fia_eastern31_recent.csv: tree-level FIA data (~2.87 million rows, 40 columns)
- REF_SPECIES_trimmed.csv: species reference table (2,677 rows, 9 columns)

fia_eastern31_recent.csv: Tree Data (each row = one tree)

| Variable | Type | Description |
|---|---|---|
| TREE_CN | numeric | Unique tree identifier |
| PLT_CN | numeric | Unique plot identifier |
| STATECD | integer | State FIPS code (e.g., 47 = Tennessee) |
| STATE_ABBR | character | Two-letter state abbreviation (e.g., “TN”, “FL”) |
| SPCD | integer | Species code; key to join with REF_SPECIES |
| DIA | numeric | Diameter at breast height (inches) |
| HT | numeric | Total tree height (feet) |
| STATUSCD | integer | 1 = live tree, 2 = standing dead |
| LAT | numeric | Plot latitude (decimal degrees) |
| LON | numeric | Plot longitude (decimal degrees, negative) |
| BALIVE | numeric | Basal area of live trees on the plot (sq ft/acre) |
| MAJOR_SPGRPCD | integer | 1=Pine, 2=Other softwood, 3=Soft hardwood, 4=Hard hardwood |

REF_SPECIES_trimmed.csv: Species Reference (each row = one species)

| Variable | Type | Description |
|---|---|---|
| SPCD | integer | Species code; key to join with tree data |
| COMMON_NAME | character | Common name (e.g., “red maple”, “loblolly pine”) |
| GENUS | character | Genus name (e.g., “Acer”, “Pinus”) |
| SPECIES | character | Species epithet (e.g., “rubrum”, “taeda”) |
| SPECIES_SYMBOL | character | USDA PLANTS symbol (e.g., “ACRU”) |
| E_SPGRPCD | integer | Eastern species group code |
| MAJOR_SPGRPCD | integer | 1=Pine, 2=Other softwood, 3=Soft hardwood, 4=Hard hardwood |
| SFTWD_HRDWD | character | “S” = softwood, “H” = hardwood |
| WOODLAND | character | “Y” if woodland species |

Getting started: Load your packages and read the data first.
library(readr) # or library(data.table)
library(dplyr)
tree <- read_csv("fia_eastern31_recent.csv", show_col_types = FALSE)
ref <- read_csv("REF_SPECIES_trimmed.csv", show_col_types = FALSE)
(a) (3 pts) Create a numeric vector called
diameters containing these 8 values:
5.2, 12.0, 8.7, 3.1, 15.4, 9.9, 22.6, 7.3
Calculate the mean, maximum, and length of this vector.
Hint: use c() to create vectors;
mean(), max(), length() for
summaries.
(b) (3 pts) Using logical subsetting on
diameters, extract only values greater than
10. How many values meet this condition?
Hint: vector_name[vector_name > value] extracts
elements meeting a condition.
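
To illustrate the hint, here is a minimal sketch on a made-up vector. The name `heights` and its values are assumptions for illustration only and are not exam data. The comparison first builds a TRUE/FALSE vector, and the brackets then keep only the TRUE positions.

```r
# Hypothetical example vector (not part of the exam data)
heights <- c(40, 72, 55, 90, 61)

# Step 1: the comparison returns one TRUE/FALSE value per element
is_tall <- heights > 60

# Step 2: bracket indexing keeps only the elements where the mask is TRUE
tall <- heights[is_tall]

# sum() counts the TRUE values, which equals the number of elements kept
n_tall <- sum(is_tall)
```

The same pattern applies to any numeric vector and threshold; length() on the subset gives the same count as sum() on the mask.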
(c) (4 pts) Create a character vector called
states containing: "TN", "NC",
"FL", "GA", "VA".
"AL".
Hint: use c() with quoted strings; access elements
with [ ] indexing.
(a) (5 pts) After reading
fia_eastern31_recent.csv into an object called
tree, answer:
STATE_ABBR column? What data type
is DIA?
Hint: useful functions include nrow(),
ncol(), names(), str(),
class().
(b) (5 pts) Use summary() on the
DIA column (diameter at breast height, in inches):
NA values? How can you tell from the
summary output?
Hint: summary(tree$DIA) produces a six-number
summary. If NAs exist, they appear as “NA’s: xxx” at the end.
(c) (5 pts) Display the first 10
rows of just these four columns: STATE_ABBR,
SPCD, DIA, HT.
Hint: you can use base R bracket notation
data[rows, columns] with head(), or dplyr’s
select() with head().
(a) (5 pts) First, join the species reference table
to the tree data so that each tree gets its species name. Use
SPCD as the key:
tree <- tree %>% left_join(ref, by = "SPCD")
Now, using filter(), create a new data frame called
tn_live that contains only live trees in
Tennessee. How many rows does tn_live
have?
Variables to use: STATE_ABBR (value:
"TN" for Tennessee), STATUSCD (value:
1 for live trees).
(b) (5 pts) Using mutate(), add two new
columns to tn_live:
- DIA_cm: diameter converted from inches to centimeters (multiply DIA by 2.54)
- BA_sqft: individual tree basal area in square feet, calculated as: \[BA = \frac{\pi}{4} \times \left(\frac{DIA}{12}\right)^2\] (Note: DIA is in inches; dividing by 12 converts to feet. Use pi for π in R.)

Show the first 6 rows of COMMON_NAME, DIA,
DIA_cm, and BA_sqft.
Variables to use: DIA (diameter in inches, from tree
data), COMMON_NAME (species name, from ref after
join).
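
As a quick sanity check on the two conversions, the sketch below applies them to a single made-up diameter. The value 12 inches is an assumption chosen so that DIA / 12 is exactly 1 foot; the variable names are illustrative and not part of the exam data.

```r
# Assumed example diameter in inches (chosen so that dia_in / 12 = 1 foot)
dia_in <- 12

# Inches to centimeters: multiply by 2.54
dia_cm <- dia_in * 2.54

# Basal area in square feet: (pi / 4) * (diameter in feet)^2
ba_sqft <- (pi / 4) * (dia_in / 12)^2
```

For a 12-inch tree the basal area reduces to pi / 4, roughly 0.785 square feet, which is a handy value to compare against when checking a mutate() result.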
(c) (5 pts) Write a single chained
command using the pipe operator (%>%) that:
1. Starts with tn_live
2. Keeps only trees with DIA > 20 inches
3. Selects COMMON_NAME, DIA, and HT
4. Sorts by DIA in descending order

Variables to use: DIA (diameter, inches),
HT (height, feet), COMMON_NAME (species
name).
Hint: arrange(desc(DIA)) sorts largest first;
head(10) or slice_head(n = 10) limits
output.
(d) (5 pts) How many unique species
(unique COMMON_NAME values) are among live trees in
Tennessee?
Hint: you can use distinct(),
n_distinct(), or length(unique()).
(a) (7 pts) Using the full tree data (all 31 states,
live trees only: STATUSCD == 1), calculate
the following summary statistics for each state
(STATE_ABBR):
- n_trees: total number of live trees (hint: n())
- n_species: number of unique species (hint: n_distinct(SPCD))
- mean_DIA: mean diameter (hint: mean(DIA, na.rm = TRUE))

Sort the result by n_species in descending order.
Which state has the most species?
Variables to use: STATUSCD (1 = live),
STATE_ABBR, SPCD, DIA.
(b) (7 pts) For live trees in Tennessee
only (STATE_ABBR == "TN",
STATUSCD == 1), find the top 5 most common
species by tree count. Your output should show:
- COMMON_NAME: species name
- n: number of trees (hint: n())
- mean_DIA: mean diameter in inches (hint: mean(DIA, na.rm = TRUE))
- mean_HT: mean height in feet (hint: mean(HT, na.rm = TRUE))

Sort by count (largest first) and display only the top 5.
Variables to use: STATE_ABBR, STATUSCD,
COMMON_NAME, DIA, HT.
(c) (6 pts) Using live trees from all 31 states
(STATUSCD == 1), calculate the number of live
trees and the mean diameter for each
MAJOR_SPGRPCD group.
For reference, the MAJOR_SPGRPCD codes are:
| Code | Group |
|---|---|
| 1 | Pine |
| 2 | Other softwood |
| 3 | Soft hardwood |
| 4 | Hard hardwood |
Variables to use: STATUSCD,
MAJOR_SPGRPCD, DIA.
(a) (5 pts) Concept question. In
your own words (as code comments), explain the difference between
left_join() and inner_join().
Suppose the tree data has 100 rows and 3 of
those rows have an SPCD value that does not exist
in the reference table (ref):
- How many rows does left_join(tree, ref, by = "SPCD") return?
- How many rows does inner_join(tree, ref, by = "SPCD") return?
- What is in the COMMON_NAME column for those 3 unmatched rows in a left_join?

Write your answers as comments (#) in your
script.
Variables involved: SPCD (the join key in both
files), COMMON_NAME (from ref).
(b) (5 pts) Use anti_join() to find
species codes (SPCD) in the reference
table (ref) that are not present
in the tree data (tree). How many species in the reference
table have zero trees in our eastern US dataset? Why might this be?
(Answer the “why” as a comment.)
Hint: anti_join(A, B, by = "key") returns rows in A
that have no match in B. Direction matters!
Variables to use: SPCD (the join key in both
files).
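
The point about direction can be seen on two tiny made-up tables. The data frames `a` and `b` and their `key` column are hypothetical and chosen only for illustration; this is a sketch of how anti_join() behaves, not a solution to the question.

```r
library(dplyr)

# Hypothetical toy tables (not exam data)
a <- data.frame(key = c(1, 2, 3), label = c("x", "y", "z"))
b <- data.frame(key = c(2, 3, 4))

# Rows of a whose key has no match in b
only_in_a <- anti_join(a, b, by = "key")

# Swapping the arguments answers a different question:
# rows of b whose key has no match in a
only_in_b <- anti_join(b, a, by = "key")
```

Here only_in_a holds the row with key 1 and only_in_b holds the row with key 4, so the first argument decides which table's rows are returned.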
(c) (5 pts) Create two separate data frames:
- fl_trees: live trees in Florida (STATE_ABBR == "FL", STATUSCD == 1)
- ga_trees: live trees in Georgia (STATE_ABBR == "GA", STATUSCD == 1)

Use bind_rows() to combine (append) them into a single
data frame called fl_ga. Verify that the number of rows in
fl_ga equals
nrow(fl_trees) + nrow(ga_trees).
Variables to use: STATE_ABBR,
STATUSCD.
Hint: bind_rows() stacks data frames on top of each
other (adds rows). This is “appending.”
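
A minimal sketch of appending, using two made-up data frames. The names `plot_a`, `plot_b`, and their columns are assumptions for illustration and are not exam data.

```r
library(dplyr)

# Hypothetical toy data frames with the same columns (not exam data)
plot_a <- data.frame(id = c(1, 2), dia = c(5.0, 7.5))
plot_b <- data.frame(id = c(3, 4, 5), dia = c(9.1, 4.2, 6.6))

# bind_rows() stacks the second data frame underneath the first
stacked <- bind_rows(plot_a, plot_b)

# The row counts add up; the column count is unchanged
rows_match <- nrow(stacked) == nrow(plot_a) + nrow(plot_b)
```

Comparing row counts before and after, as in the last line, is a simple way to confirm that no rows were dropped or duplicated.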
(d) (5 pts) Suppose you have the following small data frame of state-level information:
state_info <- data.frame(
STATE_ABBR = c("TN", "NC", "FL", "GA", "VA"),
region = c("Mid-South", "Southeast", "Southeast", "Southeast", "Mid-Atlantic")
)
Write code to:
1. Count the live trees (STATUSCD == 1) per state. (Hint: use count() or group_by() + summarise(n = n()))
2. Join state_info onto that summary using STATE_ABBR as the key.

Answer in a comment: Which type of join should you
use so that all 31 states remain in the result, even
those not listed in state_info? What will the
region column show for states like “ME” or “WI” that are
not in state_info?
Variables to use: STATE_ABBR (the join key),
STATUSCD.
This is more challenging — take your time and build it step by step.
A colleague asks: “Among the 10 most common tree species in the eastern US, which species tends to grow at the highest latitudes and which at the lowest?”
Write a complete pipeline that:
1. Starts with the tree data (already joined with ref).
2. Keeps only live trees (STATUSCD == 1).
3. Finds the 10 most common species, using COMMON_NAME to identify species. (Hint: count(COMMON_NAME, sort = TRUE) then slice_head(n = 10), then pull(COMMON_NAME) to extract as a vector.)
4. Filters the live trees to those 10 species (COMMON_NAME %in% top10).
5. For each species (group_by(COMMON_NAME)), calculates:
   - n_trees: total number of trees, n()
   - mean_lat: mean latitude, mean(LAT, na.rm = TRUE)
   - mean_DIA: mean diameter, mean(DIA, na.rm = TRUE)
   - wood_type: softwood or hardwood, first(SFTWD_HRDWD)
6. Sorts by mean_lat in descending order (highest latitude first).

Then, answer in a comment: Which species has the highest mean latitude? Which has the lowest? Is there a pattern between softwood/hardwood and latitude?
Variables to use: STATUSCD (1 = live),
COMMON_NAME (species name), LAT (latitude in
decimal degrees), DIA (diameter in inches),
SFTWD_HRDWD (“S” = softwood, “H” = hardwood).
— END OF EXAM —
Save your file and make sure your name is at the top. Good luck!