Workshop Overview

Who Is This Workshop For?

This workshop is designed for entomology students who are new to R and may need a refresher on ecological concepts related to biodiversity data analysis. No prior R experience is required!

By the end of this workshop, you will be able to:

  • Import and clean insect survey data
  • Decide which insect groups have β€œinteresting stories” worth analyzing
  • Calculate and interpret diversity metrics
  • Visualize community patterns
  • Test whether habitats differ in their insect communities
  • Interpret results in an ecological context

The Ecological Question Behind Everything

When we set pitfall traps across different habitats, we’re fundamentally asking:

β€œDo different environments support different insect communities, and why?”

This question connects to foundational ecological theories:

Key Ecological Concepts:

  • Niche Theory: Each species has specific environmental requirements (temperature, moisture, food). Species occur where conditions match their niche.

  • Habitat Filtering: The environment β€œfilters” which species can survive. Forest conditions filter out sun-loving species; agricultural disturbance filters out sensitive specialists.

  • Species-Area Relationship: Larger, more complex habitats typically support more species.

  • Indicator Species: Some species are sensitive to environmental conditions and can indicate habitat quality or disturbance level.

Our statistical analyses are tools to detect these ecological patterns and test whether the differences we observe are real or just due to chance.

The Complete Analysis Roadmap

Before writing any code, understand the journey from raw data to ecological insights:

╔═══════════════════════════════════════════════════════════════════════════════╗
β•‘                    COMPLETE ANALYSIS WORKFLOW                                  β•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•

                              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                              β”‚   RAW DATA      β”‚
                              β”‚  (Your pitfall  β”‚
                              β”‚   trap counts)  β”‚
                              β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                       β”‚
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚                              β–Ό                              β”‚
        β”‚  ╔═══════════════════════════════════════════════════════╗  β”‚
        β”‚  β•‘  PHASE 1: DATA PREPARATION (Day 1, Sessions 1-3)      β•‘  β”‚
        β”‚  ╠═══════════════════════════════════════════════════════╣  β”‚
        β”‚  β•‘  β€’ Import data into R                                 β•‘  β”‚
        β”‚  β•‘  β€’ Check for errors (typos, impossible values)        β•‘  β”‚
        β”‚  β•‘  β€’ Standardize names (taxonomy is messy!)             β•‘  β”‚
        β”‚  β•‘  β€’ Create community matrix (sites Γ— species)          β•‘  β”‚
        β”‚  β•‘                                                       β•‘  β”‚
        β”‚  β•‘  KEY QUESTION: "Is my data clean and trustworthy?"    β•‘  β”‚
        β”‚  β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•  β”‚
        β”‚                              β”‚                              β”‚
        β”‚                              β–Ό                              β”‚
        β”‚  ╔═══════════════════════════════════════════════════════╗  β”‚
        β”‚  β•‘  PHASE 2: DATA EXPLORATION (Day 1, Session 4)         β•‘  β”‚
        β”‚  ╠═══════════════════════════════════════════════════════╣  β”‚
        β”‚  β•‘  β€’ Summarize: How many species? How many individuals? β•‘  β”‚
        β”‚  β•‘  β€’ Explore: Which insect ORDERS are abundant?         β•‘  β”‚
        β”‚  β•‘  β€’ Decide: Which groups have "interesting stories"?   β•‘  β”‚
        β”‚  β•‘  β€’ Visualize: What patterns can we SEE?               β•‘  β”‚
        β”‚  β•‘                                                       β•‘  β”‚
        β”‚  β•‘  KEY QUESTION: "Which insect groups should I focus    β•‘  β”‚
        β”‚  β•‘                 my analysis on?"                      β•‘  β”‚
        β”‚  β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•  β”‚
        β”‚                              β”‚                              β”‚
   D    β”‚                              β–Ό                              β”‚
   A    β”‚  ╔═══════════════════════════════════════════════════════╗  β”‚
   Y    β”‚  β•‘  PHASE 3: DIVERSITY ANALYSIS (Day 2, Session 5)       β•‘  β”‚
        β”‚  ╠═══════════════════════════════════════════════════════╣  β”‚
   1    β”‚  β•‘  ALPHA DIVERSITY (within-site)                        β•‘  β”‚
        β”‚  β•‘  β€’ How many species at each site? (Richness)          β•‘  β”‚
   |    β”‚  β•‘  β€’ How diverse is each site? (Shannon, Simpson)       β•‘  β”‚
        β”‚  β•‘  β€’ Are differences due to sampling effort?            β•‘  β”‚
   D    β”‚  β•‘    (Rarefaction)                                      β•‘  β”‚
   A    β”‚  β•‘                                                       β•‘  β”‚
   Y    β”‚  β•‘  KEY QUESTION: "Which habitats are most diverse?"     β•‘  β”‚
        β”‚  β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•  β”‚
   2    β”‚                              β”‚                              β”‚
        β”‚                              β–Ό                              β”‚
        β”‚  ╔═══════════════════════════════════════════════════════╗  β”‚
        β”‚  β•‘  PHASE 4: COMMUNITY ANALYSIS (Day 2, Sessions 6-7)    β•‘  β”‚
        β”‚  ╠═══════════════════════════════════════════════════════╣  β”‚
        β”‚  β•‘  BETA DIVERSITY (between-sites)                       β•‘  β”‚
        β”‚  β•‘  β€’ How different are communities? (Distance matrix)   β•‘  β”‚
        β”‚  β•‘  β€’ Can we VISUALIZE the pattern? (NMDS/PCoA)          β•‘  β”‚
        β”‚  β•‘  β€’ Is the pattern SIGNIFICANT? (PERMANOVA)            β•‘  β”‚
        β”‚  β•‘  β€’ WHICH species drive differences? (SIMPER)          β•‘  β”‚
        β”‚  β•‘                                                       β•‘  β”‚
        β”‚  β•‘  KEY QUESTION: "Do habitats have different            β•‘  β”‚
        β”‚  β•‘                 communities? Which species matter?"   β•‘  β”‚
        β”‚  β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•  β”‚
        β”‚                              β”‚                              β”‚
        β”‚                              β–Ό                              β”‚
        β”‚  ╔═══════════════════════════════════════════════════════╗  β”‚
        β”‚  β•‘  PHASE 5: INTERPRETATION (Day 2, Session 8)           β•‘  β”‚
        β”‚  ╠═══════════════════════════════════════════════════════╣  β”‚
        β”‚  β•‘  β€’ What do the numbers MEAN ecologically?             β•‘  β”‚
        β”‚  β•‘  β€’ What MECHANISMS might explain the patterns?        β•‘  β”‚
        β”‚  β•‘  β€’ What are the LIMITATIONS?                          β•‘  β”‚
        β”‚  β•‘  β€’ What NEW QUESTIONS arise?                          β•‘  β”‚
        β”‚  β•‘                                                       β•‘  β”‚
        β”‚  β•‘  KEY QUESTION: "So what? Why does this matter?"       β•‘  β”‚
        β”‚  β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•  β”‚
        β”‚                              β”‚                              β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                       β–Ό
                              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                              β”‚  PUBLICATION /  β”‚
                              β”‚  THESIS CHAPTER β”‚
                              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Workshop Schedule

╔══════════════════════════════════════════════════════════════════════════════╗
β•‘  DAY 1: DATA FOUNDATIONS & EXPLORATION           9:00 - 15:00                β•‘
╠══════════════════════════════════════════════════════════════════════════════╣
β•‘  09:00 - 10:15  Session 1: R Basics & Importing Data               75 min    β•‘
β•‘  10:15 - 10:30  β˜• Break                                            15 min    β•‘
β•‘  10:30 - 11:30  Session 2: Data Wrangling with tidyverse           60 min    β•‘
β•‘  11:30 - 12:30  🍽️ Lunch                                            60 min   β•‘
β•‘  12:30 - 13:45  Session 3: Data Cleaning & Community Matrices      75 min    β•‘
β•‘  13:45 - 14:00  β˜• Break                                            15 min    β•‘
β•‘  14:00 - 15:00  Session 4: Choosing Your Focal Taxa                60 min    β•‘
╠══════════════════════════════════════════════════════════════════════════════╣
β•‘  DAY 2: ANALYSIS & INTERPRETATION                9:00 - 15:00                β•‘
╠══════════════════════════════════════════════════════════════════════════════╣
β•‘  09:00 - 10:00  Session 5: Diversity Metrics & Visualization       60 min    β•‘
β•‘  10:00 - 10:15  β˜• Break                                            15 min    β•‘
β•‘  10:15 - 11:15  Session 6: Ordination - Seeing Community Patterns  60 min    β•‘
β•‘  11:15 - 12:15  🍽️ Lunch                                            60 min   β•‘
β•‘  12:15 - 13:30  Session 7: Statistical Testing                     75 min    β•‘
β•‘  13:30 - 13:45  β˜• Break                                            15 min    β•‘
β•‘  13:45 - 14:45  Session 8: Interpretation & Synthesis              60 min    β•‘
β•‘  14:45 - 15:00  Wrap-up & Next Steps                               15 min    β•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•

Our Dataset

We’ll use sample_insect_data.csv - a pitfall trap survey of ground-dwelling arthropods.

About Pitfall Traps:

Pitfall traps are cups buried flush with the ground surface. Ground-active arthropods fall in and are collected. This method is:

  • Good for: Ground beetles (Carabidae), ants (Formicidae), spiders (Araneae), rove beetles (Staphylinidae)
  • Poor for: Flying insects, canopy-dwellers, sedentary species
  • Biased by: Trap activity (more active = more caught), body size, weather

Always consider sampling bias when interpreting results!

Column Description
site Site identifier (S01-S12)
habitat Habitat type (forest, grassland, agriculture)
landscape Landscape complexity (complex, simple)
order Insect order (Coleoptera, Hymenoptera, etc.)
family Insect family
morphospecies Species identifier
abundance Number of individuals

DAY 1: DATA FOUNDATIONS & EXPLORATION


1 Session 1: R Basics & Importing Data

Time: 09:00 - 10:15 (75 minutes)

1.1 Why R for Ecology?

Why do we use R instead of Excel?

  1. Reproducibility: Your analysis is saved as a script. Anyone (including future you) can re-run it exactly.

  2. Ecological packages: R has specialized packages like vegan designed specifically for community ecology.

  3. Handling complexity: Ecological data often has hundreds of species and complex designs. R handles this easily.

  4. Publication-quality figures: R creates beautiful, customizable plots ready for journals.

1.2 Loading Packages

# Packages are collections of functions that extend R's capabilities
# We load them at the start of EVERY session

library(tidyverse)    # Data wrangling and visualization
library(vegan)        # Community ecology analyses (diversity, ordination)
library(ape)          # Phylogenetic tools and PCoA ordination

# If you get "there is no package called 'x'", install it first:
# install.packages("x")

Why these packages?

  • tidyverse: Makes data manipulation intuitive with β€œverbs” like filter, select, mutate
  • vegan: The gold standard for community ecology - diversity indices, NMDS, PERMANOVA
  • ape: Provides PCoA ordination and phylogenetic analysis tools

1.3 Basic R Operations

# R as a calculator
2 + 3         # Addition
10 / 2        # Division
5^2           # Power (5 squared = 25)
sqrt(16)      # Square root

# Creating objects with <- (assignment operator)
species_count <- 42           # Store a number
site_name <- "Forest_A"       # Store text

# Vectors: collections of values (fundamental to R!)
abundance <- c(12, 45, 23, 8, 67, 34)
species <- c("Carabidae_sp1", "Carabidae_sp2", "Staphylinidae_sp1")

# Basic operations on vectors
length(abundance)        # How many elements? β†’ 6
sum(abundance)           # Total β†’ 189
mean(abundance)          # Average β†’ 31.5
max(abundance)           # Maximum β†’ 67
min(abundance)           # Minimum β†’ 8

# Extracting elements (indexing)
abundance[1]             # First element β†’ 12
abundance[c(1, 3, 5)]    # Elements 1, 3, and 5
abundance[abundance > 30] # Only values > 30 β†’ 45, 67, 34

1.4 Importing Data

# Import our main dataset
# Make sure the file is in your working directory or provide full path
insect_data <- read.csv("data/raw/sample_insect_data.csv")

# More robust import with options for common issues:
insect_data <- read.csv(
  "data/raw/sample_insect_data.csv",
  header = TRUE,                    # First row contains column names
  stringsAsFactors = FALSE,         # Keep text as text (not factors)
  na.strings = c("", "NA", "N/A", "-")  # Recognize these as missing data
)

1.5 Exploring Your Data

Why explore data immediately?

Before any analysis, you need to know:

  • Did the data import correctly?
  • Are there unexpected values or errors?
  • What is the structure of your data?

Never trust data blindly - always look at it first!

# First look at your data
head(insect_data)         # First 6 rows
tail(insect_data)         # Last 6 rows
View(insect_data)         # Spreadsheet view (capital V!)

# Structure and dimensions
str(insect_data)          # Column types and preview
dim(insect_data)          # Rows Γ— columns
nrow(insect_data)         # Number of rows
ncol(insect_data)         # Number of columns
names(insect_data)        # Column names

# Summary statistics
summary(insect_data)      # Statistical summary of each column

1.6 Checking for Problems

# Missing values (NA = "Not Available")
sum(is.na(insect_data))           # Total NAs in entire dataset
colSums(is.na(insect_data))       # NAs per column

# Check categorical columns for typos
unique(insect_data$habitat)       # What habitat values exist?
unique(insect_data$order)         # What insect orders are present?

# Frequency tables
table(insect_data$habitat)        # How many records per habitat?
table(insect_data$order)          # How many records per order?

# Check numeric ranges for impossible values
range(insect_data$abundance)      # Min and max abundance
summary(insect_data$abundance)    # Full summary

# Look for suspicious values
insect_data[insect_data$abundance < 0, ]  # Negative abundances?

β˜• BREAK: 10:15 - 10:30


2 Session 2: Data Wrangling with tidyverse

Time: 10:30 - 11:30 (60 minutes)

2.1 The Pipe Operator: Your Best Friend

Why use pipes (%>%)?

Pipes make code readable by turning nested operations into a step-by-step recipe. Read %>% as β€œTHEN”.

Instead of: β€œBake(Mix(Measure(flour, sugar), eggs))”

Write: β€œTake flour and sugar, THEN measure, THEN mix with eggs, THEN bake”

library(tidyverse)

# Without pipe - confusing nested functions:
round(mean(sqrt(c(1, 4, 9, 16))), 2)

# With pipe - clear step-by-step:
c(1, 4, 9, 16) %>%    # Start with these numbers
  sqrt() %>%           # THEN take square root
  mean() %>%           # THEN calculate mean
  round(2)             # THEN round to 2 decimals

# Keyboard shortcut: Ctrl + Shift + M (Windows) or Cmd + Shift + M (Mac)

2.2 The Core dplyr Verbs

Think of these as actions you perform on your data:

Verb Action Example Use
filter() Keep rows that match a condition Keep only forest sites
select() Keep or remove columns Keep only site, species, abundance
mutate() Create new columns Calculate log-transformed abundance
arrange() Sort rows Sort by abundance (high to low)
summarise() Collapse to summary statistics Calculate mean abundance
group_by() Group data for operations Calculate mean BY habitat

2.2.1 filter(): Keep Rows Matching Conditions

# Keep only forest sites
insect_data %>%
  filter(habitat == "forest")

# Multiple conditions with AND (comma or &)
insect_data %>%
  filter(habitat == "forest", abundance > 10)

# Multiple options with OR (|) or %in%
insect_data %>%
  filter(habitat %in% c("forest", "grassland"))

# Exclude with !=
insect_data %>%
  filter(habitat != "agriculture")

2.2.2 select(): Choose Columns

# Keep specific columns
insect_data %>%
  select(site, habitat, morphospecies, abundance)

# Remove columns with minus sign
insect_data %>%
  select(-family)

# Select helpers for patterns
insect_data %>%
  select(starts_with("site"))     # Columns starting with "site"

insect_data %>%
  select(contains("species"))     # Columns containing "species"

2.2.3 mutate(): Create New Columns

insect_data %>%
  mutate(
    # Log-transform abundance (common in ecology due to right-skewed distributions)
    log_abundance = log(abundance + 1),  # +1 because log(0) is undefined
    
    # Create combined categories
    habitat_landscape = paste(habitat, landscape, sep = "_")
  )

2.2.4 group_by() + summarise(): The Power Combination

Why is this so powerful?

In ecology, we almost always want to compare GROUPS (habitats, treatments, sites). group_by() tells R to do calculations separately for each group.

# Calculate summary statistics BY habitat
insect_data %>%
  group_by(habitat) %>%
  summarise(
    total_abundance = sum(abundance),
    n_species = n_distinct(morphospecies),
    n_sites = n_distinct(site),
    mean_abundance = mean(abundance),
    .groups = "drop"  # Clean up grouping after summarise
  )

# Multiple grouping variables
insect_data %>%
  group_by(habitat, order) %>%
  summarise(
    total_abundance = sum(abundance),
    n_species = n_distinct(morphospecies),
    .groups = "drop"
  )

2.3 Reshaping Data: Wide vs Long

Ecological Data Formats:

  • Long format: One row per observation (site-species-abundance). Good for: data entry, plotting, dplyr operations.

  • Wide format (Community Matrix): Rows = sites, Columns = species. Required for: vegan package, ordination, diversity indices.

Most raw data is long; most analyses need wide. You’ll convert between them constantly!

# LONG TO WIDE: Create a community matrix
species_wide <- insect_data %>%
  group_by(site, morphospecies) %>%
  summarise(abundance = sum(abundance), .groups = "drop") %>%
  pivot_wider(
    names_from = morphospecies,    # Species become columns
    values_from = abundance,        # Cell values are abundances
    values_fill = 0                 # Missing = 0 (species absent)
  )

head(species_wide)

# WIDE TO LONG: For plotting
species_long <- species_wide %>%
  pivot_longer(
    cols = -site,             # All columns except site
    names_to = "species",     # Column names go to "species"
    values_to = "abundance"   # Values go to "abundance"
  )

🍽️ LUNCH: 11:30 - 12:30


3 Session 3: Data Cleaning & Community Matrices

Time: 12:30 - 13:45 (75 minutes)

3.1 Why Data Cleaning Matters

β€œGarbage In, Garbage Out”

Taxonomic data is notoriously messy. Common problems:

  • Inconsistent spelling: β€œCarabidae sp1” vs β€œCarabidae_sp1” vs β€œcarabidae sp. 1”
  • Case sensitivity: β€œForest” vs β€œforest” are treated as different!
  • Trailing spaces: β€œforest” (with space) β‰  β€œforest”
  • Impossible values: Negative abundances, abundances of 999999

Your cleaning decisions affect your conclusions. Document what you did!

# Check for problems first
unique(insect_data$habitat)  # Look for typos, case issues

# Common cleaning operations
clean_data <- insect_data %>%
  mutate(
    # Standardize text: lowercase and remove extra spaces
    habitat = tolower(trimws(habitat)),
    order = tolower(trimws(order)),
    
    # Fix specific typos (if any exist)
    habitat = case_when(
      habitat == "forrest" ~ "forest",    # Common typo
      habitat == "grasland" ~ "grassland",
      TRUE ~ habitat                       # Keep all others as-is
    )
  ) %>%
  # Remove impossible or missing values
  filter(
    abundance >= 0,        # No negative abundances
    !is.na(abundance),     # No missing abundances
    !is.na(morphospecies)  # No missing species IDs
  )

# Verify cleaning worked
unique(clean_data$habitat)

3.2 Creating the Community Matrix

What is a Community Matrix?

The community matrix is the fundamental data structure in community ecology:

              Species_A  Species_B  Species_C  Species_D
Site_1           12          5          0          8
Site_2            0         23         15          2
Site_3            7          0          0         45
  • Rows = Sampling units (sites, traps, plots)
  • Columns = Taxa (species, morphospecies)
  • Values = Abundance (or presence/absence: 0/1)

This matrix is what ordination and diversity analyses use!

# For this workshop, we'll focus on beetles (Coleoptera)
# We'll learn HOW to choose focal taxa in Session 4

comm_matrix <- insect_data %>%
  filter(order == "Coleoptera") %>%                    # Focus on beetles
  group_by(site, morphospecies) %>%
  summarise(abundance = sum(abundance), .groups = "drop") %>%
  pivot_wider(
    names_from = morphospecies,
    values_from = abundance,
    values_fill = 0    # CRITICAL: Missing species = 0, not NA
  ) %>%
  column_to_rownames("site")  # Site names become row names

# Check the result
dim(comm_matrix)  # Sites Γ— Species
head(comm_matrix[, 1:5])  # First 5 species

Why values_fill = 0?

In ecology, if we didn’t record a species at a site, we assume it was absent (abundance = 0), not β€œunknown” (NA).

This is an important assumption! It assumes: - Equal sampling effort across sites - Species detection is reliable

If sampling effort varied, you should use rarefaction (Session 5).

3.3 Creating Environmental Data

# Environmental data: one row per site with habitat info
env_data <- insect_data %>%
  select(site, habitat, landscape) %>%
  distinct() %>%                           # One row per site
  column_to_rownames("site")               # Site names as row names

# CRITICAL: Ensure rows match between matrices!
env_data <- env_data[rownames(comm_matrix), , drop = FALSE]
# rownames(comm_matrix): Extracts the unique IDs (like "S01", "S02") from your insect data.
# env_data[..., ]: Selects and reorders rows in the environmental table based on those IDs
# , (The blank space):  Tells R to keep all columns (e.g., habitat, elevation, distance).
# drop = FALSE --   A "defensive" command. It ensures that even if you only have one environmental variable, R keeps the data as a data frame instead of turning it into a simple list (vector)..


# Verify alignment (MUST be TRUE!)
all(rownames(comm_matrix) == rownames(env_data))

β˜• BREAK: 13:45 - 14:00


4 Session 4: Choosing Your Focal Taxa

Time: 14:00 - 15:00 (60 minutes)

The Central Question of This Session:

Your pitfall traps caught multiple insect orders (Coleoptera, Hymenoptera, Araneae, etc.). Which groups should you analyze in detail?

Not all groups are equal! Some have ecological stories to tell; others are just noise.

4.1 The Philosophy of Focal Taxa Selection

Why Not Analyze Everything?

  1. Statistical power: Groups with few individuals give unreliable results
  2. Ecological relevance: Not all groups respond to habitat differences
  3. Sampling bias: Pitfall traps don’t catch all groups equally
  4. Focused story: Papers are clearer when they focus on specific groups

Good focal taxa have:

  • Sufficient abundance (β‰₯50 individuals total)
  • Multiple species (β‰₯5 morphospecies)
  • Presence across most sites
  • Interesting variation between habitats
  • Ecological relevance to your question

4.2 Decision Framework for Choosing Focal Taxa

╔═══════════════════════════════════════════════════════════════════════════════╗
β•‘               DECISION FLOWCHART: CHOOSING FOCAL INSECT GROUPS                β•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•

For each insect ORDER in your data:
            β”‚
            β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚  CRITERION 1: ABUNDANCE       β”‚
    β”‚  Total individuals β‰₯ 50?      β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚
         NO ◄───────┴───────► YES
          β”‚                    β”‚
          β–Ό                    β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  EXCLUDE        β”‚   β”‚  CRITERION 2: DIVERSITY       β”‚
β”‚  Too few data   β”‚   β”‚  Number of morphospecies β‰₯ 5? β”‚
β”‚  for statistics β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                   β”‚
                           NO ◄───────┴───────► YES
                            β”‚                    β”‚
                            β–Ό                    β–Ό
                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                β”‚  POSSIBLE but   β”‚   β”‚  CRITERION 3: OCCURRENCE      β”‚
                β”‚  Limited        β”‚   β”‚  Present in β‰₯ 50% of sites?   β”‚
                β”‚  (can only      β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                β”‚  analyze abund- β”‚                   β”‚
                β”‚  ance, not      β”‚        NO ◄───────┴───────► YES
                β”‚  diversity)     β”‚         β”‚                    β”‚
                β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β–Ό                    β–Ό
                               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                               β”‚  CAUTION        β”‚   β”‚  CRITERION 4: VARIATION       β”‚
                               β”‚  May be habitat β”‚   β”‚  Shows differences between    β”‚
                               β”‚  specialist or  β”‚   β”‚  habitats? (CV > 30%)         β”‚
                               β”‚  rare           β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                               β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                   β”‚
                                                          NO ◄───────┴───────► YES
                                                          β”‚                    β”‚
                                                          β–Ό                    β–Ό
                                              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                              β”‚  POSSIBLE       β”‚   β”‚  EXCELLENT      β”‚
                                              β”‚  May be         β”‚   β”‚  FOCAL TAXON!   β”‚
                                              β”‚  generalist     β”‚   β”‚                 β”‚
                                              β”‚  (still         β”‚   β”‚  High priority  β”‚
                                              β”‚  interesting!)  β”‚   β”‚  for analysis   β”‚
                                              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

4.3 Implementing the Decision Framework

4.3.1 Step 1: Overall Summary

# What do we have?
cat("=== DATASET OVERVIEW ===\n")
cat("Total records:", nrow(insect_data), "\n")
cat("Total individuals:", sum(insect_data$abundance), "\n")
cat("Number of sites:", n_distinct(insect_data$site), "\n")
cat("Number of habitats:", n_distinct(insect_data$habitat), "\n")
cat("Number of orders:", n_distinct(insect_data$order), "\n")
cat("Number of morphospecies:", n_distinct(insect_data$morphospecies), "\n")

# How are individuals distributed across orders?
insect_data %>%
  group_by(order) %>%
  summarise(
    total_abundance = sum(abundance),
    percent_of_total = round(sum(abundance) / sum(insect_data$abundance) * 100, 1)
  ) %>%
  arrange(desc(total_abundance))

4.3.2 Step 2: Evaluate Each Order

# Calculate all criteria for each order
order_evaluation <- insect_data %>%
  group_by(order) %>%
  summarise(
    # Criterion 1: Abundance
    total_abundance = sum(abundance),
    
    # Criterion 2: Diversity
    n_morphospecies = n_distinct(morphospecies),
    
    # Criterion 3: Occurrence
    n_sites_present = n_distinct(site[abundance > 0]),
    pct_sites = round(n_distinct(site[abundance > 0]) / n_distinct(insect_data$site) * 100, 0),
    
    .groups = "drop"
  ) %>%
  arrange(desc(total_abundance))

print(order_evaluation)

# Visualize: Which orders are most abundant?
ggplot(order_evaluation, aes(x = reorder(order, total_abundance), 
                              y = total_abundance)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = paste0("n=", n_morphospecies, " spp")), 
            hjust = -0.1, size = 3) +
  coord_flip() +
  labs(
    x = "Order",
    y = "Total Abundance",
    title = "Which Insect Orders Dominate Your Samples?",
    subtitle = "Numbers show morphospecies count"
  ) +
  theme_bw()

4.3.3 Step 3: Calculate Habitat Variation

# For Criterion 4: Do abundances vary between habitats?
habitat_variation <- insect_data %>%
  group_by(order, habitat) %>%
  summarise(abundance = sum(abundance), .groups = "drop") %>%
  group_by(order) %>%
  summarise(
    # Coefficient of Variation: higher = more variation between habitats
    cv_abundance = round(sd(abundance) / mean(abundance) * 100, 1),
    # Which habitat has most individuals?
    dominant_habitat = habitat[which.max(abundance)],
    # How much more in best vs worst habitat?
    max_min_ratio = round(max(abundance) / (min(abundance) + 1), 1),
    .groups = "drop"
  )

print(habitat_variation)

Interpreting Coefficient of Variation (CV):

  • CV < 30%: Low variation - this group is similar across habitats (generalist)
  • CV 30-50%: Moderate variation - some habitat preference
  • CV > 50%: High variation - strong habitat specialist

Both patterns are ecologically interesting! Generalists tell us habitats are functionally similar for that group; specialists show habitat filtering.

4.3.4 Step 4: Score and Rank Orders

# Combine criteria and score each order
focal_taxa_scores <- order_evaluation %>%
  left_join(habitat_variation, by = "order") %>%
  mutate(
    # Score each criterion (0-3 points each)
    score_abundance = case_when(
      total_abundance >= 100 ~ 3,
      total_abundance >= 50 ~ 2,
      total_abundance >= 30 ~ 1,
      TRUE ~ 0
    ),
    score_diversity = case_when(
      n_morphospecies >= 10 ~ 3,
      n_morphospecies >= 5 ~ 2,
      n_morphospecies >= 3 ~ 1,
      TRUE ~ 0
    ),
    score_occurrence = case_when(
      pct_sites >= 80 ~ 3,
      pct_sites >= 50 ~ 2,
      pct_sites >= 30 ~ 1,
      TRUE ~ 0
    ),
    score_variation = case_when(
      cv_abundance >= 50 ~ 3,
      cv_abundance >= 30 ~ 2,
      cv_abundance >= 15 ~ 1,
      TRUE ~ 0
    ),
    
    # Total score (max = 12)
    total_score = score_abundance + score_diversity + score_occurrence + score_variation,
    
    # Final recommendation
    recommendation = case_when(
      total_score >= 10 ~ "β˜…β˜…β˜… Highly recommended",
      total_score >= 7 ~ "β˜…β˜… Recommended",
      total_score >= 4 ~ "β˜… Possible with caution",
      TRUE ~ "βœ— Not recommended"
    )
  ) %>%
  arrange(desc(total_score))

# Display decision table
cat("\n========================================\n")
cat("   FOCAL TAXA RECOMMENDATION TABLE\n")
cat("========================================\n\n")

focal_taxa_scores %>%
  select(order, total_abundance, n_morphospecies, pct_sites, 
         cv_abundance, total_score, recommendation) %>%
  print(n = Inf)

4.3.5 Step 5: Visualize the Decision

# Create visual summary of recommendations
ggplot(focal_taxa_scores, aes(x = reorder(order, total_score), 
                               y = total_score,
                               fill = recommendation)) +
  geom_col() +
  geom_hline(yintercept = c(4, 7, 10), linetype = "dashed", alpha = 0.5) +
  coord_flip() +
  scale_fill_manual(values = c(
    "β˜…β˜…β˜… Highly recommended" = "#27ae60",
    "β˜…β˜… Recommended" = "#3498db",
    "β˜… Possible with caution" = "#f39c12",
    "βœ— Not recommended" = "#95a5a6"
  )) +
  labs(
    x = "Order",
    y = "Suitability Score (max = 12)",
    title = "Which Insect Groups Should You Analyze?",
    subtitle = "Based on abundance, diversity, occurrence, and habitat variation",
    fill = "Recommendation"
  ) +
  theme_bw() +
  theme(legend.position = "bottom")

4.4 Ecological Relevance of Common Pitfall Groups

Common Pitfall Trap Groups and Their Ecological Value:

Group Pitfall Suitability Ecological Role Indicator Value
Carabidae (ground beetles) β˜…β˜…β˜… Excellent Predators, seed eaters High - respond to disturbance, moisture
Formicidae (ants) β˜…β˜…β˜… Excellent Ecosystem engineers High - sensitive to temperature, disturbance
Araneae (spiders) β˜…β˜… Good Predators High - indicate prey availability
Staphylinidae (rove beetles) β˜…β˜… Good Predators, decomposers Moderate - very diverse
Scarabaeidae (dung beetles) β˜…β˜… Good Decomposers High - depend on mammal presence
Orthoptera (grasshoppers) β˜… Moderate Herbivores Moderate - vegetation-dependent
Diptera (flies) βœ— Poor Various Low - mostly fly away from traps
Lepidoptera (moths) βœ— Poor Herbivores, pollinators Low - rarely caught in pitfalls

4.5 Preview: Patterns in Your Chosen Group

# Once you've chosen a focal group (e.g., Coleoptera), preview the patterns

# Create community matrix for chosen group
focal_matrix <- insect_data %>%
  filter(order == "Coleoptera") %>%  # Replace with your chosen order
  group_by(site, morphospecies) %>%
  summarise(abundance = sum(abundance), .groups = "drop") %>%
  pivot_wider(names_from = morphospecies, values_from = abundance, values_fill = 0) %>%
  column_to_rownames("site")

# Get environmental data
focal_env <- insect_data %>%
  select(site, habitat, landscape) %>%
  distinct() %>%
  column_to_rownames("site")
focal_env <- focal_env[rownames(focal_matrix), ]

# Quick NMDS preview (we'll learn this properly tomorrow!)
library(vegan)
set.seed(123)
nmds_preview <- metaMDS(focal_matrix, distance = "bray", k = 2, trymax = 50)

# Plot preview
nmds_scores <- as.data.frame(scores(nmds_preview, display = "sites"))
nmds_scores$habitat <- focal_env$habitat

ggplot(nmds_scores, aes(x = NMDS1, y = NMDS2, color = habitat)) +
  geom_point(size = 4) +
  stat_ellipse(linetype = 2) +
  labs(
    title = "Preview: Do Beetle Communities Differ by Habitat?",
    subtitle = paste("NMDS stress =", round(nmds_preview$stress, 3),
                     "- We'll interpret this properly tomorrow!"),
    color = "Habitat"
  ) +
  scale_color_brewer(palette = "Set1") +
  theme_bw() +
  coord_equal()

End of Day 1 Checkpoint:

Before tomorrow, you should have:

  1. βœ… Imported and cleaned your data
  2. βœ… Explored which insect orders are present
  3. βœ… Applied the decision framework to choose focal taxa
  4. βœ… Created a community matrix for your focal group
  5. βœ… Documented your focal taxa choice and rationale

Objects we’ll use tomorrow:

  • insect_data - The raw data
  • comm_matrix - Community matrix for your focal group
  • env_data - Environmental data matching the community matrix

END OF DAY 1


Happy coding and happy bug hunting! πŸ›