Workshop Overview

Who Is This Workshop For?

This workshop is designed for entomology students who are new to R and may need a refresher on ecological concepts related to biodiversity data analysis. No prior R experience is required!

By the end of this workshop, you will be able to:

Import and clean insect survey data
Decide which insect groups have “interesting stories” worth analyzing
Calculate and interpret diversity metrics
Visualize community patterns
Test whether habitats differ in their insect communities
Interpret results in an ecological context

The Ecological Question Behind Everything

When we set pitfall traps across different habitats, we’re fundamentally asking:

“Do different environments support different insect communities, and why?”

This question connects to foundational ecological theories:

Key Ecological Concepts:

Niche Theory: Each species has specific environmental requirements (temperature, moisture, food). Species occur where conditions match their niche.
Habitat Filtering: The environment “filters” which species can survive. Forest conditions filter out sun-loving species; agricultural disturbance filters out sensitive specialists.
Species-Area Relationship: Larger, more complex habitats typically support more species.
Indicator Species: Some species are sensitive to environmental conditions and can indicate habitat quality or disturbance level.

Our statistical analyses are tools to detect these ecological patterns and test whether the differences we observe are real or just due to chance.

The Complete Analysis Roadmap

Before writing any code, understand the journey from raw data to ecological insights:

╔═══════════════════════════════════════════════════════════════════════════════╗
║                    COMPLETE ANALYSIS WORKFLOW                                  ║
╚═══════════════════════════════════════════════════════════════════════════════╝

                              ┌─────────────────┐
                              │   RAW DATA      │
                              │  (Your pitfall  │
                              │   trap counts)  │
                              └────────┬────────┘
                                       │
        ┌──────────────────────────────┼──────────────────────────────┐
        │                              ▼                              │
        │  ╔═══════════════════════════════════════════════════════╗  │
        │  ║  PHASE 1: DATA PREPARATION (Day 1, Sessions 1-3)      ║  │
        │  ╠═══════════════════════════════════════════════════════╣  │
        │  ║  • Import data into R                                 ║  │
        │  ║  • Check for errors (typos, impossible values)        ║  │
        │  ║  • Standardize names (taxonomy is messy!)             ║  │
        │  ║  • Create community matrix (sites × species)          ║  │
        │  ║                                                       ║  │
        │  ║  KEY QUESTION: "Is my data clean and trustworthy?"    ║  │
        │  ╚═══════════════════════════════════════════════════════╝  │
        │                              │                              │
        │                              ▼                              │
        │  ╔═══════════════════════════════════════════════════════╗  │
        │  ║  PHASE 2: DATA EXPLORATION (Day 1, Session 4)         ║  │
        │  ╠═══════════════════════════════════════════════════════╣  │
        │  ║  • Summarize: How many species? How many individuals? ║  │
        │  ║  • Explore: Which insect ORDERS are abundant?         ║  │
        │  ║  • Decide: Which groups have "interesting stories"?   ║  │
        │  ║  • Visualize: What patterns can we SEE?               ║  │
        │  ║                                                       ║  │
        │  ║  KEY QUESTION: "Which insect groups should I focus    ║  │
        │  ║                 my analysis on?"                      ║  │
        │  ╚═══════════════════════════════════════════════════════╝  │
        │                              │                              │
   D    │                              ▼                              │
   A    │  ╔═══════════════════════════════════════════════════════╗  │
   Y    │  ║  PHASE 3: DIVERSITY ANALYSIS (Day 2, Session 5)       ║  │
        │  ╠═══════════════════════════════════════════════════════╣  │
   1    │  ║  ALPHA DIVERSITY (within-site)                        ║  │
        │  ║  • How many species at each site? (Richness)          ║  │
   |    │  ║  • How diverse is each site? (Shannon, Simpson)       ║  │
        │  ║  • Are differences due to sampling effort?            ║  │
   D    │  ║    (Rarefaction)                                      ║  │
   A    │  ║                                                       ║  │
   Y    │  ║  KEY QUESTION: "Which habitats are most diverse?"     ║  │
        │  ╚═══════════════════════════════════════════════════════╝  │
   2    │                              │                              │
        │                              ▼                              │
        │  ╔═══════════════════════════════════════════════════════╗  │
        │  ║  PHASE 4: COMMUNITY ANALYSIS (Day 2, Sessions 6-7)    ║  │
        │  ╠═══════════════════════════════════════════════════════╣  │
        │  ║  BETA DIVERSITY (between-sites)                       ║  │
        │  ║  • How different are communities? (Distance matrix)   ║  │
        │  ║  • Can we VISUALIZE the pattern? (NMDS/PCoA)          ║  │
        │  ║  • Is the pattern SIGNIFICANT? (PERMANOVA)            ║  │
        │  ║  • WHICH species drive differences? (SIMPER)          ║  │
        │  ║                                                       ║  │
        │  ║  KEY QUESTION: "Do habitats have different            ║  │
        │  ║                 communities? Which species matter?"   ║  │
        │  ╚═══════════════════════════════════════════════════════╝  │
        │                              │                              │
        │                              ▼                              │
        │  ╔═══════════════════════════════════════════════════════╗  │
        │  ║  PHASE 5: INTERPRETATION (Day 2, Session 8)           ║  │
        │  ╠═══════════════════════════════════════════════════════╣  │
        │  ║  • What do the numbers MEAN ecologically?             ║  │
        │  ║  • What MECHANISMS might explain the patterns?        ║  │
        │  ║  • What are the LIMITATIONS?                          ║  │
        │  ║  • What NEW QUESTIONS arise?                          ║  │
        │  ║                                                       ║  │
        │  ║  KEY QUESTION: "So what? Why does this matter?"       ║  │
        │  ╚═══════════════════════════════════════════════════════╝  │
        │                              │                              │
        └──────────────────────────────┼──────────────────────────────┘
                                       ▼
                              ┌─────────────────┐
                              │  PUBLICATION /  │
                              │  THESIS CHAPTER │
                              └─────────────────┘

Workshop Schedule

╔══════════════════════════════════════════════════════════════════════════════╗
║  DAY 1: DATA FOUNDATIONS & EXPLORATION           9:00 - 15:00                ║
╠══════════════════════════════════════════════════════════════════════════════╣
║  09:00 - 10:15  Session 1: R Basics & Importing Data               75 min    ║
║  10:15 - 10:30  ☕ Break                                            15 min    ║
║  10:30 - 11:30  Session 2: Data Wrangling with tidyverse           60 min    ║
║  11:30 - 12:30  🍽️ Lunch                                            60 min   ║
║  12:30 - 13:45  Session 3: Data Cleaning & Community Matrices      75 min    ║
║  13:45 - 14:00  ☕ Break                                            15 min    ║
║  14:00 - 15:00  Session 4: Choosing Your Focal Taxa                60 min    ║
╠══════════════════════════════════════════════════════════════════════════════╣
║  DAY 2: ANALYSIS & INTERPRETATION                9:00 - 15:00                ║
╠══════════════════════════════════════════════════════════════════════════════╣
║  09:00 - 10:00  Session 5: Diversity Metrics & Visualization       60 min    ║
║  10:00 - 10:15  ☕ Break                                            15 min    ║
║  10:15 - 11:15  Session 6: Ordination - Seeing Community Patterns  60 min    ║
║  11:15 - 12:15  🍽️ Lunch                                            60 min   ║
║  12:15 - 13:30  Session 7: Statistical Testing                     75 min    ║
║  13:30 - 13:45  ☕ Break                                            15 min    ║
║  13:45 - 14:45  Session 8: Interpretation & Synthesis              60 min    ║
║  14:45 - 15:00  Wrap-up & Next Steps                               15 min    ║
╚══════════════════════════════════════════════════════════════════════════════╝

Our Dataset

We’ll use sample_insect_data.csv - a pitfall trap survey of ground-dwelling arthropods.

About Pitfall Traps:

Pitfall traps are cups buried flush with the ground surface. Ground-active arthropods fall in and are collected. This method is:

Good for: Ground beetles (Carabidae), ants (Formicidae), spiders (Araneae), rove beetles (Staphylinidae)
Poor for: Flying insects, canopy-dwellers, sedentary species
Biased by: Trap activity (more active = more caught), body size, weather

Always consider sampling bias when interpreting results!

Column	Description
site	Site identifier (S01-S12)
habitat	Habitat type (forest, grassland, agriculture)
landscape	Landscape complexity (complex, simple)
order	Insect order (Coleoptera, Hymenoptera, etc.)
family	Insect family
morphospecies	Species identifier
abundance	Number of individuals

DAY 1: DATA FOUNDATIONS & EXPLORATION

1 Session 1: R Basics & Importing Data

Time: 09:00 - 10:15 (75 minutes)

1.1 Why R for Ecology?

Why do we use R instead of Excel?

Reproducibility: Your analysis is saved as a script. Anyone (including future you) can re-run it exactly.
Ecological packages: R has specialized packages like vegan designed specifically for community ecology.
Handling complexity: Ecological data often has hundreds of species and complex designs. R handles this easily.
Publication-quality figures: R creates beautiful, customizable plots ready for journals.

1.2 Loading Packages

# Packages are collections of functions that extend R's capabilities
# We load them at the start of EVERY session

library(tidyverse)    # Data wrangling and visualization
library(vegan)        # Community ecology analyses (diversity, ordination)
library(ape)          # Phylogenetic tools and PCoA ordination

# If you get "there is no package called 'x'", install it first:
# install.packages("x")

Why these packages?

tidyverse: Makes data manipulation intuitive with “verbs” like filter, select, mutate
vegan: The gold standard for community ecology - diversity indices, NMDS, PERMANOVA
ape: Provides PCoA ordination and phylogenetic analysis tools

1.3 Basic R Operations

# R as a calculator
2 + 3         # Addition
10 / 2        # Division
5^2           # Power (5 squared = 25)
sqrt(16)      # Square root

# Creating objects with <- (assignment operator)
species_count <- 42           # Store a number
site_name <- "Forest_A"       # Store text

# Vectors: collections of values (fundamental to R!)
abundance <- c(12, 45, 23, 8, 67, 34)
species <- c("Carabidae_sp1", "Carabidae_sp2", "Staphylinidae_sp1")

# Basic operations on vectors
length(abundance)        # How many elements? → 6
sum(abundance)           # Total → 189
mean(abundance)          # Average → 31.5
max(abundance)           # Maximum → 67
min(abundance)           # Minimum → 8

# Extracting elements (indexing)
abundance[1]             # First element → 12
abundance[c(1, 3, 5)]    # Elements 1, 3, and 5
abundance[abundance > 30] # Only values > 30 → 45, 67, 34

1.4 Importing Data

# Import our main dataset
# Make sure the file is in your working directory or provide full path
insect_data <- read.csv("data/raw/sample_insect_data.csv")

# More robust import with options for common issues:
insect_data <- read.csv(
  "data/raw/sample_insect_data.csv",
  header = TRUE,                    # First row contains column names
  stringsAsFactors = FALSE,         # Keep text as text (not factors)
  na.strings = c("", "NA", "N/A", "-")  # Recognize these as missing data
)

1.5 Exploring Your Data

Why explore data immediately?

Before any analysis, you need to know:

Did the data import correctly?
Are there unexpected values or errors?
What is the structure of your data?

Never trust data blindly - always look at it first!

# First look at your data
head(insect_data)         # First 6 rows
tail(insect_data)         # Last 6 rows
View(insect_data)         # Spreadsheet view (capital V!)

# Structure and dimensions
str(insect_data)          # Column types and preview
dim(insect_data)          # Rows × columns
nrow(insect_data)         # Number of rows
ncol(insect_data)         # Number of columns
names(insect_data)        # Column names

# Summary statistics
summary(insect_data)      # Statistical summary of each column

1.6 Checking for Problems

# Missing values (NA = "Not Available")
sum(is.na(insect_data))           # Total NAs in entire dataset
colSums(is.na(insect_data))       # NAs per column

# Check categorical columns for typos
unique(insect_data$habitat)       # What habitat values exist?
unique(insect_data$order)         # What insect orders are present?

# Frequency tables
table(insect_data$habitat)        # How many records per habitat?
table(insect_data$order)          # How many records per order?

# Check numeric ranges for impossible values
range(insect_data$abundance)      # Min and max abundance
summary(insect_data$abundance)    # Full summary

# Look for suspicious values
insect_data[insect_data$abundance < 0, ]  # Negative abundances?

☕ BREAK: 10:15 - 10:30

2 Session 2: Data Wrangling with tidyverse

Time: 10:30 - 11:30 (60 minutes)

2.1 The Pipe Operator: Your Best Friend

Why use pipes (%>%)?

Pipes make code readable by turning nested operations into a step-by-step recipe. Read %>% as “THEN”.

Instead of: “Bake(Mix(Measure(flour, sugar), eggs))”

Write: “Take flour and sugar, THEN measure, THEN mix with eggs, THEN bake”

library(tidyverse)

# Without pipe - confusing nested functions:
round(mean(sqrt(c(1, 4, 9, 16))), 2)

# With pipe - clear step-by-step:
c(1, 4, 9, 16) %>%    # Start with these numbers
  sqrt() %>%           # THEN take square root
  mean() %>%           # THEN calculate mean
  round(2)             # THEN round to 2 decimals

# Keyboard shortcut: Ctrl + Shift + M (Windows) or Cmd + Shift + M (Mac)

2.2 The Core dplyr Verbs

Think of these as actions you perform on your data:

Verb	Action	Example Use
`filter()`	Keep rows that match a condition	Keep only forest sites
`select()`	Keep or remove columns	Keep only site, species, abundance
`mutate()`	Create new columns	Calculate log-transformed abundance
`arrange()`	Sort rows	Sort by abundance (high to low)
`summarise()`	Collapse to summary statistics	Calculate mean abundance
`group_by()`	Group data for operations	Calculate mean BY habitat

2.2.1 filter(): Keep Rows Matching Conditions

# Keep only forest sites
insect_data %>%
  filter(habitat == "forest")

# Multiple conditions with AND (comma or &)
insect_data %>%
  filter(habitat == "forest", abundance > 10)

# Multiple options with OR (|) or %in%
insect_data %>%
  filter(habitat %in% c("forest", "grassland"))

# Exclude with !=
insect_data %>%
  filter(habitat != "agriculture")

2.2.2 select(): Choose Columns

# Keep specific columns
insect_data %>%
  select(site, habitat, morphospecies, abundance)

# Remove columns with minus sign
insect_data %>%
  select(-family)

# Select helpers for patterns
insect_data %>%
  select(starts_with("site"))     # Columns starting with "site"

insect_data %>%
  select(contains("species"))     # Columns containing "species"

2.2.3 mutate(): Create New Columns

insect_data %>%
  mutate(
    # Log-transform abundance (common in ecology due to right-skewed distributions)
    log_abundance = log(abundance + 1),  # +1 because log(0) is undefined
    
    # Create combined categories
    habitat_landscape = paste(habitat, landscape, sep = "_")
  )

2.2.4 group_by() + summarise(): The Power Combination

Why is this so powerful?

In ecology, we almost always want to compare GROUPS (habitats, treatments, sites). group_by() tells R to do calculations separately for each group.

# Calculate summary statistics BY habitat
insect_data %>%
  group_by(habitat) %>%
  summarise(
    total_abundance = sum(abundance),
    n_species = n_distinct(morphospecies),
    n_sites = n_distinct(site),
    mean_abundance = mean(abundance),
    .groups = "drop"  # Clean up grouping after summarise
  )

# Multiple grouping variables
insect_data %>%
  group_by(habitat, order) %>%
  summarise(
    total_abundance = sum(abundance),
    n_species = n_distinct(morphospecies),
    .groups = "drop"
  )

2.3 Reshaping Data: Wide vs Long

Ecological Data Formats:

Long format: One row per observation (site-species-abundance). Good for: data entry, plotting, dplyr operations.
Wide format (Community Matrix): Rows = sites, Columns = species. Required for: vegan package, ordination, diversity indices.

Most raw data is long; most analyses need wide. You’ll convert between them constantly!

# LONG TO WIDE: Create a community matrix
species_wide <- insect_data %>%
  group_by(site, morphospecies) %>%
  summarise(abundance = sum(abundance), .groups = "drop") %>%
  pivot_wider(
    names_from = morphospecies,    # Species become columns
    values_from = abundance,        # Cell values are abundances
    values_fill = 0                 # Missing = 0 (species absent)
  )

head(species_wide)

# WIDE TO LONG: For plotting
species_long <- species_wide %>%
  pivot_longer(
    cols = -site,             # All columns except site
    names_to = "species",     # Column names go to "species"
    values_to = "abundance"   # Values go to "abundance"
  )

🍽️ LUNCH: 11:30 - 12:30

3 Session 3: Data Cleaning & Community Matrices

Time: 12:30 - 13:45 (75 minutes)

3.1 Why Data Cleaning Matters

“Garbage In, Garbage Out”

Taxonomic data is notoriously messy. Common problems:

Inconsistent spelling: “Carabidae sp1” vs “Carabidae_sp1” vs “carabidae sp. 1”
Case sensitivity: “Forest” vs “forest” are treated as different!
Trailing spaces: “forest” (with space) ≠ “forest”
Impossible values: Negative abundances, abundances of 999999

Your cleaning decisions affect your conclusions. Document what you did!

# Check for problems first
unique(insect_data$habitat)  # Look for typos, case issues

# Common cleaning operations
clean_data <- insect_data %>%
  mutate(
    # Standardize text: lowercase and remove extra spaces
    habitat = tolower(trimws(habitat)),
    order = tolower(trimws(order)),
    
    # Fix specific typos (if any exist)
    habitat = case_when(
      habitat == "forrest" ~ "forest",    # Common typo
      habitat == "grasland" ~ "grassland",
      TRUE ~ habitat                       # Keep all others as-is
    )
  ) %>%
  # Remove impossible or missing values
  filter(
    abundance >= 0,        # No negative abundances
    !is.na(abundance),     # No missing abundances
    !is.na(morphospecies)  # No missing species IDs
  )

# Verify cleaning worked
unique(clean_data$habitat)

3.2 Creating the Community Matrix

What is a Community Matrix?

The community matrix is the fundamental data structure in community ecology:

              Species_A  Species_B  Species_C  Species_D
Site_1           12          5          0          8
Site_2            0         23         15          2
Site_3            7          0          0         45

Rows = Sampling units (sites, traps, plots)
Columns = Taxa (species, morphospecies)
Values = Abundance (or presence/absence: 0/1)

This matrix is what ordination and diversity analyses use!

# For this workshop, we'll focus on beetles (Coleoptera)
# We'll learn HOW to choose focal taxa in Session 4

comm_matrix <- insect_data %>%
  filter(order == "Coleoptera") %>%                    # Focus on beetles
  group_by(site, morphospecies) %>%
  summarise(abundance = sum(abundance), .groups = "drop") %>%
  pivot_wider(
    names_from = morphospecies,
    values_from = abundance,
    values_fill = 0    # CRITICAL: Missing species = 0, not NA
  ) %>%
  column_to_rownames("site")  # Site names become row names

# Check the result
dim(comm_matrix)  # Sites × Species
head(comm_matrix[, 1:5])  # First 5 species

Why values_fill = 0?

In ecology, if we didn’t record a species at a site, we assume it was absent (abundance = 0), not “unknown” (NA).

This is an important assumption! It assumes: - Equal sampling effort across sites - Species detection is reliable

If sampling effort varied, you should use rarefaction (Session 5).

3.3 Creating Environmental Data

# Environmental data: one row per site with habitat info
env_data <- insect_data %>%
  select(site, habitat, landscape) %>%
  distinct() %>%                           # One row per site
  column_to_rownames("site")               # Site names as row names

# CRITICAL: Ensure rows match between matrices!
env_data <- env_data[rownames(comm_matrix), , drop = FALSE]
# rownames(comm_matrix): Extracts the unique IDs (like "S01", "S02") from your insect data.
# env_data[..., ]: Selects and reorders rows in the environmental table based on those IDs
# , (The blank space):  Tells R to keep all columns (e.g., habitat, elevation, distance).
# drop = FALSE --   A "defensive" command. It ensures that even if you only have one environmental variable, R keeps the data as a data frame instead of turning it into a simple list (vector)..


# Verify alignment (MUST be TRUE!)
all(rownames(comm_matrix) == rownames(env_data))

☕ BREAK: 13:45 - 14:00

4 Session 4: Choosing Your Focal Taxa

Time: 14:00 - 15:00 (60 minutes)

The Central Question of This Session:

Your pitfall traps caught multiple insect orders (Coleoptera, Hymenoptera, Araneae, etc.). Which groups should you analyze in detail?

Not all groups are equal! Some have ecological stories to tell; others are just noise.

4.1 The Philosophy of Focal Taxa Selection

Why Not Analyze Everything?

Statistical power: Groups with few individuals give unreliable results
Ecological relevance: Not all groups respond to habitat differences
Sampling bias: Pitfall traps don’t catch all groups equally
Focused story: Papers are clearer when they focus on specific groups

Good focal taxa have:

Sufficient abundance (≥50 individuals total)
Multiple species (≥5 morphospecies)
Presence across most sites
Interesting variation between habitats
Ecological relevance to your question

4.2 Decision Framework for Choosing Focal Taxa

╔═══════════════════════════════════════════════════════════════════════════════╗
║               DECISION FLOWCHART: CHOOSING FOCAL INSECT GROUPS                ║
╚═══════════════════════════════════════════════════════════════════════════════╝

For each insect ORDER in your data:
            │
            ▼
    ┌───────────────────────────────┐
    │  CRITERION 1: ABUNDANCE       │
    │  Total individuals ≥ 50?      │
    └───────────────┬───────────────┘
                    │
         NO ◄───────┴───────► YES
          │                    │
          ▼                    ▼
┌─────────────────┐   ┌───────────────────────────────┐
│  EXCLUDE        │   │  CRITERION 2: DIVERSITY       │
│  Too few data   │   │  Number of morphospecies ≥ 5? │
│  for statistics │   └───────────────┬───────────────┘
└─────────────────┘                   │
                           NO ◄───────┴───────► YES
                            │                    │
                            ▼                    ▼
                ┌─────────────────┐   ┌───────────────────────────────┐
                │  POSSIBLE but   │   │  CRITERION 3: OCCURRENCE      │
                │  Limited        │   │  Present in ≥ 50% of sites?   │
                │  (can only      │   └───────────────┬───────────────┘
                │  analyze abund- │                   │
                │  ance, not      │        NO ◄───────┴───────► YES
                │  diversity)     │         │                    │
                └─────────────────┘         ▼                    ▼
                               ┌─────────────────┐   ┌───────────────────────────────┐
                               │  CAUTION        │   │  CRITERION 4: VARIATION       │
                               │  May be habitat │   │  Shows differences between    │
                               │  specialist or  │   │  habitats? (CV > 30%)         │
                               │  rare           │   └───────────────┬───────────────┘
                               └─────────────────┘                   │
                                                          NO ◄───────┴───────► YES
                                                          │                    │
                                                          ▼                    ▼
                                              ┌─────────────────┐   ┌─────────────────┐
                                              │  POSSIBLE       │   │  EXCELLENT      │
                                              │  May be         │   │  FOCAL TAXON!   │
                                              │  generalist     │   │                 │
                                              │  (still         │   │  High priority  │
                                              │  interesting!)  │   │  for analysis   │
                                              └─────────────────┘   └─────────────────┘

4.3 Implementing the Decision Framework

4.3.1 Step 1: Overall Summary

# What do we have?
cat("=== DATASET OVERVIEW ===\n")
cat("Total records:", nrow(insect_data), "\n")
cat("Total individuals:", sum(insect_data$abundance), "\n")
cat("Number of sites:", n_distinct(insect_data$site), "\n")
cat("Number of habitats:", n_distinct(insect_data$habitat), "\n")
cat("Number of orders:", n_distinct(insect_data$order), "\n")
cat("Number of morphospecies:", n_distinct(insect_data$morphospecies), "\n")

# How are individuals distributed across orders?
insect_data %>%
  group_by(order) %>%
  summarise(
    total_abundance = sum(abundance),
    percent_of_total = round(sum(abundance) / sum(insect_data$abundance) * 100, 1)
  ) %>%
  arrange(desc(total_abundance))

4.3.2 Step 2: Evaluate Each Order

# Calculate all criteria for each order
order_evaluation <- insect_data %>%
  group_by(order) %>%
  summarise(
    # Criterion 1: Abundance
    total_abundance = sum(abundance),
    
    # Criterion 2: Diversity
    n_morphospecies = n_distinct(morphospecies),
    
    # Criterion 3: Occurrence
    n_sites_present = n_distinct(site[abundance > 0]),
    pct_sites = round(n_distinct(site[abundance > 0]) / n_distinct(insect_data$site) * 100, 0),
    
    .groups = "drop"
  ) %>%
  arrange(desc(total_abundance))

print(order_evaluation)

# Visualize: Which orders are most abundant?
ggplot(order_evaluation, aes(x = reorder(order, total_abundance), 
                              y = total_abundance)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = paste0("n=", n_morphospecies, " spp")), 
            hjust = -0.1, size = 3) +
  coord_flip() +
  labs(
    x = "Order",
    y = "Total Abundance",
    title = "Which Insect Orders Dominate Your Samples?",
    subtitle = "Numbers show morphospecies count"
  ) +
  theme_bw()

4.3.3 Step 3: Calculate Habitat Variation

# For Criterion 4: Do abundances vary between habitats?
habitat_variation <- insect_data %>%
  group_by(order, habitat) %>%
  summarise(abundance = sum(abundance), .groups = "drop") %>%
  group_by(order) %>%
  summarise(
    # Coefficient of Variation: higher = more variation between habitats
    cv_abundance = round(sd(abundance) / mean(abundance) * 100, 1),
    # Which habitat has most individuals?
    dominant_habitat = habitat[which.max(abundance)],
    # How much more in best vs worst habitat?
    max_min_ratio = round(max(abundance) / (min(abundance) + 1), 1),
    .groups = "drop"
  )

print(habitat_variation)

Interpreting Coefficient of Variation (CV):

CV < 30%: Low variation - this group is similar across habitats (generalist)
CV 30-50%: Moderate variation - some habitat preference
CV > 50%: High variation - strong habitat specialist

Both patterns are ecologically interesting! Generalists tell us habitats are functionally similar for that group; specialists show habitat filtering.

4.3.4 Step 4: Score and Rank Orders

# Combine criteria and score each order
focal_taxa_scores <- order_evaluation %>%
  left_join(habitat_variation, by = "order") %>%
  mutate(
    # Score each criterion (0-3 points each)
    score_abundance = case_when(
      total_abundance >= 100 ~ 3,
      total_abundance >= 50 ~ 2,
      total_abundance >= 30 ~ 1,
      TRUE ~ 0
    ),
    score_diversity = case_when(
      n_morphospecies >= 10 ~ 3,
      n_morphospecies >= 5 ~ 2,
      n_morphospecies >= 3 ~ 1,
      TRUE ~ 0
    ),
    score_occurrence = case_when(
      pct_sites >= 80 ~ 3,
      pct_sites >= 50 ~ 2,
      pct_sites >= 30 ~ 1,
      TRUE ~ 0
    ),
    score_variation = case_when(
      cv_abundance >= 50 ~ 3,
      cv_abundance >= 30 ~ 2,
      cv_abundance >= 15 ~ 1,
      TRUE ~ 0
    ),
    
    # Total score (max = 12)
    total_score = score_abundance + score_diversity + score_occurrence + score_variation,
    
    # Final recommendation
    recommendation = case_when(
      total_score >= 10 ~ "★★★ Highly recommended",
      total_score >= 7 ~ "★★ Recommended",
      total_score >= 4 ~ "★ Possible with caution",
      TRUE ~ "✗ Not recommended"
    )
  ) %>%
  arrange(desc(total_score))

# Display decision table
cat("\n========================================\n")
cat("   FOCAL TAXA RECOMMENDATION TABLE\n")
cat("========================================\n\n")

focal_taxa_scores %>%
  select(order, total_abundance, n_morphospecies, pct_sites, 
         cv_abundance, total_score, recommendation) %>%
  print(n = Inf)

4.3.5 Step 5: Visualize the Decision

# Create visual summary of recommendations
ggplot(focal_taxa_scores, aes(x = reorder(order, total_score), 
                               y = total_score,
                               fill = recommendation)) +
  geom_col() +
  geom_hline(yintercept = c(4, 7, 10), linetype = "dashed", alpha = 0.5) +
  coord_flip() +
  scale_fill_manual(values = c(
    "★★★ Highly recommended" = "#27ae60",
    "★★ Recommended" = "#3498db",
    "★ Possible with caution" = "#f39c12",
    "✗ Not recommended" = "#95a5a6"
  )) +
  labs(
    x = "Order",
    y = "Suitability Score (max = 12)",
    title = "Which Insect Groups Should You Analyze?",
    subtitle = "Based on abundance, diversity, occurrence, and habitat variation",
    fill = "Recommendation"
  ) +
  theme_bw() +
  theme(legend.position = "bottom")

4.4 Ecological Relevance of Common Pitfall Groups

Common Pitfall Trap Groups and Their Ecological Value:

Group	Pitfall Suitability	Ecological Role	Indicator Value
Carabidae (ground beetles)	★★★ Excellent	Predators, seed eaters	High - respond to disturbance, moisture
Formicidae (ants)	★★★ Excellent	Ecosystem engineers	High - sensitive to temperature, disturbance
Araneae (spiders)	★★ Good	Predators	High - indicate prey availability
Staphylinidae (rove beetles)	★★ Good	Predators, decomposers	Moderate - very diverse
Scarabaeidae (dung beetles)	★★ Good	Decomposers	High - depend on mammal presence
Orthoptera (grasshoppers)	★ Moderate	Herbivores	Moderate - vegetation-dependent
Diptera (flies)	✗ Poor	Various	Low - mostly fly away from traps
Lepidoptera (moths)	✗ Poor	Herbivores, pollinators	Low - rarely caught in pitfalls

4.5 Preview: Patterns in Your Chosen Group

# Once you've chosen a focal group (e.g., Coleoptera), preview the patterns

# Create community matrix for chosen group
focal_matrix <- insect_data %>%
  filter(order == "Coleoptera") %>%  # Replace with your chosen order
  group_by(site, morphospecies) %>%
  summarise(abundance = sum(abundance), .groups = "drop") %>%
  pivot_wider(names_from = morphospecies, values_from = abundance, values_fill = 0) %>%
  column_to_rownames("site")

# Get environmental data
focal_env <- insect_data %>%
  select(site, habitat, landscape) %>%
  distinct() %>%
  column_to_rownames("site")
focal_env <- focal_env[rownames(focal_matrix), ]

# Quick NMDS preview (we'll learn this properly tomorrow!)
library(vegan)
set.seed(123)
nmds_preview <- metaMDS(focal_matrix, distance = "bray", k = 2, trymax = 50)

# Plot preview
nmds_scores <- as.data.frame(scores(nmds_preview, display = "sites"))
nmds_scores$habitat <- focal_env$habitat

ggplot(nmds_scores, aes(x = NMDS1, y = NMDS2, color = habitat)) +
  geom_point(size = 4) +
  stat_ellipse(linetype = 2) +
  labs(
    title = "Preview: Do Beetle Communities Differ by Habitat?",
    subtitle = paste("NMDS stress =", round(nmds_preview$stress, 3),
                     "- We'll interpret this properly tomorrow!"),
    color = "Habitat"
  ) +
  scale_color_brewer(palette = "Set1") +
  theme_bw() +
  coord_equal()

End of Day 1 Checkpoint:

Before tomorrow, you should have:

✅ Imported and cleaned your data
✅ Explored which insect orders are present
✅ Applied the decision framework to choose focal taxa
✅ Created a community matrix for your focal group
✅ Documented your focal taxa choice and rationale

Objects we’ll use tomorrow:

insect_data - The raw data
comm_matrix - Community matrix for your focal group
env_data - Environmental data matching the community matrix

END OF DAY 1

DAY 2: ANALYSIS & INTERPRETATION

5 Session 5: Diversity Metrics & Visualization

Time: 09:00 - 10:00 (60 minutes)

5.1 Day 2 Setup

# Load packages
library(tidyverse)
library(vegan)
library(ape)
set.seed(123)

# If starting fresh, recreate from Day 1:
insect_data <- read.csv("data/raw/sample_insect_data.csv")

# Community matrix (using beetles as our focal group)
comm_matrix <- insect_data %>%
  filter(order == "Coleoptera") %>%
  group_by(site, morphospecies) %>%
  summarise(abundance = sum(abundance), .groups = "drop") %>%
  pivot_wider(names_from = morphospecies, values_from = abundance, values_fill = 0) %>%
  column_to_rownames("site")

# Environmental data
env_data <- insect_data %>%
  select(site, habitat, landscape) %>%
  distinct() %>%
  column_to_rownames("site")
env_data <- env_data[rownames(comm_matrix), , drop = FALSE]

# Verify alignment
stopifnot(all(rownames(comm_matrix) == rownames(env_data)))
cat("✓ Data loaded successfully!\n")
cat("  Community matrix:", nrow(comm_matrix), "sites ×", ncol(comm_matrix), "species\n")

5.2 Understanding Diversity: A Tale of Two Communities

Why Do We Need Multiple Diversity Metrics?

Consider two beetle communities, each with 100 individuals and 10 species:

Community A (Uneven):

Species 1: 91 individuals  ████████████████████████████████████████
Species 2:  1 individual   █
Species 3:  1 individual   █
... (7 more species with 1 each)

Community B (Even):

Species 1: 10 individuals  ████
Species 2: 10 individuals  ████
Species 3: 10 individuals  ████
... (7 more species with 10 each)

Both have richness = 10, but they’re very different! - Community A is dominated by one species (low evenness) - Community B is balanced (high evenness)

Different diversity indices capture different aspects of this structure.

5.3 Alpha Diversity Metrics

# SPECIES RICHNESS (S)
# Simply: How many species are present?
richness <- specnumber(comm_matrix)

# SHANNON DIVERSITY (H')
# Combines richness AND evenness
# Higher = more diverse OR more even
shannon <- diversity(comm_matrix, index = "shannon")

# SIMPSON DIVERSITY (1-D)
# Probability that two random individuals are different species
# Higher = more diverse
simpson <- diversity(comm_matrix, index = "simpson")

# EVENNESS (Pielou's J)
# How evenly are individuals distributed among species?
# J = H' / ln(S)
# J = 1 means perfectly even; J → 0 means one species dominates
evenness <- shannon / log(richness)

# Compile results
alpha_diversity <- data.frame(
  site = rownames(comm_matrix),
  richness = richness,
  shannon = round(shannon, 3),
  simpson = round(simpson, 3),
  evenness = round(evenness, 3)
) %>%
  left_join(env_data %>% rownames_to_column("site"), by = "site")

print(alpha_diversity)

When to Use Each Index:

Index	Best For	Interpretation
Richness	Simple comparisons, presence/absence data	“How many species?”
Shannon	General diversity, most publications	“How uncertain is species identity?”
Simpson	When dominance matters	“What’s the chance two individuals differ?”
Evenness	Understanding community structure	“Is one species dominating?”

For most ecological studies, report at least richness AND Shannon.

5.4 Rarefaction: Correcting for Sampling Effort

The Sampling Problem:

You collected 500 beetles at Site A (found 25 species) and 100 beetles at Site B (found 15 species).

Is Site A really more diverse? Or did you just sample more?

Rarefaction asks: “If I had caught only 100 beetles at Site A, how many species would I expect?”

# Check sample sizes (total individuals per site)
sample_sizes <- rowSums(comm_matrix)
cat("Sample sizes per site:\n")
print(sample_sizes)
cat("\nRange:", min(sample_sizes), "-", max(sample_sizes), "individuals\n")

# If sample sizes differ greatly, rarefaction is important!
if (max(sample_sizes) / min(sample_sizes) > 2) {
  cat("⚠️ Sample sizes vary by more than 2x. Rarefaction recommended!\n")
}

# Rarefy to smallest sample size
min_n <- min(sample_sizes)
rarefied_richness <- rarefy(comm_matrix, sample = min_n)

# Compare raw vs rarefied
comparison <- data.frame(
  site = rownames(comm_matrix),
  raw_richness = richness,
  sample_size = sample_sizes,
  rarefied_richness = round(rarefied_richness, 1),
  difference = round(richness - rarefied_richness, 1)
)
print(comparison)

# Rarefaction curves
rarecurve(comm_matrix, 
          step = 5,
          col = rainbow(nrow(comm_matrix)),
          lwd = 2,
          xlab = "Number of Individuals Sampled",
          ylab = "Species Richness",
          main = "Rarefaction Curves: Are Sites Adequately Sampled?")
abline(v = min_n, lty = 2, col = "gray")
legend("bottomright", legend = rownames(comm_matrix),
       col = rainbow(nrow(comm_matrix)), lwd = 2, cex = 0.7)

Reading Rarefaction Curves:

Steep curve: Still discovering new species (undersampled)
Flattening curve: Approaching true richness (well-sampled)
Compare at same x-value: Fair comparison of richness

If curves are still steep at your sample sizes, you probably haven’t captured the full community!

5.5 Visualizing and Testing Diversity Differences

# Visualize diversity by habitat
ggplot(alpha_diversity, aes(x = habitat, y = shannon, fill = habitat)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(width = 0.1, size = 3, alpha = 0.7) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    x = "Habitat",
    y = "Shannon Diversity (H')",
    title = "Does Beetle Diversity Differ Between Habitats?"
  ) +
  theme_bw() +
  theme(legend.position = "none")

Why Kruskal-Wallis Test?

We use Kruskal-Wallis instead of ANOVA because:

Small sample sizes per group (often n < 10)
Diversity data often non-normal
It’s a “non-parametric” test - fewer assumptions

Kruskal-Wallis tests whether the RANKS differ between groups.

# Statistical test: Do habitats differ in diversity?
kruskal_result <- kruskal.test(shannon ~ habitat, data = alpha_diversity)
print(kruskal_result)

# Interpretation
cat("\n=== INTERPRETATION ===\n")
if (kruskal_result$p.value < 0.05) {
  cat("✓ Significant difference in Shannon diversity between habitats (p =",
      round(kruskal_result$p.value, 4), ")\n")
  
  # Post-hoc pairwise tests
  cat("\nPairwise comparisons:\n")
  pairwise.wilcox.test(alpha_diversity$shannon, alpha_diversity$habitat,
                       p.adjust.method = "bonferroni")
} else {
  cat("✗ No significant difference detected (p =", 
      round(kruskal_result$p.value, 4), ")\n")
  cat("  This doesn't mean habitats are equal - we may lack statistical power.\n")
}

☕ BREAK: 10:00 - 10:15

6 Session 6: Ordination - Seeing Community Patterns

Time: 10:15 - 11:15 (60 minutes)

6.1 Why Ordination?

The Challenge of Multivariate Data:

Your community matrix has many species (columns). Each site exists in a high-dimensional “species space.”

Ordination reduces these many dimensions to 2-3 that we can visualize, while preserving the important patterns.

Key insight: Sites that are close together in an ordination plot have similar species compositions. Sites far apart are different.

6.2 The Foundation: Distance Matrices

# First, we calculate how DIFFERENT each pair of sites is
# Bray-Curtis dissimilarity is the standard for abundance data

dist_bray <- vegdist(comm_matrix, method = "bray")

# View as matrix (first few sites)
cat("Bray-Curtis Dissimilarity Matrix:\n")
print(round(as.matrix(dist_bray)[1:4, 1:4], 3))

Why Bray-Curtis Dissimilarity?

Dissimilarity	Formula Idea	Best For
Bray-Curtis	Compares shared vs total abundance	Abundance data (most common)
Jaccard	Compares shared vs total species	Presence/absence data
Euclidean	Straight-line distance	Continuous environmental data

Bray-Curtis is preferred because:

Range: 0 (identical) to 1 (completely different)
Two sites with no species in common = maximum dissimilarity
Joint absences don’t make sites similar (unlike Euclidean)

6.3 NMDS vs PCoA: Which to Choose?

╔═══════════════════════════════════════════════════════════════════════════════╗
║                    ORDINATION METHOD DECISION TREE                             ║
╚═══════════════════════════════════════════════════════════════════════════════╝

                    What type of data do you have?
                                    │
              ┌─────────────────────┼─────────────────────┐
              │                     │                     │
              ▼                     ▼                     ▼
        ┌──────────┐         ┌──────────┐         ┌──────────┐
        │ Abundance│         │ Presence/│         │ Environ- │
        │ data     │         │ Absence  │         │ mental   │
        └────┬─────┘         └────┬─────┘         └────┬─────┘
             │                    │                    │
             ▼                    ▼                    ▼
        Bray-Curtis          Jaccard            Consider PCA
        distance             distance           (different method)
             │                    │
             └──────────┬─────────┘
                        │
                        ▼
        What matters most for your question?
                        │
           ┌────────────┴────────────┐
           │                         │
           ▼                         ▼
    ┌──────────────┐         ┌──────────────┐
    │ Preserving   │         │ Knowing %    │
    │ RELATIVE     │         │ VARIANCE     │
    │ positions    │         │ explained    │
    │ (which sites │         │ by each axis │
    │ are similar) │         │              │
    └──────┬───────┘         └──────┬───────┘
           │                        │
           ▼                        ▼
      ┌────────┐              ┌────────┐
      │  NMDS  │              │  PCoA  │
      └────────┘              └────────┘
           │                        │
           ▼                        ▼
    "The ecology                "Microbiome
     standard"                   standard"
    Most published               Reports %
    ecology studies              variance

Quick Decision Guide:

Feature	NMDS	PCoA
What it preserves	Rank order of distances	Actual distances
Axes meaning	Arbitrary (no % variance)	% variance explained
Reproducibility	May vary (use set.seed!)	Always the same
Best for	Community ecology	When % variance matters
Stress/Quality	Report stress value	Report eigenvalues

For most ecological studies: Use NMDS. It’s the community standard and handles ecological data well.

Use PCoA when: You need to report % variance explained, or for microbiome data or when your data mostly consist of singletons or 0.

6.4 NMDS: Non-metric Multidimensional Scaling

# Run NMDS
set.seed(123)  # For reproducibility - NMDS uses random starts!

nmds_result <- metaMDS(
  comm_matrix,
  distance = "bray",     # Bray-Curtis dissimilarity
  k = 2,                 # 2 dimensions for visualization
  trymax = 100           # Try 100 random starting positions
)

# Check the stress value
cat("\n=== NMDS RESULTS ===\n")
cat("Stress value:", round(nmds_result$stress, 4), "\n")

Interpreting NMDS Stress:

Stress measures how well the 2D plot represents the actual multi-dimensional distances.

Stress	Quality	Interpretation
< 0.05	Excellent	Almost perfect representation
0.05 - 0.10	Good	Safe to interpret
0.10 - 0.20	Acceptable	Interpret with some caution
0.20 - 0.30	Poor	Distances may be misleading
> 0.30	Very poor	Do not interpret!

Always report stress in publications!

# Interpret stress value
if (nmds_result$stress < 0.10) {
  cat("✓ Good fit! The 2D plot accurately represents community distances.\n")
} else if (nmds_result$stress < 0.20) {
  cat("⚠ Acceptable fit. Interpret patterns with some caution.\n")
} else {
  cat("✗ Poor fit. Consider using 3 dimensions (k=3) or different approach.\n")
}

# Stressplot shows how well ordination distances match original distances
stressplot(nmds_result, main = "Shepard Diagram: How Well Does NMDS Fit?")

6.5 Creating NMDS Plots

# Extract site scores
nmds_scores <- as.data.frame(scores(nmds_result, display = "sites"))
nmds_scores$site <- rownames(nmds_scores)

# Add environmental data
nmds_scores <- nmds_scores %>%
  left_join(env_data %>% rownames_to_column("site"), by = "site")

# Create plot
p_nmds <- ggplot(nmds_scores, aes(x = NMDS1, y = NMDS2, 
                                   color = habitat, shape = landscape)) +
  geom_point(size = 4, alpha = 0.8) +
  stat_ellipse(aes(group = habitat), linetype = 2, level = 0.95) +
  scale_color_brewer(palette = "Set1") +
  labs(
    title = "NMDS: Beetle Community Composition",
    subtitle = paste("Stress =", round(nmds_result$stress, 3)),
    color = "Habitat",
    shape = "Landscape"
  ) +
  theme_bw() +
  coord_equal()  # IMPORTANT: Equal scaling on both axes!

print(p_nmds)

Reading NMDS Plots:

Close points = Similar beetle communities
Distant points = Different beetle communities
Cluster by color = Habitats have distinct communities
Overlapping ellipses = Communities are similar between habitats
Separated ellipses = Communities differ between habitats

6.6 PCoA: Principal Coordinates Analysis

# Run PCoA
pcoa_result <- pcoa(dist_bray)

# Check variance explained
eigenvalues <- pcoa_result$values$Eigenvalues
var_explained <- eigenvalues / sum(abs(eigenvalues)) * 100

cat("\n=== PCoA VARIANCE EXPLAINED ===\n")
cat("Axis 1:", round(var_explained[1], 1), "%\n")
cat("Axis 2:", round(var_explained[2], 1), "%\n")
cat("Axes 1+2:", round(sum(var_explained[1:2]), 1), "%\n")

# Extract scores and create plot
pcoa_scores <- as.data.frame(pcoa_result$vectors[, 1:2])
colnames(pcoa_scores) <- c("PCoA1", "PCoA2")
pcoa_scores$site <- rownames(pcoa_scores)
pcoa_scores <- pcoa_scores %>%
  left_join(env_data %>% rownames_to_column("site"), by = "site")

p_pcoa <- ggplot(pcoa_scores, aes(x = PCoA1, y = PCoA2, color = habitat)) +
  geom_point(size = 4, alpha = 0.8) +
  stat_ellipse(linetype = 2) +
  scale_color_brewer(palette = "Set1") +
  labs(
    title = "PCoA: Beetle Community Composition",
    x = paste0("PCoA1 (", round(var_explained[1], 1), "%)"),
    y = paste0("PCoA2 (", round(var_explained[2], 1), "%)"),
    color = "Habitat"
  ) +
  theme_bw() +
  coord_equal()

print(p_pcoa)

🍽️ LUNCH: 11:15 - 12:15

7 Session 7: Statistical Testing

Time: 12:15 - 13:30 (75 minutes)

7.1 The Statistical Testing Workflow

╔═══════════════════════════════════════════════════════════════════════════════╗
║               COMMUNITY ANALYSIS WORKFLOW                                      ║
╚═══════════════════════════════════════════════════════════════════════════════╝

        ┌──────────────────────────────────────────────────────────┐
        │  STEP 1: VISUALIZE (NMDS/PCoA)                          │
        │  "Do the groups LOOK different?"                        │
        │  If no visible pattern, statistical test will likely    │
        │  be non-significant too.                                │
        └────────────────────────────┬─────────────────────────────┘
                                     │
                                     ▼
        ┌──────────────────────────────────────────────────────────┐
        │  STEP 2: TEST SIGNIFICANCE (PERMANOVA)                  │
        │  "Is the difference STATISTICALLY SIGNIFICANT?"         │
        │  p < 0.05 means unlikely to be due to chance.           │
        └────────────────────────────┬─────────────────────────────┘
                                     │
                                     ▼
        ┌──────────────────────────────────────────────────────────┐
        │  STEP 3: CHECK EFFECT SIZE (R²)                         │
        │  "HOW MUCH of the variation does habitat explain?"      │
        │  A significant p-value with tiny R² = trivial effect.   │
        └────────────────────────────┬─────────────────────────────┘
                                     │
                                     ▼
        ┌──────────────────────────────────────────────────────────┐
        │  STEP 4: CHECK ASSUMPTIONS (betadisper)                 │
        │  "Is the test valid?"                                   │
        │  PERMANOVA assumes groups have similar spread.          │
        └────────────────────────────┬─────────────────────────────┘
                                     │
                                     ▼
        ┌──────────────────────────────────────────────────────────┐
        │  STEP 5: IDENTIFY KEY SPECIES (SIMPER)                  │
        │  "WHICH species drive the differences?"                 │
        │  Connects statistics back to biology.                   │
        └──────────────────────────────────────────────────────────┘

7.2 PERMANOVA: Testing for Community Differences

Why PERMANOVA Instead of ANOVA?

ANOVA tests one variable at a time (e.g., “Does richness differ?”)
PERMANOVA tests the entire community simultaneously (e.g., “Does species composition differ?”)

PERMANOVA = Permutational Multivariate Analysis of Variance

It works by:

Calculating distances between all sites (Bray-Curtis)
Testing whether within-group distances are smaller than between-group distances
Using permutations (shuffling labels) to assess significance

# PERMANOVA: Do beetle communities differ between habitats?
permanova_result <- adonis2(
  comm_matrix ~ habitat * landscape,  # Test habitat, landscape, and interaction
  data = env_data,
  method = "bray",                    # Bray-Curtis dissimilarity
  permutations = 999                  # 999 permutations for p-value
)

print(permanova_result)

7.2.1 Interpreting PERMANOVA Output

cat("\n=== PERMANOVA INTERPRETATION ===\n\n")

# Extract key values
r2_habitat <- permanova_result$R2[1]
f_habitat <- permanova_result$F[1]
p_habitat <- permanova_result$`Pr(>F)`[1]

cat("HABITAT EFFECT:\n")
cat("  F-value:", round(f_habitat, 2), "\n")
cat("  R² =", round(r2_habitat, 3), 
    "(", round(r2_habitat * 100, 1), "% of variation explained)\n")
cat("  p-value:", p_habitat, "\n\n")

# Significance interpretation
if (p_habitat < 0.001) {
  cat("  ★ Highly significant (p < 0.001)\n")
} else if (p_habitat < 0.01) {
  cat("  ★ Very significant (p < 0.01)\n")
} else if (p_habitat < 0.05) {
  cat("  ★ Significant (p < 0.05)\n")
} else {
  cat("  ✗ Not significant (p ≥ 0.05)\n")
}

# Effect size interpretation
cat("\nEFFECT SIZE INTERPRETATION:\n")
if (r2_habitat < 0.05) {
  cat("  Effect is TINY - statistically significant but ecologically trivial\n")
} else if (r2_habitat < 0.10) {
  cat("  Effect is SMALL - habitat explains a minor portion\n")
} else if (r2_habitat < 0.25) {
  cat("  Effect is MEDIUM - habitat is an important factor\n")
} else {
  cat("  Effect is LARGE - habitat is a major driver of community structure\n")
}

Effect Size (R²) Guide:

R²	Interpretation	Example
< 0.05	Tiny	Habitat explains less than 5% of variation
0.05-0.10	Small	Habitat is one of many factors
0.10-0.25	Medium	Habitat is an important driver
> 0.25	Large	Habitat is a dominant factor

Always report both p-value AND R²! A significant p-value with tiny R² means the effect is real but not very important.

7.3 Checking Assumptions: Dispersion

Why Check Dispersion?

PERMANOVA assumes that groups have similar within-group variation (dispersion).

If one habitat has highly variable communities and another is very consistent, PERMANOVA might detect this as a “difference” even if the average compositions are identical!

betadisper tests this assumption.

# Test for homogeneity of dispersion
dispersion <- betadisper(dist_bray, env_data$habitat)
dispersion_test <- permutest(dispersion, permutations = 999)

cat("\n=== DISPERSION TEST ===\n")
print(dispersion_test)

# Interpretation
if (dispersion_test$tab$`Pr(>F)`[1] > 0.05) {
  cat("\n✓ Dispersions are homogeneous (p > 0.05).\n")
  cat("  PERMANOVA assumption is met - results are reliable.\n")
} else {
  cat("\n⚠ Dispersions are UNEQUAL (p < 0.05)!\n")
  cat("  PERMANOVA results may be influenced by dispersion differences.\n")
  cat("  Interpret with caution.\n")
}

# Visualize dispersion
boxplot(dispersion, main = "Distance to Group Centroid by Habitat",
        xlab = "Habitat", ylab = "Distance to centroid")

7.4 SIMPER: Which Species Matter?

SIMPER = Similarity Percentages

PERMANOVA tells us communities differ, but not WHY.

SIMPER identifies which species contribute most to the differences.

This connects statistics back to biology!

# SIMPER analysis
simper_result <- simper(comm_matrix, env_data$habitat, permutations = 999)

cat("\n=== SIMPER RESULTS ===\n")
cat("Top species contributing to habitat differences:\n\n")
summary(simper_result)

Reading SIMPER Output:

Column	Meaning
average	Average contribution to dissimilarity
sd	Standard deviation of contribution
ratio	average/sd - consistency of contribution
cumsum	Cumulative sum (% explained so far)

Key indicators:

High average + low sd = Consistent differentiator
ratio > 1 = Reliable indicator species
Look at species contributing to ~70% cumsum

☕ BREAK: 13:30 - 13:45

8 Session 8: Interpretation & Synthesis

Time: 13:45 - 14:45 (60 minutes)

8.1 The Interpretation Framework

For Every Statistical Result, Ask:

WHAT did we find? (The statistical result)
HOW BIG is the effect? (Effect size, not just p-value)
WHY might this happen? (Ecological mechanisms)
SO WHAT? (Conservation/management implications)
WHAT NEXT? (Limitations, new questions)

8.2 Example Interpretation

Example Finding: Beetle communities differ significantly between habitats (PERMANOVA: R² = 0.32, p = 0.001, stress = 0.12)

1. WHAT did we find?

Beetle species composition differs between forest, grassland, and agricultural habitats. About 32% of the total variation in beetle communities is explained by habitat type alone.

2. HOW BIG is the effect?

This is a large effect (R² > 0.25). Habitat is a major driver of beetle community structure. The NMDS stress of 0.12 indicates the pattern is reliably represented in 2D.

3. WHY might this happen? (Ecological mechanisms)

Habitat filtering: Forest provides leaf litter, shade, stable moisture - conditions that favor certain beetle groups (many Carabidae, Staphylinidae)
Resource availability: Different prey, detritus, and microhabitats in each habitat
Microclimate: Forests buffer temperature extremes; grasslands experience wider fluctuations
Disturbance regime: Agriculture has regular disturbance (plowing, pesticides) that selects for mobile generalist species
Historical factors: Forest beetles may have limited dispersal ability, excluding them from agricultural matrices

4. SO WHAT? (Implications)

Conservation: Forest fragments in agricultural landscapes may be critical refugia for specialized beetle species
Pest management: Agricultural communities may lack natural enemies present in forests
Restoration: Simply recreating habitat structure may not restore beetle communities if source populations are absent
Monitoring: Beetles could serve as indicators of habitat quality

5. WHAT NEXT? (Limitations and new questions)

Limitations: - Single sampling season (seasonal turnover not captured) - Pitfall traps favor ground-active species (canopy beetles missed) - Morphospecies identification may underestimate true diversity - No data on habitat quality within types

New questions: - Which specific environmental variables (temperature, leaf litter depth, soil moisture) drive beetle distributions? - Does forest fragment size affect beetle community composition? - How quickly do beetle communities recover after habitat restoration? - Are there “spillover” effects at habitat edges?

8.3 Creating Summary Results

# Create results summary table
results_summary <- data.frame(
  Analysis = c(
    "Alpha Diversity (Shannon)",
    "NMDS Ordination",
    "PERMANOVA (Habitat)",
    "PERMANOVA (Landscape)", 
    "Dispersion Test"
  ),
  Statistic = c(
    paste("Kruskal χ² =", round(kruskal_result$statistic, 2)),
    paste("Stress =", round(nmds_result$stress, 3)),
    paste("F =", round(permanova_result$F[1], 2)),
    paste("F =", round(permanova_result$F[2], 2)),
    paste("F =", round(dispersion_test$tab$F[1], 2))
  ),
  Effect_Size = c(
    "See boxplot",
    "Visual separation",
    paste("R² =", round(permanova_result$R2[1], 3)),
    paste("R² =", round(permanova_result$R2[2], 3)),
    "NA"
  ),
  P_value = c(
    round(kruskal_result$p.value, 4),
    "NA",
    permanova_result$`Pr(>F)`[1],
    permanova_result$`Pr(>F)`[2],
    round(dispersion_test$tab$`Pr(>F)`[1], 4)
  ),
  Interpretation = c(
    ifelse(kruskal_result$p.value < 0.05, "Significant", "Not significant"),
    ifelse(nmds_result$stress < 0.2, "Good fit", "Poor fit"),
    ifelse(permanova_result$`Pr(>F)`[1] < 0.05, "Communities differ", "No difference"),
    ifelse(permanova_result$`Pr(>F)`[2] < 0.05, "Effect of landscape", "No effect"),
    ifelse(dispersion_test$tab$`Pr(>F)`[1] > 0.05, "Assumption met", "Assumption violated")
  )
)

cat("\n╔═══════════════════════════════════════════════════════════════╗\n")
cat("║                    RESULTS SUMMARY TABLE                       ║\n")
cat("╚═══════════════════════════════════════════════════════════════╝\n\n")
print(results_summary, row.names = FALSE)

8.4 Final Publication Figure

# Create comprehensive figure
p_final <- ggplot(nmds_scores, aes(x = NMDS1, y = NMDS2, 
                                    color = habitat, shape = landscape)) +
  geom_point(size = 5, alpha = 0.8) +
  stat_ellipse(aes(group = habitat), linetype = 2, linewidth = 1) +
  scale_color_brewer(palette = "Set1") +
  scale_shape_manual(values = c(16, 17)) +
  labs(
    title = "Beetle Community Composition Across Habitats",
    subtitle = sprintf("NMDS stress = %.3f | PERMANOVA: R² = %.2f, p = %.3f",
                       nmds_result$stress,
                       permanova_result$R2[1],
                       permanova_result$`Pr(>F)`[1]),
    color = "Habitat",
    shape = "Landscape"
  ) +
  theme_bw(base_size = 14) +
  theme(
    legend.position = "right",
    plot.title = element_text(face = "bold"),
    plot.subtitle = element_text(size = 10)
  ) +
  coord_equal()

print(p_final)

# Save
ggsave("figures/nmds_final.png", p_final, width = 10, height = 8, dpi = 300)

8.5 Complete Analysis Template

#============================================================================
# COMPLETE ANALYSIS SCRIPT TEMPLATE
# Copy and modify for your own analyses
#============================================================================

# 1. SETUP ----
rm(list = ls())
library(tidyverse)
library(vegan)
library(ape)
set.seed(42)

# 2. IMPORT DATA ----
raw_data <- read.csv("data/raw/sample_insect_data.csv")

# 3. EXPLORE AND CHOOSE FOCAL TAXA ----
# [Run decision framework from Session 4]

# 4. PREPARE DATA ----
comm_matrix <- raw_data %>%
  filter(order == "Coleoptera") %>%  # Your chosen focal group
  group_by(site, morphospecies) %>%
  summarise(abundance = sum(abundance), .groups = "drop") %>%
  pivot_wider(names_from = morphospecies, values_from = abundance, values_fill = 0) %>%
  column_to_rownames("site")

env_data <- raw_data %>%
  select(site, habitat, landscape) %>%
  distinct() %>%
  column_to_rownames("site")
env_data <- env_data[rownames(comm_matrix), ]

# 5. ALPHA DIVERSITY ----
alpha_div <- data.frame(
  site = rownames(comm_matrix),
  richness = specnumber(comm_matrix),
  shannon = diversity(comm_matrix, "shannon")
) %>%
  left_join(env_data %>% rownames_to_column("site"), by = "site")

kruskal.test(shannon ~ habitat, data = alpha_div)

# 6. BETA DIVERSITY ----
dist_bray <- vegdist(comm_matrix, method = "bray")
nmds <- metaMDS(comm_matrix, distance = "bray", k = 2, trymax = 100)
cat("NMDS Stress:", nmds$stress, "\n")

# 7. STATISTICAL TESTS ----
permanova <- adonis2(comm_matrix ~ habitat * landscape, 
                     data = env_data, method = "bray", permutations = 999)
print(permanova)

betadisper(dist_bray, env_data$habitat) %>% permutest()

simper(comm_matrix, env_data$habitat)

# 8. SAVE OUTPUTS ----
write.csv(alpha_div, "output/alpha_diversity.csv", row.names = FALSE)
ggsave("figures/nmds_plot.png", width = 8, height = 6, dpi = 300)

Wrap-up & Next Steps

Time: 14:45 - 15:00 (15 minutes)

What You Learned

Day 1:

✅ R basics and data import
✅ Data wrangling with tidyverse
✅ Choosing focal taxa - the decision framework
✅ Creating community matrices

Day 2:

✅ Diversity metrics and their ecological meaning
✅ Rarefaction for sampling effort
✅ NMDS vs PCoA - when to use which
✅ PERMANOVA - testing community differences
✅ SIMPER - identifying key species
✅ Ecological interpretation of results

Resources for Continued Learning

Books:

Numerical Ecology with R (Borcard, Gillet & Legendre) - The bible for this topic
R for Data Science (Wickham & Grolemund) - Free online: r4ds.had.co.nz

Online:

GUSTA ME multivariate guide: https://sites.google.com/site/mb3gustame/
vegan package tutorial: https://cran.r-project.org/web/packages/vegan/vignettes/

Packages to explore:

BiodiversityR - More biodiversity analysis tools
indicspecies - Indicator species analysis
iNEXT - Diversity extrapolation

Quick Reference Card

╔══════════════════════════════════════════════════════════════════════════════╗
║                    R FOR INSECT ECOLOGY - QUICK REFERENCE                    ║
╠══════════════════════════════════════════════════════════════════════════════╣
║                                                                              ║
║ DATA WRANGLING (dplyr)                                                       ║
║ ─────────────────────────────────────────────────────────────────────────    ║
║ filter(data, condition)          Keep rows matching condition                ║
║ select(data, columns)            Keep selected columns                       ║
║ mutate(data, new_col = ...)      Create new columns                          ║
║ group_by() %>% summarise()       Aggregate by groups                         ║
║ pivot_wider() / pivot_longer()   Reshape data                                ║
║                                                                              ║
║ DIVERSITY METRICS (vegan)                                                    ║
║ ─────────────────────────────────────────────────────────────────────────    ║
║ specnumber(x)                    Species richness                            ║
║ diversity(x, "shannon")          Shannon H'                                  ║
║ diversity(x, "simpson")          Simpson 1-D                                 ║
║ rarefy(x, sample = n)            Rarefied richness                           ║
║                                                                              ║
║ ORDINATION                                                                   ║
║ ─────────────────────────────────────────────────────────────────────────    ║
║ vegdist(x, "bray")               Bray-Curtis dissimilarity                   ║
║ metaMDS(x, k = 2)                NMDS (report stress!)                       ║
║ pcoa(dist)                       PCoA (report % variance)                    ║
║                                                                              ║
║ STATISTICAL TESTS                                                            ║
║ ─────────────────────────────────────────────────────────────────────────    ║
║ adonis2(matrix ~ factor)         PERMANOVA (report R² AND p!)                ║
║ betadisper(dist, group)          Check dispersion assumption                 ║
║ simper(matrix, group)            Identify key species                        ║
║                                                                              ║
║ INTERPRETATION GUIDE                                                         ║
║ ─────────────────────────────────────────────────────────────────────────    ║
║ NMDS Stress:  < 0.10 good | 0.10-0.20 okay | > 0.20 poor                     ║
║ PERMANOVA R²: < 0.05 tiny | 0.05-0.10 small | 0.10-0.25 med | > 0.25 large   ║
║                                                                              ║
║ DECISION CHECKLIST FOR FOCAL TAXA                                            ║
║ ─────────────────────────────────────────────────────────────────────────    ║
║ □ Total abundance ≥ 50?                                                      ║
║ □ Number of species ≥ 5?                                                     ║
║ □ Present in ≥ 50% of sites?                                                 ║
║ □ Shows habitat variation (CV > 30%)?                                        ║
║ □ Ecologically relevant for pitfall sampling?                                ║
║                                                                              ║
╚══════════════════════════════════════════════════════════════════════════════╝

Happy coding and happy bug hunting! 🐛

Workshop: Introduction to R Statistics for Insect Ecology

2-Day Intensive for Entomology Students: From Raw Data to Ecological Insights

Workshop Manual

11-12 February 2026