Workshop Overview

Who Is This Workshop For?

This workshop is designed for entomology students who are new to R and may need a refresher on ecological concepts related to biodiversity data analysis. No prior R experience is required!

By the end of this workshop, you will be able to:

  • Import and clean insect survey data
  • Decide which insect groups have β€œinteresting stories” worth analyzing
  • Calculate and interpret diversity metrics
  • Visualize community patterns
  • Test whether habitats differ in their insect communities
  • Interpret results in an ecological context

The Ecological Question Behind Everything

When we set pitfall traps across different habitats, we’re fundamentally asking:

β€œDo different environments support different insect communities, and why?”

This question connects to foundational ecological theories:

Key Ecological Concepts:

  • Niche Theory: Each species has specific environmental requirements (temperature, moisture, food). Species occur where conditions match their niche.

  • Habitat Filtering: The environment β€œfilters” which species can survive. Forest conditions filter out sun-loving species; agricultural disturbance filters out sensitive specialists.

  • Species-Area Relationship: Larger, more complex habitats typically support more species.

  • Indicator Species: Some species are sensitive to environmental conditions and can indicate habitat quality or disturbance level.

Our statistical analyses are tools to detect these ecological patterns and test whether the differences we observe are real or just due to chance.

The Complete Analysis Roadmap

Before writing any code, understand the journey from raw data to ecological insights:

╔═══════════════════════════════════════════════════════════════════════════════╗
β•‘                    COMPLETE ANALYSIS WORKFLOW                                  β•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•

                              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                              β”‚   RAW DATA      β”‚
                              β”‚  (Your pitfall  β”‚
                              β”‚   trap counts)  β”‚
                              β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                       β”‚
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚                              β–Ό                              β”‚
        β”‚  ╔═══════════════════════════════════════════════════════╗  β”‚
        β”‚  β•‘  PHASE 1: DATA PREPARATION (Day 1, Sessions 1-3)      β•‘  β”‚
        β”‚  ╠═══════════════════════════════════════════════════════╣  β”‚
        β”‚  β•‘  β€’ Import data into R                                 β•‘  β”‚
        β”‚  β•‘  β€’ Check for errors (typos, impossible values)        β•‘  β”‚
        β”‚  β•‘  β€’ Standardize names (taxonomy is messy!)             β•‘  β”‚
        β”‚  β•‘  β€’ Create community matrix (sites Γ— species)          β•‘  β”‚
        β”‚  β•‘                                                       β•‘  β”‚
        β”‚  β•‘  KEY QUESTION: "Is my data clean and trustworthy?"    β•‘  β”‚
        β”‚  β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•  β”‚
        β”‚                              β”‚                              β”‚
        β”‚                              β–Ό                              β”‚
        β”‚  ╔═══════════════════════════════════════════════════════╗  β”‚
        β”‚  β•‘  PHASE 2: DATA EXPLORATION (Day 1, Session 4)         β•‘  β”‚
        β”‚  ╠═══════════════════════════════════════════════════════╣  β”‚
        β”‚  β•‘  β€’ Summarize: How many species? How many individuals? β•‘  β”‚
        β”‚  β•‘  β€’ Explore: Which insect ORDERS are abundant?         β•‘  β”‚
        β”‚  β•‘  β€’ Decide: Which groups have "interesting stories"?   β•‘  β”‚
        β”‚  β•‘  β€’ Visualize: What patterns can we SEE?               β•‘  β”‚
        β”‚  β•‘                                                       β•‘  β”‚
        β”‚  β•‘  KEY QUESTION: "Which insect groups should I focus    β•‘  β”‚
        β”‚  β•‘                 my analysis on?"                      β•‘  β”‚
        β”‚  β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•  β”‚
        β”‚                              β”‚                              β”‚
   D    β”‚                              β–Ό                              β”‚
   A    β”‚  ╔═══════════════════════════════════════════════════════╗  β”‚
   Y    β”‚  β•‘  PHASE 3: DIVERSITY ANALYSIS (Day 2, Session 5)       β•‘  β”‚
        β”‚  ╠═══════════════════════════════════════════════════════╣  β”‚
   1    β”‚  β•‘  ALPHA DIVERSITY (within-site)                        β•‘  β”‚
        β”‚  β•‘  β€’ How many species at each site? (Richness)          β•‘  β”‚
   |    β”‚  β•‘  β€’ How diverse is each site? (Shannon, Simpson)       β•‘  β”‚
        β”‚  β•‘  β€’ Are differences due to sampling effort?            β•‘  β”‚
   D    β”‚  β•‘    (Rarefaction)                                      β•‘  β”‚
   A    β”‚  β•‘                                                       β•‘  β”‚
   Y    β”‚  β•‘  KEY QUESTION: "Which habitats are most diverse?"     β•‘  β”‚
        β”‚  β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•  β”‚
   2    β”‚                              β”‚                              β”‚
        β”‚                              β–Ό                              β”‚
        β”‚  ╔═══════════════════════════════════════════════════════╗  β”‚
        β”‚  β•‘  PHASE 4: COMMUNITY ANALYSIS (Day 2, Sessions 6-7)    β•‘  β”‚
        β”‚  ╠═══════════════════════════════════════════════════════╣  β”‚
        β”‚  β•‘  BETA DIVERSITY (between-sites)                       β•‘  β”‚
        β”‚  β•‘  β€’ How different are communities? (Distance matrix)   β•‘  β”‚
        β”‚  β•‘  β€’ Can we VISUALIZE the pattern? (NMDS/PCoA)          β•‘  β”‚
        β”‚  β•‘  β€’ Is the pattern SIGNIFICANT? (PERMANOVA)            β•‘  β”‚
        β”‚  β•‘  β€’ WHICH species drive differences? (SIMPER)          β•‘  β”‚
        β”‚  β•‘                                                       β•‘  β”‚
        β”‚  β•‘  KEY QUESTION: "Do habitats have different            β•‘  β”‚
        β”‚  β•‘                 communities? Which species matter?"   β•‘  β”‚
        β”‚  β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•  β”‚
        β”‚                              β”‚                              β”‚
        β”‚                              β–Ό                              β”‚
        β”‚  ╔═══════════════════════════════════════════════════════╗  β”‚
        β”‚  β•‘  PHASE 5: INTERPRETATION (Day 2, Session 8)           β•‘  β”‚
        β”‚  ╠═══════════════════════════════════════════════════════╣  β”‚
        β”‚  β•‘  β€’ What do the numbers MEAN ecologically?             β•‘  β”‚
        β”‚  β•‘  β€’ What MECHANISMS might explain the patterns?        β•‘  β”‚
        β”‚  β•‘  β€’ What are the LIMITATIONS?                          β•‘  β”‚
        β”‚  β•‘  β€’ What NEW QUESTIONS arise?                          β•‘  β”‚
        β”‚  β•‘                                                       β•‘  β”‚
        β”‚  β•‘  KEY QUESTION: "So what? Why does this matter?"       β•‘  β”‚
        β”‚  β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•  β”‚
        β”‚                              β”‚                              β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                       β–Ό
                              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                              β”‚  PUBLICATION /  β”‚
                              β”‚  THESIS CHAPTER β”‚
                              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Workshop Schedule

╔══════════════════════════════════════════════════════════════════════════════╗
β•‘  DAY 1: DATA FOUNDATIONS & EXPLORATION           9:00 - 15:00                β•‘
╠══════════════════════════════════════════════════════════════════════════════╣
β•‘  09:00 - 10:15  Session 1: R Basics & Importing Data               75 min    β•‘
β•‘  10:15 - 10:30  β˜• Break                                            15 min    β•‘
β•‘  10:30 - 11:30  Session 2: Data Wrangling with tidyverse           60 min    β•‘
β•‘  11:30 - 12:30  🍽️ Lunch                                            60 min   β•‘
β•‘  12:30 - 13:45  Session 3: Data Cleaning & Community Matrices      75 min    β•‘
β•‘  13:45 - 14:00  β˜• Break                                            15 min    β•‘
β•‘  14:00 - 15:00  Session 4: Choosing Your Focal Taxa                60 min    β•‘
╠══════════════════════════════════════════════════════════════════════════════╣
β•‘  DAY 2: ANALYSIS & INTERPRETATION                9:00 - 15:00                β•‘
╠══════════════════════════════════════════════════════════════════════════════╣
β•‘  09:00 - 10:00  Session 5: Diversity Metrics & Visualization       60 min    β•‘
β•‘  10:00 - 10:15  β˜• Break                                            15 min    β•‘
β•‘  10:15 - 11:15  Session 6: Ordination - Seeing Community Patterns  60 min    β•‘
β•‘  11:15 - 12:15  🍽️ Lunch                                            60 min   β•‘
β•‘  12:15 - 13:30  Session 7: Statistical Testing                     75 min    β•‘
β•‘  13:30 - 13:45  β˜• Break                                            15 min    β•‘
β•‘  13:45 - 14:45  Session 8: Interpretation & Synthesis              60 min    β•‘
β•‘  14:45 - 15:00  Wrap-up & Next Steps                               15 min    β•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•

Our Dataset

We’ll use sample_insect_data.csv - a pitfall trap survey of ground-dwelling arthropods.

About Pitfall Traps:

Pitfall traps are cups buried flush with the ground surface. Ground-active arthropods fall in and are collected. This method is:

  • Good for: Ground beetles (Carabidae), ants (Formicidae), spiders (Araneae), rove beetles (Staphylinidae)
  • Poor for: Flying insects, canopy-dwellers, sedentary species
  • Biased by: Trap activity (more active = more caught), body size, weather

Always consider sampling bias when interpreting results!

Column Description
site Site identifier (S01-S12)
habitat Habitat type (forest, grassland, agriculture)
landscape Landscape complexity (complex, simple)
order Insect order (Coleoptera, Hymenoptera, etc.)
family Insect family
morphospecies Species identifier
abundance Number of individuals

DAY 1: DATA FOUNDATIONS & EXPLORATION


1 Session 1: R Basics & Importing Data

Time: 09:00 - 10:15 (75 minutes)

1.1 Why R for Ecology?

Why do we use R instead of Excel?

  1. Reproducibility: Your analysis is saved as a script. Anyone (including future you) can re-run it exactly.

  2. Ecological packages: R has specialized packages like vegan designed specifically for community ecology.

  3. Handling complexity: Ecological data often has hundreds of species and complex designs. R handles this easily.

  4. Publication-quality figures: R creates beautiful, customizable plots ready for journals.

1.2 Loading Packages

# Packages are collections of functions that extend R's capabilities
# We load them at the start of EVERY session

library(tidyverse)    # Data wrangling and visualization
library(vegan)        # Community ecology analyses (diversity, ordination)
library(ape)          # Phylogenetic tools and PCoA ordination

# If you get "there is no package called 'x'", install it first:
# install.packages("x")

Why these packages?

  • tidyverse: Makes data manipulation intuitive with β€œverbs” like filter, select, mutate
  • vegan: The gold standard for community ecology - diversity indices, NMDS, PERMANOVA
  • ape: Provides PCoA ordination and phylogenetic analysis tools

1.3 Basic R Operations

# R as a calculator
2 + 3         # Addition
10 / 2        # Division
5^2           # Power (5 squared = 25)
sqrt(16)      # Square root

# Creating objects with <- (assignment operator)
species_count <- 42           # Store a number
site_name <- "Forest_A"       # Store text

# Vectors: collections of values (fundamental to R!)
abundance <- c(12, 45, 23, 8, 67, 34)
species <- c("Carabidae_sp1", "Carabidae_sp2", "Staphylinidae_sp1")

# Basic operations on vectors
length(abundance)        # How many elements? β†’ 6
sum(abundance)           # Total β†’ 189
mean(abundance)          # Average β†’ 31.5
max(abundance)           # Maximum β†’ 67
min(abundance)           # Minimum β†’ 8

# Extracting elements (indexing)
abundance[1]             # First element β†’ 12
abundance[c(1, 3, 5)]    # Elements 1, 3, and 5
abundance[abundance > 30] # Only values > 30 β†’ 45, 67, 34

1.4 Importing Data

# Import our main dataset
# Make sure the file is in your working directory or provide full path
insect_data <- read.csv("data/raw/sample_insect_data.csv")

# More robust import with options for common issues:
insect_data <- read.csv(
  "data/raw/sample_insect_data.csv",
  header = TRUE,                    # First row contains column names
  stringsAsFactors = FALSE,         # Keep text as text (not factors)
  na.strings = c("", "NA", "N/A", "-")  # Recognize these as missing data
)

1.5 Exploring Your Data

Why explore data immediately?

Before any analysis, you need to know:

  • Did the data import correctly?
  • Are there unexpected values or errors?
  • What is the structure of your data?

Never trust data blindly - always look at it first!

# First look at your data
head(insect_data)         # First 6 rows
tail(insect_data)         # Last 6 rows
View(insect_data)         # Spreadsheet view (capital V!)

# Structure and dimensions
str(insect_data)          # Column types and preview
dim(insect_data)          # Rows Γ— columns
nrow(insect_data)         # Number of rows
ncol(insect_data)         # Number of columns
names(insect_data)        # Column names

# Summary statistics
summary(insect_data)      # Statistical summary of each column

1.6 Checking for Problems

# Missing values (NA = "Not Available")
sum(is.na(insect_data))           # Total NAs in entire dataset
colSums(is.na(insect_data))       # NAs per column

# Check categorical columns for typos
unique(insect_data$habitat)       # What habitat values exist?
unique(insect_data$order)         # What insect orders are present?

# Frequency tables
table(insect_data$habitat)        # How many records per habitat?
table(insect_data$order)          # How many records per order?

# Check numeric ranges for impossible values
range(insect_data$abundance)      # Min and max abundance
summary(insect_data$abundance)    # Full summary

# Look for suspicious values
insect_data[insect_data$abundance < 0, ]  # Negative abundances?

β˜• BREAK: 10:15 - 10:30


2 Session 2: Data Wrangling with tidyverse

Time: 10:30 - 11:30 (60 minutes)

2.1 The Pipe Operator: Your Best Friend

Why use pipes (%>%)?

Pipes make code readable by turning nested operations into a step-by-step recipe. Read %>% as β€œTHEN”.

Instead of: β€œBake(Mix(Measure(flour, sugar), eggs))”

Write: β€œTake flour and sugar, THEN measure, THEN mix with eggs, THEN bake”

library(tidyverse)

# Without pipe - confusing nested functions:
round(mean(sqrt(c(1, 4, 9, 16))), 2)

# With pipe - clear step-by-step:
c(1, 4, 9, 16) %>%    # Start with these numbers
  sqrt() %>%           # THEN take square root
  mean() %>%           # THEN calculate mean
  round(2)             # THEN round to 2 decimals

# Keyboard shortcut: Ctrl + Shift + M (Windows) or Cmd + Shift + M (Mac)

2.2 The Core dplyr Verbs

Think of these as actions you perform on your data:

Verb Action Example Use
filter() Keep rows that match a condition Keep only forest sites
select() Keep or remove columns Keep only site, species, abundance
mutate() Create new columns Calculate log-transformed abundance
arrange() Sort rows Sort by abundance (high to low)
summarise() Collapse to summary statistics Calculate mean abundance
group_by() Group data for operations Calculate mean BY habitat

2.2.1 filter(): Keep Rows Matching Conditions

# Keep only forest sites
insect_data %>%
  filter(habitat == "forest")

# Multiple conditions with AND (comma or &)
insect_data %>%
  filter(habitat == "forest", abundance > 10)

# Multiple options with OR (|) or %in%
insect_data %>%
  filter(habitat %in% c("forest", "grassland"))

# Exclude with !=
insect_data %>%
  filter(habitat != "agriculture")

2.2.2 select(): Choose Columns

# Keep specific columns
insect_data %>%
  select(site, habitat, morphospecies, abundance)

# Remove columns with minus sign
insect_data %>%
  select(-family)

# Select helpers for patterns
insect_data %>%
  select(starts_with("site"))     # Columns starting with "site"

insect_data %>%
  select(contains("species"))     # Columns containing "species"

2.2.3 mutate(): Create New Columns

insect_data %>%
  mutate(
    # Log-transform abundance (common in ecology due to right-skewed distributions)
    log_abundance = log(abundance + 1),  # +1 because log(0) is undefined
    
    # Create combined categories
    habitat_landscape = paste(habitat, landscape, sep = "_")
  )

2.2.4 group_by() + summarise(): The Power Combination

Why is this so powerful?

In ecology, we almost always want to compare GROUPS (habitats, treatments, sites). group_by() tells R to do calculations separately for each group.

# Calculate summary statistics BY habitat
insect_data %>%
  group_by(habitat) %>%
  summarise(
    total_abundance = sum(abundance),
    n_species = n_distinct(morphospecies),
    n_sites = n_distinct(site),
    mean_abundance = mean(abundance),
    .groups = "drop"  # Clean up grouping after summarise
  )

# Multiple grouping variables
insect_data %>%
  group_by(habitat, order) %>%
  summarise(
    total_abundance = sum(abundance),
    n_species = n_distinct(morphospecies),
    .groups = "drop"
  )

2.3 Reshaping Data: Wide vs Long

Ecological Data Formats:

  • Long format: One row per observation (site-species-abundance). Good for: data entry, plotting, dplyr operations.

  • Wide format (Community Matrix): Rows = sites, Columns = species. Required for: vegan package, ordination, diversity indices.

Most raw data is long; most analyses need wide. You’ll convert between them constantly!

# LONG TO WIDE: Create a community matrix
species_wide <- insect_data %>%
  group_by(site, morphospecies) %>%
  summarise(abundance = sum(abundance), .groups = "drop") %>%
  pivot_wider(
    names_from = morphospecies,    # Species become columns
    values_from = abundance,        # Cell values are abundances
    values_fill = 0                 # Missing = 0 (species absent)
  )

head(species_wide)

# WIDE TO LONG: For plotting
species_long <- species_wide %>%
  pivot_longer(
    cols = -site,             # All columns except site
    names_to = "species",     # Column names go to "species"
    values_to = "abundance"   # Values go to "abundance"
  )

🍽️ LUNCH: 11:30 - 12:30


3 Session 3: Data Cleaning & Community Matrices

Time: 12:30 - 13:45 (75 minutes)

3.1 Why Data Cleaning Matters

β€œGarbage In, Garbage Out”

Taxonomic data is notoriously messy. Common problems:

  • Inconsistent spelling: β€œCarabidae sp1” vs β€œCarabidae_sp1” vs β€œcarabidae sp. 1”
  • Case sensitivity: β€œForest” vs β€œforest” are treated as different!
  • Trailing spaces: β€œforest” (with space) β‰  β€œforest”
  • Impossible values: Negative abundances, abundances of 999999

Your cleaning decisions affect your conclusions. Document what you did!

# Check for problems first
unique(insect_data$habitat)  # Look for typos, case issues

# Common cleaning operations
clean_data <- insect_data %>%
  mutate(
    # Standardize text: lowercase and remove extra spaces
    habitat = tolower(trimws(habitat)),
    order = tolower(trimws(order)),
    
    # Fix specific typos (if any exist)
    habitat = case_when(
      habitat == "forrest" ~ "forest",    # Common typo
      habitat == "grasland" ~ "grassland",
      TRUE ~ habitat                       # Keep all others as-is
    )
  ) %>%
  # Remove impossible or missing values
  filter(
    abundance >= 0,        # No negative abundances
    !is.na(abundance),     # No missing abundances
    !is.na(morphospecies)  # No missing species IDs
  )

# Verify cleaning worked
unique(clean_data$habitat)

3.2 Creating the Community Matrix

What is a Community Matrix?

The community matrix is the fundamental data structure in community ecology:

              Species_A  Species_B  Species_C  Species_D
Site_1           12          5          0          8
Site_2            0         23         15          2
Site_3            7          0          0         45
  • Rows = Sampling units (sites, traps, plots)
  • Columns = Taxa (species, morphospecies)
  • Values = Abundance (or presence/absence: 0/1)

This matrix is what ordination and diversity analyses use!

# For this workshop, we'll focus on beetles (Coleoptera)
# We'll learn HOW to choose focal taxa in Session 4

comm_matrix <- insect_data %>%
  filter(order == "Coleoptera") %>%                    # Focus on beetles
  group_by(site, morphospecies) %>%
  summarise(abundance = sum(abundance), .groups = "drop") %>%
  pivot_wider(
    names_from = morphospecies,
    values_from = abundance,
    values_fill = 0    # CRITICAL: Missing species = 0, not NA
  ) %>%
  column_to_rownames("site")  # Site names become row names

# Check the result
dim(comm_matrix)  # Sites Γ— Species
head(comm_matrix[, 1:5])  # First 5 species

Why values_fill = 0?

In ecology, if we didn’t record a species at a site, we assume it was absent (abundance = 0), not β€œunknown” (NA).

This is an important assumption! It assumes: - Equal sampling effort across sites - Species detection is reliable

If sampling effort varied, you should use rarefaction (Session 5).

3.3 Creating Environmental Data

# Environmental data: one row per site with habitat info
env_data <- insect_data %>%
  select(site, habitat, landscape) %>%
  distinct() %>%                           # One row per site
  column_to_rownames("site")               # Site names as row names

# CRITICAL: Ensure rows match between matrices!
env_data <- env_data[rownames(comm_matrix), , drop = FALSE]
# rownames(comm_matrix): Extracts the unique IDs (like "S01", "S02") from your insect data.
# env_data[..., ]: Selects and reorders rows in the environmental table based on those IDs
# , (The blank space):  Tells R to keep all columns (e.g., habitat, elevation, distance).
# drop = FALSE --   A "defensive" command. It ensures that even if you only have one environmental variable, R keeps the data as a data frame instead of turning it into a simple list (vector)..


# Verify alignment (MUST be TRUE!)
all(rownames(comm_matrix) == rownames(env_data))

β˜• BREAK: 13:45 - 14:00


4 Session 4: Choosing Your Focal Taxa

Time: 14:00 - 15:00 (60 minutes)

The Central Question of This Session:

Your pitfall traps caught multiple insect orders (Coleoptera, Hymenoptera, Araneae, etc.). Which groups should you analyze in detail?

Not all groups are equal! Some have ecological stories to tell; others are just noise.

4.1 The Philosophy of Focal Taxa Selection

Why Not Analyze Everything?

  1. Statistical power: Groups with few individuals give unreliable results
  2. Ecological relevance: Not all groups respond to habitat differences
  3. Sampling bias: Pitfall traps don’t catch all groups equally
  4. Focused story: Papers are clearer when they focus on specific groups

Good focal taxa have:

  • Sufficient abundance (β‰₯50 individuals total)
  • Multiple species (β‰₯5 morphospecies)
  • Presence across most sites
  • Interesting variation between habitats
  • Ecological relevance to your question

4.2 Decision Framework for Choosing Focal Taxa

╔═══════════════════════════════════════════════════════════════════════════════╗
β•‘               DECISION FLOWCHART: CHOOSING FOCAL INSECT GROUPS                β•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•

For each insect ORDER in your data:
            β”‚
            β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚  CRITERION 1: ABUNDANCE       β”‚
    β”‚  Total individuals β‰₯ 50?      β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚
         NO ◄───────┴───────► YES
          β”‚                    β”‚
          β–Ό                    β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  EXCLUDE        β”‚   β”‚  CRITERION 2: DIVERSITY       β”‚
β”‚  Too few data   β”‚   β”‚  Number of morphospecies β‰₯ 5? β”‚
β”‚  for statistics β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                   β”‚
                           NO ◄───────┴───────► YES
                            β”‚                    β”‚
                            β–Ό                    β–Ό
                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                β”‚  POSSIBLE but   β”‚   β”‚  CRITERION 3: OCCURRENCE      β”‚
                β”‚  Limited        β”‚   β”‚  Present in β‰₯ 50% of sites?   β”‚
                β”‚  (can only      β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                β”‚  analyze abund- β”‚                   β”‚
                β”‚  ance, not      β”‚        NO ◄───────┴───────► YES
                β”‚  diversity)     β”‚         β”‚                    β”‚
                β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β–Ό                    β–Ό
                               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                               β”‚  CAUTION        β”‚   β”‚  CRITERION 4: VARIATION       β”‚
                               β”‚  May be habitat β”‚   β”‚  Shows differences between    β”‚
                               β”‚  specialist or  β”‚   β”‚  habitats? (CV > 30%)         β”‚
                               β”‚  rare           β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                               β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                   β”‚
                                                          NO ◄───────┴───────► YES
                                                          β”‚                    β”‚
                                                          β–Ό                    β–Ό
                                              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                              β”‚  POSSIBLE       β”‚   β”‚  EXCELLENT      β”‚
                                              β”‚  May be         β”‚   β”‚  FOCAL TAXON!   β”‚
                                              β”‚  generalist     β”‚   β”‚                 β”‚
                                              β”‚  (still         β”‚   β”‚  High priority  β”‚
                                              β”‚  interesting!)  β”‚   β”‚  for analysis   β”‚
                                              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

4.3 Implementing the Decision Framework

4.3.1 Step 1: Overall Summary

# What do we have?
cat("=== DATASET OVERVIEW ===\n")
cat("Total records:", nrow(insect_data), "\n")
cat("Total individuals:", sum(insect_data$abundance), "\n")
cat("Number of sites:", n_distinct(insect_data$site), "\n")
cat("Number of habitats:", n_distinct(insect_data$habitat), "\n")
cat("Number of orders:", n_distinct(insect_data$order), "\n")
cat("Number of morphospecies:", n_distinct(insect_data$morphospecies), "\n")

# How are individuals distributed across orders?
insect_data %>%
  group_by(order) %>%
  summarise(
    total_abundance = sum(abundance),
    percent_of_total = round(sum(abundance) / sum(insect_data$abundance) * 100, 1)
  ) %>%
  arrange(desc(total_abundance))

4.3.2 Step 2: Evaluate Each Order

# Calculate all criteria for each order
order_evaluation <- insect_data %>%
  group_by(order) %>%
  summarise(
    # Criterion 1: Abundance
    total_abundance = sum(abundance),
    
    # Criterion 2: Diversity
    n_morphospecies = n_distinct(morphospecies),
    
    # Criterion 3: Occurrence
    n_sites_present = n_distinct(site[abundance > 0]),
    pct_sites = round(n_distinct(site[abundance > 0]) / n_distinct(insect_data$site) * 100, 0),
    
    .groups = "drop"
  ) %>%
  arrange(desc(total_abundance))

print(order_evaluation)

# Visualize: Which orders are most abundant?
ggplot(order_evaluation, aes(x = reorder(order, total_abundance), 
                              y = total_abundance)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = paste0("n=", n_morphospecies, " spp")), 
            hjust = -0.1, size = 3) +
  coord_flip() +
  labs(
    x = "Order",
    y = "Total Abundance",
    title = "Which Insect Orders Dominate Your Samples?",
    subtitle = "Numbers show morphospecies count"
  ) +
  theme_bw()

4.3.3 Step 3: Calculate Habitat Variation

# For Criterion 4: Do abundances vary between habitats?
habitat_variation <- insect_data %>%
  group_by(order, habitat) %>%
  summarise(abundance = sum(abundance), .groups = "drop") %>%
  group_by(order) %>%
  summarise(
    # Coefficient of Variation: higher = more variation between habitats
    cv_abundance = round(sd(abundance) / mean(abundance) * 100, 1),
    # Which habitat has most individuals?
    dominant_habitat = habitat[which.max(abundance)],
    # How much more in best vs worst habitat?
    max_min_ratio = round(max(abundance) / (min(abundance) + 1), 1),
    .groups = "drop"
  )

print(habitat_variation)

Interpreting Coefficient of Variation (CV):

  • CV < 30%: Low variation - this group is similar across habitats (generalist)
  • CV 30-50%: Moderate variation - some habitat preference
  • CV > 50%: High variation - strong habitat specialist

Both patterns are ecologically interesting! Generalists tell us habitats are functionally similar for that group; specialists show habitat filtering.

4.3.4 Step 4: Score and Rank Orders

# Combine criteria and score each order
focal_taxa_scores <- order_evaluation %>%
  left_join(habitat_variation, by = "order") %>%
  mutate(
    # Score each criterion (0-3 points each)
    score_abundance = case_when(
      total_abundance >= 100 ~ 3,
      total_abundance >= 50 ~ 2,
      total_abundance >= 30 ~ 1,
      TRUE ~ 0
    ),
    score_diversity = case_when(
      n_morphospecies >= 10 ~ 3,
      n_morphospecies >= 5 ~ 2,
      n_morphospecies >= 3 ~ 1,
      TRUE ~ 0
    ),
    score_occurrence = case_when(
      pct_sites >= 80 ~ 3,
      pct_sites >= 50 ~ 2,
      pct_sites >= 30 ~ 1,
      TRUE ~ 0
    ),
    score_variation = case_when(
      cv_abundance >= 50 ~ 3,
      cv_abundance >= 30 ~ 2,
      cv_abundance >= 15 ~ 1,
      TRUE ~ 0
    ),
    
    # Total score (max = 12)
    total_score = score_abundance + score_diversity + score_occurrence + score_variation,
    
    # Final recommendation
    recommendation = case_when(
      total_score >= 10 ~ "β˜…β˜…β˜… Highly recommended",
      total_score >= 7 ~ "β˜…β˜… Recommended",
      total_score >= 4 ~ "β˜… Possible with caution",
      TRUE ~ "βœ— Not recommended"
    )
  ) %>%
  arrange(desc(total_score))

# Display decision table
cat("\n========================================\n")
cat("   FOCAL TAXA RECOMMENDATION TABLE\n")
cat("========================================\n\n")

focal_taxa_scores %>%
  select(order, total_abundance, n_morphospecies, pct_sites, 
         cv_abundance, total_score, recommendation) %>%
  print(n = Inf)

4.3.5 Step 5: Visualize the Decision

# Create visual summary of recommendations
ggplot(focal_taxa_scores, aes(x = reorder(order, total_score), 
                               y = total_score,
                               fill = recommendation)) +
  geom_col() +
  geom_hline(yintercept = c(4, 7, 10), linetype = "dashed", alpha = 0.5) +
  coord_flip() +
  scale_fill_manual(values = c(
    "β˜…β˜…β˜… Highly recommended" = "#27ae60",
    "β˜…β˜… Recommended" = "#3498db",
    "β˜… Possible with caution" = "#f39c12",
    "βœ— Not recommended" = "#95a5a6"
  )) +
  labs(
    x = "Order",
    y = "Suitability Score (max = 12)",
    title = "Which Insect Groups Should You Analyze?",
    subtitle = "Based on abundance, diversity, occurrence, and habitat variation",
    fill = "Recommendation"
  ) +
  theme_bw() +
  theme(legend.position = "bottom")

4.4 Ecological Relevance of Common Pitfall Groups

Common Pitfall Trap Groups and Their Ecological Value:

Group Pitfall Suitability Ecological Role Indicator Value
Carabidae (ground beetles) β˜…β˜…β˜… Excellent Predators, seed eaters High - respond to disturbance, moisture
Formicidae (ants) β˜…β˜…β˜… Excellent Ecosystem engineers High - sensitive to temperature, disturbance
Araneae (spiders) β˜…β˜… Good Predators High - indicate prey availability
Staphylinidae (rove beetles) β˜…β˜… Good Predators, decomposers Moderate - very diverse
Scarabaeidae (dung beetles) β˜…β˜… Good Decomposers High - depend on mammal presence
Orthoptera (grasshoppers) β˜… Moderate Herbivores Moderate - vegetation-dependent
Diptera (flies) βœ— Poor Various Low - mostly fly away from traps
Lepidoptera (moths) βœ— Poor Herbivores, pollinators Low - rarely caught in pitfalls

4.5 Preview: Patterns in Your Chosen Group

# Once you've chosen a focal group (e.g., Coleoptera), preview the patterns

# Create community matrix for chosen group
focal_matrix <- insect_data %>%
  filter(order == "Coleoptera") %>%  # Replace with your chosen order
  group_by(site, morphospecies) %>%
  summarise(abundance = sum(abundance), .groups = "drop") %>%
  pivot_wider(names_from = morphospecies, values_from = abundance, values_fill = 0) %>%
  column_to_rownames("site")

# Get environmental data
focal_env <- insect_data %>%
  select(site, habitat, landscape) %>%
  distinct() %>%
  column_to_rownames("site")
focal_env <- focal_env[rownames(focal_matrix), ]

# Quick NMDS preview (we'll learn this properly tomorrow!)
library(vegan)
set.seed(123)
nmds_preview <- metaMDS(focal_matrix, distance = "bray", k = 2, trymax = 50)

# Plot preview
nmds_scores <- as.data.frame(scores(nmds_preview, display = "sites"))
nmds_scores$habitat <- focal_env$habitat

ggplot(nmds_scores, aes(x = NMDS1, y = NMDS2, color = habitat)) +
  geom_point(size = 4) +
  stat_ellipse(linetype = 2) +
  labs(
    title = "Preview: Do Beetle Communities Differ by Habitat?",
    subtitle = paste("NMDS stress =", round(nmds_preview$stress, 3),
                     "- We'll interpret this properly tomorrow!"),
    color = "Habitat"
  ) +
  scale_color_brewer(palette = "Set1") +
  theme_bw() +
  coord_equal()

End of Day 1 Checkpoint:

Before tomorrow, you should have:

  1. βœ… Imported and cleaned your data
  2. βœ… Explored which insect orders are present
  3. βœ… Applied the decision framework to choose focal taxa
  4. βœ… Created a community matrix for your focal group
  5. βœ… Documented your focal taxa choice and rationale

Objects we’ll use tomorrow:

  • insect_data - The raw data
  • comm_matrix - Community matrix for your focal group
  • env_data - Environmental data matching the community matrix

END OF DAY 1


DAY 2: ANALYSIS & INTERPRETATION


5 Session 5: Diversity Metrics & Visualization

Time: 09:00 - 10:00 (60 minutes)

5.1 Day 2 Setup

# Load packages
library(tidyverse)
library(vegan)
library(ape)
set.seed(123)

# If starting fresh, recreate from Day 1:
insect_data <- read.csv("data/raw/sample_insect_data.csv")

# Community matrix (using beetles as our focal group)
comm_matrix <- insect_data %>%
  filter(order == "Coleoptera") %>%
  group_by(site, morphospecies) %>%
  summarise(abundance = sum(abundance), .groups = "drop") %>%
  pivot_wider(names_from = morphospecies, values_from = abundance, values_fill = 0) %>%
  column_to_rownames("site")

# Environmental data
env_data <- insect_data %>%
  select(site, habitat, landscape) %>%
  distinct() %>%
  column_to_rownames("site")
env_data <- env_data[rownames(comm_matrix), , drop = FALSE]

# Verify alignment
stopifnot(all(rownames(comm_matrix) == rownames(env_data)))
cat("βœ“ Data loaded successfully!\n")
cat("  Community matrix:", nrow(comm_matrix), "sites Γ—", ncol(comm_matrix), "species\n")

5.2 Understanding Diversity: A Tale of Two Communities

Why Do We Need Multiple Diversity Metrics?

Consider two beetle communities, each with 100 individuals and 10 species:

Community A (Uneven):

Species 1: 91 individuals  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
Species 2:  1 individual   β–ˆ
Species 3:  1 individual   β–ˆ
... (7 more species with 1 each)

Community B (Even):

Species 1: 10 individuals  β–ˆβ–ˆβ–ˆβ–ˆ
Species 2: 10 individuals  β–ˆβ–ˆβ–ˆβ–ˆ
Species 3: 10 individuals  β–ˆβ–ˆβ–ˆβ–ˆ
... (7 more species with 10 each)

Both have richness = 10, but they’re very different! - Community A is dominated by one species (low evenness) - Community B is balanced (high evenness)

Different diversity indices capture different aspects of this structure.

5.3 Alpha Diversity Metrics

# SPECIES RICHNESS (S)
# Simply: How many species are present?
richness <- specnumber(comm_matrix)

# SHANNON DIVERSITY (H')
# Combines richness AND evenness
# Higher = more diverse OR more even
shannon <- diversity(comm_matrix, index = "shannon")

# SIMPSON DIVERSITY (1-D)
# Probability that two random individuals are different species
# Higher = more diverse
simpson <- diversity(comm_matrix, index = "simpson")

# EVENNESS (Pielou's J)
# How evenly are individuals distributed among species?
# J = H' / ln(S)
# J = 1 means perfectly even; J β†’ 0 means one species dominates
evenness <- shannon / log(richness)

# Compile results
alpha_diversity <- data.frame(
  site = rownames(comm_matrix),
  richness = richness,
  shannon = round(shannon, 3),
  simpson = round(simpson, 3),
  evenness = round(evenness, 3)
) %>%
  left_join(env_data %>% rownames_to_column("site"), by = "site")

print(alpha_diversity)

When to Use Each Index:

Index Best For Interpretation
Richness Simple comparisons, presence/absence data β€œHow many species?”
Shannon General diversity, most publications β€œHow uncertain is species identity?”
Simpson When dominance matters β€œWhat’s the chance two individuals differ?”
Evenness Understanding community structure β€œIs one species dominating?”

For most ecological studies, report at least richness AND Shannon.

5.4 Rarefaction: Correcting for Sampling Effort

The Sampling Problem:

You collected 500 beetles at Site A (found 25 species) and 100 beetles at Site B (found 15 species).

Is Site A really more diverse? Or did you just sample more?

Rarefaction asks: β€œIf I had caught only 100 beetles at Site A, how many species would I expect?”

# Check sample sizes (total individuals per site)
sample_sizes <- rowSums(comm_matrix)
cat("Sample sizes per site:\n")
print(sample_sizes)
cat("\nRange:", min(sample_sizes), "-", max(sample_sizes), "individuals\n")

# If sample sizes differ greatly, rarefaction is important!
if (max(sample_sizes) / min(sample_sizes) > 2) {
  cat("⚠️ Sample sizes vary by more than 2x. Rarefaction recommended!\n")
}

# Rarefy to smallest sample size
min_n <- min(sample_sizes)
rarefied_richness <- rarefy(comm_matrix, sample = min_n)

# Compare raw vs rarefied
comparison <- data.frame(
  site = rownames(comm_matrix),
  raw_richness = richness,
  sample_size = sample_sizes,
  rarefied_richness = round(rarefied_richness, 1),
  difference = round(richness - rarefied_richness, 1)
)
print(comparison)

# Rarefaction curves
rarecurve(comm_matrix, 
          step = 5,
          col = rainbow(nrow(comm_matrix)),
          lwd = 2,
          xlab = "Number of Individuals Sampled",
          ylab = "Species Richness",
          main = "Rarefaction Curves: Are Sites Adequately Sampled?")
abline(v = min_n, lty = 2, col = "gray")
legend("bottomright", legend = rownames(comm_matrix),
       col = rainbow(nrow(comm_matrix)), lwd = 2, cex = 0.7)

Reading Rarefaction Curves:

  • Steep curve: Still discovering new species (undersampled)
  • Flattening curve: Approaching true richness (well-sampled)
  • Compare at same x-value: Fair comparison of richness

If curves are still steep at your sample sizes, you probably haven’t captured the full community!

5.5 Visualizing and Testing Diversity Differences

# Visualize diversity by habitat
ggplot(alpha_diversity, aes(x = habitat, y = shannon, fill = habitat)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(width = 0.1, size = 3, alpha = 0.7) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    x = "Habitat",
    y = "Shannon Diversity (H')",
    title = "Does Beetle Diversity Differ Between Habitats?"
  ) +
  theme_bw() +
  theme(legend.position = "none")

Why Kruskal-Wallis Test?

We use Kruskal-Wallis instead of ANOVA because:

  1. Small sample sizes per group (often n < 10)
  2. Diversity data often non-normal
  3. It’s a β€œnon-parametric” test - fewer assumptions

Kruskal-Wallis tests whether the RANKS differ between groups.

# Statistical test: Do habitats differ in diversity?
kruskal_result <- kruskal.test(shannon ~ habitat, data = alpha_diversity)
print(kruskal_result)

# Interpretation
cat("\n=== INTERPRETATION ===\n")
if (kruskal_result$p.value < 0.05) {
  cat("βœ“ Significant difference in Shannon diversity between habitats (p =",
      round(kruskal_result$p.value, 4), ")\n")
  
  # Post-hoc pairwise tests
  cat("\nPairwise comparisons:\n")
  pairwise.wilcox.test(alpha_diversity$shannon, alpha_diversity$habitat,
                       p.adjust.method = "bonferroni")
} else {
  cat("βœ— No significant difference detected (p =", 
      round(kruskal_result$p.value, 4), ")\n")
  cat("  This doesn't mean habitats are equal - we may lack statistical power.\n")
}

β˜• BREAK: 10:00 - 10:15


6 Session 6: Ordination - Seeing Community Patterns

Time: 10:15 - 11:15 (60 minutes)

6.1 Why Ordination?

The Challenge of Multivariate Data:

Your community matrix has many species (columns). Each site exists in a high-dimensional β€œspecies space.”

Ordination reduces these many dimensions to 2-3 that we can visualize, while preserving the important patterns.

Key insight: Sites that are close together in an ordination plot have similar species compositions. Sites far apart are different.

6.2 The Foundation: Distance Matrices

# First, we calculate how DIFFERENT each pair of sites is
# Bray-Curtis dissimilarity is the standard for abundance data

dist_bray <- vegdist(comm_matrix, method = "bray")

# View as matrix (first few sites)
cat("Bray-Curtis Dissimilarity Matrix:\n")
print(round(as.matrix(dist_bray)[1:4, 1:4], 3))

Why Bray-Curtis Dissimilarity?

Dissimilarity Formula Idea Best For
Bray-Curtis Compares shared vs total abundance Abundance data (most common)
Jaccard Compares shared vs total species Presence/absence data
Euclidean Straight-line distance Continuous environmental data

Bray-Curtis is preferred because:

  1. Range: 0 (identical) to 1 (completely different)
  2. Two sites with no species in common = maximum dissimilarity
  3. Joint absences don’t make sites similar (unlike Euclidean)

6.3 NMDS vs PCoA: Which to Choose?

╔═══════════════════════════════════════════════════════════════════════════════╗
β•‘                    ORDINATION METHOD DECISION TREE                             β•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•

                    What type of data do you have?
                                    β”‚
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚                     β”‚                     β”‚
              β–Ό                     β–Ό                     β–Ό
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚ Abundanceβ”‚         β”‚ Presence/β”‚         β”‚ Environ- β”‚
        β”‚ data     β”‚         β”‚ Absence  β”‚         β”‚ mental   β”‚
        β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜
             β”‚                    β”‚                    β”‚
             β–Ό                    β–Ό                    β–Ό
        Bray-Curtis          Jaccard            Consider PCA
        distance             distance           (different method)
             β”‚                    β”‚
             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                        β”‚
                        β–Ό
        What matters most for your question?
                        β”‚
           β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
           β”‚                         β”‚
           β–Ό                         β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚ Preserving   β”‚         β”‚ Knowing %    β”‚
    β”‚ RELATIVE     β”‚         β”‚ VARIANCE     β”‚
    β”‚ positions    β”‚         β”‚ explained    β”‚
    β”‚ (which sites β”‚         β”‚ by each axis β”‚
    β”‚ are similar) β”‚         β”‚              β”‚
    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
           β”‚                        β”‚
           β–Ό                        β–Ό
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”
      β”‚  NMDS  β”‚              β”‚  PCoA  β”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           β”‚                        β”‚
           β–Ό                        β–Ό
    "The ecology                "Microbiome
     standard"                   standard"
    Most published               Reports %
    ecology studies              variance

Quick Decision Guide:

Feature NMDS PCoA
What it preserves Rank order of distances Actual distances
Axes meaning Arbitrary (no % variance) % variance explained
Reproducibility May vary (use set.seed!) Always the same
Best for Community ecology When % variance matters
Stress/Quality Report stress value Report eigenvalues

For most ecological studies: Use NMDS. It’s the community standard and handles ecological data well.

Use PCoA when: You need to report % variance explained, or for microbiome data or when your data mostly consist of singletons or 0.

6.4 NMDS: Non-metric Multidimensional Scaling

# Run NMDS
set.seed(123)  # For reproducibility - NMDS uses random starts!

nmds_result <- metaMDS(
  comm_matrix,
  distance = "bray",     # Bray-Curtis dissimilarity
  k = 2,                 # 2 dimensions for visualization
  trymax = 100           # Try 100 random starting positions
)

# Check the stress value
cat("\n=== NMDS RESULTS ===\n")
cat("Stress value:", round(nmds_result$stress, 4), "\n")

Interpreting NMDS Stress:

Stress measures how well the 2D plot represents the actual multi-dimensional distances.

Stress Quality Interpretation
< 0.05 Excellent Almost perfect representation
0.05 - 0.10 Good Safe to interpret
0.10 - 0.20 Acceptable Interpret with some caution
0.20 - 0.30 Poor Distances may be misleading
> 0.30 Very poor Do not interpret!

Always report stress in publications!

# Interpret stress value
if (nmds_result$stress < 0.10) {
  cat("βœ“ Good fit! The 2D plot accurately represents community distances.\n")
} else if (nmds_result$stress < 0.20) {
  cat("⚠ Acceptable fit. Interpret patterns with some caution.\n")
} else {
  cat("βœ— Poor fit. Consider using 3 dimensions (k=3) or different approach.\n")
}

# Stressplot shows how well ordination distances match original distances
stressplot(nmds_result, main = "Shepard Diagram: How Well Does NMDS Fit?")

6.5 Creating NMDS Plots

# Extract site scores
nmds_scores <- as.data.frame(scores(nmds_result, display = "sites"))
nmds_scores$site <- rownames(nmds_scores)

# Add environmental data
nmds_scores <- nmds_scores %>%
  left_join(env_data %>% rownames_to_column("site"), by = "site")

# Create plot
p_nmds <- ggplot(nmds_scores, aes(x = NMDS1, y = NMDS2, 
                                   color = habitat, shape = landscape)) +
  geom_point(size = 4, alpha = 0.8) +
  stat_ellipse(aes(group = habitat), linetype = 2, level = 0.95) +
  scale_color_brewer(palette = "Set1") +
  labs(
    title = "NMDS: Beetle Community Composition",
    subtitle = paste("Stress =", round(nmds_result$stress, 3)),
    color = "Habitat",
    shape = "Landscape"
  ) +
  theme_bw() +
  coord_equal()  # IMPORTANT: Equal scaling on both axes!

print(p_nmds)

Reading NMDS Plots:

  • Close points = Similar beetle communities
  • Distant points = Different beetle communities
  • Cluster by color = Habitats have distinct communities
  • Overlapping ellipses = Communities are similar between habitats
  • Separated ellipses = Communities differ between habitats

6.6 PCoA: Principal Coordinates Analysis

# Run PCoA
pcoa_result <- pcoa(dist_bray)

# Check variance explained
eigenvalues <- pcoa_result$values$Eigenvalues
var_explained <- eigenvalues / sum(abs(eigenvalues)) * 100

cat("\n=== PCoA VARIANCE EXPLAINED ===\n")
cat("Axis 1:", round(var_explained[1], 1), "%\n")
cat("Axis 2:", round(var_explained[2], 1), "%\n")
cat("Axes 1+2:", round(sum(var_explained[1:2]), 1), "%\n")

# Extract scores and create plot
pcoa_scores <- as.data.frame(pcoa_result$vectors[, 1:2])
colnames(pcoa_scores) <- c("PCoA1", "PCoA2")
pcoa_scores$site <- rownames(pcoa_scores)
pcoa_scores <- pcoa_scores %>%
  left_join(env_data %>% rownames_to_column("site"), by = "site")

p_pcoa <- ggplot(pcoa_scores, aes(x = PCoA1, y = PCoA2, color = habitat)) +
  geom_point(size = 4, alpha = 0.8) +
  stat_ellipse(linetype = 2) +
  scale_color_brewer(palette = "Set1") +
  labs(
    title = "PCoA: Beetle Community Composition",
    x = paste0("PCoA1 (", round(var_explained[1], 1), "%)"),
    y = paste0("PCoA2 (", round(var_explained[2], 1), "%)"),
    color = "Habitat"
  ) +
  theme_bw() +
  coord_equal()

print(p_pcoa)

🍽️ LUNCH: 11:15 - 12:15


7 Session 7: Statistical Testing

Time: 12:15 - 13:30 (75 minutes)

7.1 The Statistical Testing Workflow

╔═══════════════════════════════════════════════════════════════════════════════╗
β•‘               COMMUNITY ANALYSIS WORKFLOW                                      β•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•

        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚  STEP 1: VISUALIZE (NMDS/PCoA)                          β”‚
        β”‚  "Do the groups LOOK different?"                        β”‚
        β”‚  If no visible pattern, statistical test will likely    β”‚
        β”‚  be non-significant too.                                β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                     β”‚
                                     β–Ό
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚  STEP 2: TEST SIGNIFICANCE (PERMANOVA)                  β”‚
        β”‚  "Is the difference STATISTICALLY SIGNIFICANT?"         β”‚
        β”‚  p < 0.05 means unlikely to be due to chance.           β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                     β”‚
                                     β–Ό
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚  STEP 3: CHECK EFFECT SIZE (RΒ²)                         β”‚
        β”‚  "HOW MUCH of the variation does habitat explain?"      β”‚
        β”‚  A significant p-value with tiny RΒ² = trivial effect.   β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                     β”‚
                                     β–Ό
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚  STEP 4: CHECK ASSUMPTIONS (betadisper)                 β”‚
        β”‚  "Is the test valid?"                                   β”‚
        β”‚  PERMANOVA assumes groups have similar spread.          β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                     β”‚
                                     β–Ό
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚  STEP 5: IDENTIFY KEY SPECIES (SIMPER)                  β”‚
        β”‚  "WHICH species drive the differences?"                 β”‚
        β”‚  Connects statistics back to biology.                   β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

7.2 PERMANOVA: Testing for Community Differences

Why PERMANOVA Instead of ANOVA?

  • ANOVA tests one variable at a time (e.g., β€œDoes richness differ?”)
  • PERMANOVA tests the entire community simultaneously (e.g., β€œDoes species composition differ?”)

PERMANOVA = Permutational Multivariate Analysis of Variance

It works by:

  1. Calculating distances between all sites (Bray-Curtis)
  2. Testing whether within-group distances are smaller than between-group distances
  3. Using permutations (shuffling labels) to assess significance
# PERMANOVA: Do beetle communities differ between habitats?
permanova_result <- adonis2(
  comm_matrix ~ habitat * landscape,  # Test habitat, landscape, and interaction
  data = env_data,
  method = "bray",                    # Bray-Curtis dissimilarity
  permutations = 999                  # 999 permutations for p-value
)

print(permanova_result)

7.2.1 Interpreting PERMANOVA Output

cat("\n=== PERMANOVA INTERPRETATION ===\n\n")

# Extract key values
r2_habitat <- permanova_result$R2[1]
f_habitat <- permanova_result$F[1]
p_habitat <- permanova_result$`Pr(>F)`[1]

cat("HABITAT EFFECT:\n")
cat("  F-value:", round(f_habitat, 2), "\n")
cat("  RΒ² =", round(r2_habitat, 3), 
    "(", round(r2_habitat * 100, 1), "% of variation explained)\n")
cat("  p-value:", p_habitat, "\n\n")

# Significance interpretation
if (p_habitat < 0.001) {
  cat("  β˜… Highly significant (p < 0.001)\n")
} else if (p_habitat < 0.01) {
  cat("  β˜… Very significant (p < 0.01)\n")
} else if (p_habitat < 0.05) {
  cat("  β˜… Significant (p < 0.05)\n")
} else {
  cat("  βœ— Not significant (p β‰₯ 0.05)\n")
}

# Effect size interpretation
cat("\nEFFECT SIZE INTERPRETATION:\n")
if (r2_habitat < 0.05) {
  cat("  Effect is TINY - statistically significant but ecologically trivial\n")
} else if (r2_habitat < 0.10) {
  cat("  Effect is SMALL - habitat explains a minor portion\n")
} else if (r2_habitat < 0.25) {
  cat("  Effect is MEDIUM - habitat is an important factor\n")
} else {
  cat("  Effect is LARGE - habitat is a major driver of community structure\n")
}

Effect Size (RΒ²) Guide:

RΒ² Interpretation Example
< 0.05 Tiny Habitat explains less than 5% of variation
0.05-0.10 Small Habitat is one of many factors
0.10-0.25 Medium Habitat is an important driver
> 0.25 Large Habitat is a dominant factor

Always report both p-value AND RΒ²! A significant p-value with tiny RΒ² means the effect is real but not very important.

7.3 Checking Assumptions: Dispersion

Why Check Dispersion?

PERMANOVA assumes that groups have similar within-group variation (dispersion).

If one habitat has highly variable communities and another is very consistent, PERMANOVA might detect this as a β€œdifference” even if the average compositions are identical!

betadisper tests this assumption.

# Test for homogeneity of dispersion
dispersion <- betadisper(dist_bray, env_data$habitat)
dispersion_test <- permutest(dispersion, permutations = 999)

cat("\n=== DISPERSION TEST ===\n")
print(dispersion_test)

# Interpretation
if (dispersion_test$tab$`Pr(>F)`[1] > 0.05) {
  cat("\nβœ“ Dispersions are homogeneous (p > 0.05).\n")
  cat("  PERMANOVA assumption is met - results are reliable.\n")
} else {
  cat("\n⚠ Dispersions are UNEQUAL (p < 0.05)!\n")
  cat("  PERMANOVA results may be influenced by dispersion differences.\n")
  cat("  Interpret with caution.\n")
}

# Visualize dispersion
boxplot(dispersion, main = "Distance to Group Centroid by Habitat",
        xlab = "Habitat", ylab = "Distance to centroid")

7.4 SIMPER: Which Species Matter?

SIMPER = Similarity Percentages

PERMANOVA tells us communities differ, but not WHY.

SIMPER identifies which species contribute most to the differences.

This connects statistics back to biology!

# SIMPER analysis
simper_result <- simper(comm_matrix, env_data$habitat, permutations = 999)

cat("\n=== SIMPER RESULTS ===\n")
cat("Top species contributing to habitat differences:\n\n")
summary(simper_result)

Reading SIMPER Output:

Column Meaning
average Average contribution to dissimilarity
sd Standard deviation of contribution
ratio average/sd - consistency of contribution
cumsum Cumulative sum (% explained so far)

Key indicators:

  • High average + low sd = Consistent differentiator
  • ratio > 1 = Reliable indicator species
  • Look at species contributing to ~70% cumsum

β˜• BREAK: 13:30 - 13:45


8 Session 8: Interpretation & Synthesis

Time: 13:45 - 14:45 (60 minutes)

8.1 The Interpretation Framework

For Every Statistical Result, Ask:

  1. WHAT did we find? (The statistical result)
  2. HOW BIG is the effect? (Effect size, not just p-value)
  3. WHY might this happen? (Ecological mechanisms)
  4. SO WHAT? (Conservation/management implications)
  5. WHAT NEXT? (Limitations, new questions)

8.2 Example Interpretation

Example Finding: Beetle communities differ significantly between habitats (PERMANOVA: RΒ² = 0.32, p = 0.001, stress = 0.12)


1. WHAT did we find?

Beetle species composition differs between forest, grassland, and agricultural habitats. About 32% of the total variation in beetle communities is explained by habitat type alone.


2. HOW BIG is the effect?

This is a large effect (RΒ² > 0.25). Habitat is a major driver of beetle community structure. The NMDS stress of 0.12 indicates the pattern is reliably represented in 2D.


3. WHY might this happen? (Ecological mechanisms)

  • Habitat filtering: Forest provides leaf litter, shade, stable moisture - conditions that favor certain beetle groups (many Carabidae, Staphylinidae)
  • Resource availability: Different prey, detritus, and microhabitats in each habitat
  • Microclimate: Forests buffer temperature extremes; grasslands experience wider fluctuations
  • Disturbance regime: Agriculture has regular disturbance (plowing, pesticides) that selects for mobile generalist species
  • Historical factors: Forest beetles may have limited dispersal ability, excluding them from agricultural matrices

4. SO WHAT? (Implications)

  • Conservation: Forest fragments in agricultural landscapes may be critical refugia for specialized beetle species
  • Pest management: Agricultural communities may lack natural enemies present in forests
  • Restoration: Simply recreating habitat structure may not restore beetle communities if source populations are absent
  • Monitoring: Beetles could serve as indicators of habitat quality

5. WHAT NEXT? (Limitations and new questions)

Limitations: - Single sampling season (seasonal turnover not captured) - Pitfall traps favor ground-active species (canopy beetles missed) - Morphospecies identification may underestimate true diversity - No data on habitat quality within types

New questions: - Which specific environmental variables (temperature, leaf litter depth, soil moisture) drive beetle distributions? - Does forest fragment size affect beetle community composition? - How quickly do beetle communities recover after habitat restoration? - Are there β€œspillover” effects at habitat edges?

8.3 Creating Summary Results

# Create results summary table
results_summary <- data.frame(
  Analysis = c(
    "Alpha Diversity (Shannon)",
    "NMDS Ordination",
    "PERMANOVA (Habitat)",
    "PERMANOVA (Landscape)", 
    "Dispersion Test"
  ),
  Statistic = c(
    paste("Kruskal χ² =", round(kruskal_result$statistic, 2)),
    paste("Stress =", round(nmds_result$stress, 3)),
    paste("F =", round(permanova_result$F[1], 2)),
    paste("F =", round(permanova_result$F[2], 2)),
    paste("F =", round(dispersion_test$tab$F[1], 2))
  ),
  Effect_Size = c(
    "See boxplot",
    "Visual separation",
    paste("RΒ² =", round(permanova_result$R2[1], 3)),
    paste("RΒ² =", round(permanova_result$R2[2], 3)),
    "NA"
  ),
  P_value = c(
    round(kruskal_result$p.value, 4),
    "NA",
    permanova_result$`Pr(>F)`[1],
    permanova_result$`Pr(>F)`[2],
    round(dispersion_test$tab$`Pr(>F)`[1], 4)
  ),
  Interpretation = c(
    ifelse(kruskal_result$p.value < 0.05, "Significant", "Not significant"),
    ifelse(nmds_result$stress < 0.2, "Good fit", "Poor fit"),
    ifelse(permanova_result$`Pr(>F)`[1] < 0.05, "Communities differ", "No difference"),
    ifelse(permanova_result$`Pr(>F)`[2] < 0.05, "Effect of landscape", "No effect"),
    ifelse(dispersion_test$tab$`Pr(>F)`[1] > 0.05, "Assumption met", "Assumption violated")
  )
)

cat("\n╔═══════════════════════════════════════════════════════════════╗\n")
cat("β•‘                    RESULTS SUMMARY TABLE                       β•‘\n")
cat("β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•\n\n")
print(results_summary, row.names = FALSE)

8.4 Final Publication Figure

# Create comprehensive figure
p_final <- ggplot(nmds_scores, aes(x = NMDS1, y = NMDS2, 
                                    color = habitat, shape = landscape)) +
  geom_point(size = 5, alpha = 0.8) +
  stat_ellipse(aes(group = habitat), linetype = 2, linewidth = 1) +
  scale_color_brewer(palette = "Set1") +
  scale_shape_manual(values = c(16, 17)) +
  labs(
    title = "Beetle Community Composition Across Habitats",
    subtitle = sprintf("NMDS stress = %.3f | PERMANOVA: RΒ² = %.2f, p = %.3f",
                       nmds_result$stress,
                       permanova_result$R2[1],
                       permanova_result$`Pr(>F)`[1]),
    color = "Habitat",
    shape = "Landscape"
  ) +
  theme_bw(base_size = 14) +
  theme(
    legend.position = "right",
    plot.title = element_text(face = "bold"),
    plot.subtitle = element_text(size = 10)
  ) +
  coord_equal()

print(p_final)

# Save
ggsave("figures/nmds_final.png", p_final, width = 10, height = 8, dpi = 300)

8.5 Complete Analysis Template

#============================================================================
# COMPLETE ANALYSIS SCRIPT TEMPLATE
# Copy and modify for your own analyses
#============================================================================

# 1. SETUP ----
rm(list = ls())
library(tidyverse)
library(vegan)
library(ape)
set.seed(42)

# 2. IMPORT DATA ----
raw_data <- read.csv("data/raw/sample_insect_data.csv")

# 3. EXPLORE AND CHOOSE FOCAL TAXA ----
# [Run decision framework from Session 4]

# 4. PREPARE DATA ----
comm_matrix <- raw_data %>%
  filter(order == "Coleoptera") %>%  # Your chosen focal group
  group_by(site, morphospecies) %>%
  summarise(abundance = sum(abundance), .groups = "drop") %>%
  pivot_wider(names_from = morphospecies, values_from = abundance, values_fill = 0) %>%
  column_to_rownames("site")

env_data <- raw_data %>%
  select(site, habitat, landscape) %>%
  distinct() %>%
  column_to_rownames("site")
env_data <- env_data[rownames(comm_matrix), ]

# 5. ALPHA DIVERSITY ----
alpha_div <- data.frame(
  site = rownames(comm_matrix),
  richness = specnumber(comm_matrix),
  shannon = diversity(comm_matrix, "shannon")
) %>%
  left_join(env_data %>% rownames_to_column("site"), by = "site")

kruskal.test(shannon ~ habitat, data = alpha_div)

# 6. BETA DIVERSITY ----
dist_bray <- vegdist(comm_matrix, method = "bray")
nmds <- metaMDS(comm_matrix, distance = "bray", k = 2, trymax = 100)
cat("NMDS Stress:", nmds$stress, "\n")

# 7. STATISTICAL TESTS ----
permanova <- adonis2(comm_matrix ~ habitat * landscape, 
                     data = env_data, method = "bray", permutations = 999)
print(permanova)

betadisper(dist_bray, env_data$habitat) %>% permutest()

simper(comm_matrix, env_data$habitat)

# 8. SAVE OUTPUTS ----
write.csv(alpha_div, "output/alpha_diversity.csv", row.names = FALSE)
ggsave("figures/nmds_plot.png", width = 8, height = 6, dpi = 300)

Wrap-up & Next Steps

Time: 14:45 - 15:00 (15 minutes)

What You Learned

Day 1:

  • βœ… R basics and data import
  • βœ… Data wrangling with tidyverse
  • βœ… Choosing focal taxa - the decision framework
  • βœ… Creating community matrices

Day 2:

  • βœ… Diversity metrics and their ecological meaning
  • βœ… Rarefaction for sampling effort
  • βœ… NMDS vs PCoA - when to use which
  • βœ… PERMANOVA - testing community differences
  • βœ… SIMPER - identifying key species
  • βœ… Ecological interpretation of results

Resources for Continued Learning

Books:

  • Numerical Ecology with R (Borcard, Gillet & Legendre) - The bible for this topic
  • R for Data Science (Wickham & Grolemund) - Free online: r4ds.had.co.nz

Online:

Packages to explore:

  • BiodiversityR - More biodiversity analysis tools
  • indicspecies - Indicator species analysis
  • iNEXT - Diversity extrapolation

Quick Reference Card

╔══════════════════════════════════════════════════════════════════════════════╗
β•‘                    R FOR INSECT ECOLOGY - QUICK REFERENCE                    β•‘
╠══════════════════════════════════════════════════════════════════════════════╣
β•‘                                                                              β•‘
β•‘ DATA WRANGLING (dplyr)                                                       β•‘
β•‘ ─────────────────────────────────────────────────────────────────────────    β•‘
β•‘ filter(data, condition)          Keep rows matching condition                β•‘
β•‘ select(data, columns)            Keep selected columns                       β•‘
β•‘ mutate(data, new_col = ...)      Create new columns                          β•‘
β•‘ group_by() %>% summarise()       Aggregate by groups                         β•‘
β•‘ pivot_wider() / pivot_longer()   Reshape data                                β•‘
β•‘                                                                              β•‘
β•‘ DIVERSITY METRICS (vegan)                                                    β•‘
β•‘ ─────────────────────────────────────────────────────────────────────────    β•‘
β•‘ specnumber(x)                    Species richness                            β•‘
β•‘ diversity(x, "shannon")          Shannon H'                                  β•‘
β•‘ diversity(x, "simpson")          Simpson 1-D                                 β•‘
β•‘ rarefy(x, sample = n)            Rarefied richness                           β•‘
β•‘                                                                              β•‘
β•‘ ORDINATION                                                                   β•‘
β•‘ ─────────────────────────────────────────────────────────────────────────    β•‘
β•‘ vegdist(x, "bray")               Bray-Curtis dissimilarity                   β•‘
β•‘ metaMDS(x, k = 2)                NMDS (report stress!)                       β•‘
β•‘ pcoa(dist)                       PCoA (report % variance)                    β•‘
β•‘                                                                              β•‘
β•‘ STATISTICAL TESTS                                                            β•‘
β•‘ ─────────────────────────────────────────────────────────────────────────    β•‘
β•‘ adonis2(matrix ~ factor)         PERMANOVA (report RΒ² AND p!)                β•‘
β•‘ betadisper(dist, group)          Check dispersion assumption                 β•‘
β•‘ simper(matrix, group)            Identify key species                        β•‘
β•‘                                                                              β•‘
β•‘ INTERPRETATION GUIDE                                                         β•‘
β•‘ ─────────────────────────────────────────────────────────────────────────    β•‘
β•‘ NMDS Stress:  < 0.10 good | 0.10-0.20 okay | > 0.20 poor                     β•‘
β•‘ PERMANOVA RΒ²: < 0.05 tiny | 0.05-0.10 small | 0.10-0.25 med | > 0.25 large   β•‘
β•‘                                                                              β•‘
β•‘ DECISION CHECKLIST FOR FOCAL TAXA                                            β•‘
β•‘ ─────────────────────────────────────────────────────────────────────────    β•‘
β•‘ β–‘ Total abundance β‰₯ 50?                                                      β•‘
β•‘ β–‘ Number of species β‰₯ 5?                                                     β•‘
β•‘ β–‘ Present in β‰₯ 50% of sites?                                                 β•‘
β•‘ β–‘ Shows habitat variation (CV > 30%)?                                        β•‘
β•‘ β–‘ Ecologically relevant for pitfall sampling?                                β•‘
β•‘                                                                              β•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•

Happy coding and happy bug hunting! πŸ›