This workshop is designed for entomology students who are new to R and may need a refresher on ecological concepts related to biodiversity data analysis. No prior R experience is required!
By the end of this workshop, you will be able to:
When we set pitfall traps across different habitats, weβre fundamentally asking:
βDo different environments support different insect communities, and why?β
This question connects to foundational ecological theories:
Key Ecological Concepts:
Niche Theory: Each species has specific environmental requirements (temperature, moisture, food). Species occur where conditions match their niche.
Habitat Filtering: The environment βfiltersβ which species can survive. Forest conditions filter out sun-loving species; agricultural disturbance filters out sensitive specialists.
Species-Area Relationship: Larger, more complex habitats typically support more species.
Indicator Species: Some species are sensitive to environmental conditions and can indicate habitat quality or disturbance level.
Our statistical analyses are tools to detect these ecological patterns and test whether the differences we observe are real or just due to chance.
Before writing any code, understand the journey from raw data to ecological insights:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β COMPLETE ANALYSIS WORKFLOW β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
βββββββββββββββββββ
β RAW DATA β
β (Your pitfall β
β trap counts) β
ββββββββββ¬βββββββββ
β
ββββββββββββββββββββββββββββββββΌβββββββββββββββββββββββββββββββ
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β PHASE 1: DATA PREPARATION (Day 1, Sessions 1-3) β β
β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ£ β
β β β’ Import data into R β β
β β β’ Check for errors (typos, impossible values) β β
β β β’ Standardize names (taxonomy is messy!) β β
β β β’ Create community matrix (sites Γ species) β β
β β β β
β β KEY QUESTION: "Is my data clean and trustworthy?" β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β PHASE 2: DATA EXPLORATION (Day 1, Session 4) β β
β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ£ β
β β β’ Summarize: How many species? How many individuals? β β
β β β’ Explore: Which insect ORDERS are abundant? β β
β β β’ Decide: Which groups have "interesting stories"? β β
β β β’ Visualize: What patterns can we SEE? β β
β β β β
β β KEY QUESTION: "Which insect groups should I focus β β
β β my analysis on?" β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
D β βΌ β
A β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
Y β β PHASE 3: DIVERSITY ANALYSIS (Day 2, Session 5) β β
β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ£ β
1 β β ALPHA DIVERSITY (within-site) β β
β β β’ How many species at each site? (Richness) β β
| β β β’ How diverse is each site? (Shannon, Simpson) β β
β β β’ Are differences due to sampling effort? β β
D β β (Rarefaction) β β
A β β β β
Y β β KEY QUESTION: "Which habitats are most diverse?" β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
2 β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β PHASE 4: COMMUNITY ANALYSIS (Day 2, Sessions 6-7) β β
β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ£ β
β β BETA DIVERSITY (between-sites) β β
β β β’ How different are communities? (Distance matrix) β β
β β β’ Can we VISUALIZE the pattern? (NMDS/PCoA) β β
β β β’ Is the pattern SIGNIFICANT? (PERMANOVA) β β
β β β’ WHICH species drive differences? (SIMPER) β β
β β β β
β β KEY QUESTION: "Do habitats have different β β
β β communities? Which species matter?" β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β PHASE 5: INTERPRETATION (Day 2, Session 8) β β
β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ£ β
β β β’ What do the numbers MEAN ecologically? β β
β β β’ What MECHANISMS might explain the patterns? β β
β β β’ What are the LIMITATIONS? β β
β β β’ What NEW QUESTIONS arise? β β
β β β β
β β KEY QUESTION: "So what? Why does this matter?" β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
ββββββββββββββββββββββββββββββββΌβββββββββββββββββββββββββββββββ
βΌ
βββββββββββββββββββ
β PUBLICATION / β
β THESIS CHAPTER β
βββββββββββββββββββ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β DAY 1: DATA FOUNDATIONS & EXPLORATION 9:00 - 15:00 β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ£
β 09:00 - 10:15 Session 1: R Basics & Importing Data 75 min β
β 10:15 - 10:30 β Break 15 min β
β 10:30 - 11:30 Session 2: Data Wrangling with tidyverse 60 min β
β 11:30 - 12:30 π½οΈ Lunch 60 min β
β 12:30 - 13:45 Session 3: Data Cleaning & Community Matrices 75 min β
β 13:45 - 14:00 β Break 15 min β
β 14:00 - 15:00 Session 4: Choosing Your Focal Taxa 60 min β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ£
β DAY 2: ANALYSIS & INTERPRETATION 9:00 - 15:00 β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ£
β 09:00 - 10:00 Session 5: Diversity Metrics & Visualization 60 min β
β 10:00 - 10:15 β Break 15 min β
β 10:15 - 11:15 Session 6: Ordination - Seeing Community Patterns 60 min β
β 11:15 - 12:15 π½οΈ Lunch 60 min β
β 12:15 - 13:30 Session 7: Statistical Testing 75 min β
β 13:30 - 13:45 β Break 15 min β
β 13:45 - 14:45 Session 8: Interpretation & Synthesis 60 min β
β 14:45 - 15:00 Wrap-up & Next Steps 15 min β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Weβll use sample_insect_data.csv - a
pitfall trap survey of ground-dwelling arthropods.
About Pitfall Traps:
Pitfall traps are cups buried flush with the ground surface. Ground-active arthropods fall in and are collected. This method is:
Always consider sampling bias when interpreting results!
| Column | Description |
|---|---|
| site | Site identifier (S01-S12) |
| habitat | Habitat type (forest, grassland, agriculture) |
| landscape | Landscape complexity (complex, simple) |
| order | Insect order (Coleoptera, Hymenoptera, etc.) |
| family | Insect family |
| morphospecies | Species identifier |
| abundance | Number of individuals |
Time: 09:00 - 10:15 (75 minutes)
Why do we use R instead of Excel?
Reproducibility: Your analysis is saved as a script. Anyone (including future you) can re-run it exactly.
Ecological packages: R has specialized packages
like vegan designed specifically for community
ecology.
Handling complexity: Ecological data often has hundreds of species and complex designs. R handles this easily.
Publication-quality figures: R creates beautiful, customizable plots ready for journals.
# Packages are collections of functions that extend R's capabilities
# We load them at the start of EVERY session
library(tidyverse) # Data wrangling and visualization
library(vegan) # Community ecology analyses (diversity, ordination)
library(ape) # Phylogenetic tools and PCoA ordination
# If you get "there is no package called 'x'", install it first:
# install.packages("x")Why these packages?
filter, select,
mutate# R as a calculator
2 + 3 # Addition
10 / 2 # Division
5^2 # Power (5 squared = 25)
sqrt(16) # Square root
# Creating objects with <- (assignment operator)
species_count <- 42 # Store a number
site_name <- "Forest_A" # Store text
# Vectors: collections of values (fundamental to R!)
abundance <- c(12, 45, 23, 8, 67, 34)
species <- c("Carabidae_sp1", "Carabidae_sp2", "Staphylinidae_sp1")
# Basic operations on vectors
length(abundance) # How many elements? β 6
sum(abundance) # Total β 189
mean(abundance) # Average β 31.5
max(abundance) # Maximum β 67
min(abundance) # Minimum β 8
# Extracting elements (indexing)
abundance[1] # First element β 12
abundance[c(1, 3, 5)] # Elements 1, 3, and 5
abundance[abundance > 30] # Only values > 30 β 45, 67, 34# Import our main dataset
# Make sure the file is in your working directory or provide full path
insect_data <- read.csv("data/raw/sample_insect_data.csv")
# More robust import with options for common issues:
insect_data <- read.csv(
"data/raw/sample_insect_data.csv",
header = TRUE, # First row contains column names
stringsAsFactors = FALSE, # Keep text as text (not factors)
na.strings = c("", "NA", "N/A", "-") # Recognize these as missing data
)Why explore data immediately?
Before any analysis, you need to know:
Never trust data blindly - always look at it first!
# First look at your data
head(insect_data) # First 6 rows
tail(insect_data) # Last 6 rows
View(insect_data) # Spreadsheet view (capital V!)
# Structure and dimensions
str(insect_data) # Column types and preview
dim(insect_data) # Rows Γ columns
nrow(insect_data) # Number of rows
ncol(insect_data) # Number of columns
names(insect_data) # Column names
# Summary statistics
summary(insect_data) # Statistical summary of each column# Missing values (NA = "Not Available")
sum(is.na(insect_data)) # Total NAs in entire dataset
colSums(is.na(insect_data)) # NAs per column
# Check categorical columns for typos
unique(insect_data$habitat) # What habitat values exist?
unique(insect_data$order) # What insect orders are present?
# Frequency tables
table(insect_data$habitat) # How many records per habitat?
table(insect_data$order) # How many records per order?
# Check numeric ranges for impossible values
range(insect_data$abundance) # Min and max abundance
summary(insect_data$abundance) # Full summary
# Look for suspicious values
insect_data[insect_data$abundance < 0, ] # Negative abundances?β BREAK: 10:15 - 10:30
Time: 10:30 - 11:30 (60 minutes)
Why use pipes (%>%)?
Pipes make code readable by turning nested operations into a
step-by-step recipe. Read %>% as
βTHENβ.
Instead of: βBake(Mix(Measure(flour, sugar), eggs))β
Write: βTake flour and sugar, THEN measure, THEN mix with eggs, THEN bakeβ
library(tidyverse)
# Without pipe - confusing nested functions:
round(mean(sqrt(c(1, 4, 9, 16))), 2)
# With pipe - clear step-by-step:
c(1, 4, 9, 16) %>% # Start with these numbers
sqrt() %>% # THEN take square root
mean() %>% # THEN calculate mean
round(2) # THEN round to 2 decimals
# Keyboard shortcut: Ctrl + Shift + M (Windows) or Cmd + Shift + M (Mac)Think of these as actions you perform on your data:
| Verb | Action | Example Use |
|---|---|---|
filter() |
Keep rows that match a condition | Keep only forest sites |
select() |
Keep or remove columns | Keep only site, species, abundance |
mutate() |
Create new columns | Calculate log-transformed abundance |
arrange() |
Sort rows | Sort by abundance (high to low) |
summarise() |
Collapse to summary statistics | Calculate mean abundance |
group_by() |
Group data for operations | Calculate mean BY habitat |
# Keep only forest sites
insect_data %>%
filter(habitat == "forest")
# Multiple conditions with AND (comma or &)
insect_data %>%
filter(habitat == "forest", abundance > 10)
# Multiple options with OR (|) or %in%
insect_data %>%
filter(habitat %in% c("forest", "grassland"))
# Exclude with !=
insect_data %>%
filter(habitat != "agriculture")# Keep specific columns
insect_data %>%
select(site, habitat, morphospecies, abundance)
# Remove columns with minus sign
insect_data %>%
select(-family)
# Select helpers for patterns
insect_data %>%
select(starts_with("site")) # Columns starting with "site"
insect_data %>%
select(contains("species")) # Columns containing "species"Why is this so powerful?
In ecology, we almost always want to compare GROUPS (habitats,
treatments, sites). group_by() tells R to do calculations
separately for each group.
# Calculate summary statistics BY habitat
insect_data %>%
group_by(habitat) %>%
summarise(
total_abundance = sum(abundance),
n_species = n_distinct(morphospecies),
n_sites = n_distinct(site),
mean_abundance = mean(abundance),
.groups = "drop" # Clean up grouping after summarise
)
# Multiple grouping variables
insect_data %>%
group_by(habitat, order) %>%
summarise(
total_abundance = sum(abundance),
n_species = n_distinct(morphospecies),
.groups = "drop"
)Ecological Data Formats:
Long format: One row per observation (site-species-abundance). Good for: data entry, plotting, dplyr operations.
Wide format (Community Matrix): Rows = sites, Columns = species. Required for: vegan package, ordination, diversity indices.
Most raw data is long; most analyses need wide. Youβll convert between them constantly!
# LONG TO WIDE: Create a community matrix
species_wide <- insect_data %>%
group_by(site, morphospecies) %>%
summarise(abundance = sum(abundance), .groups = "drop") %>%
pivot_wider(
names_from = morphospecies, # Species become columns
values_from = abundance, # Cell values are abundances
values_fill = 0 # Missing = 0 (species absent)
)
head(species_wide)
# WIDE TO LONG: For plotting
species_long <- species_wide %>%
pivot_longer(
cols = -site, # All columns except site
names_to = "species", # Column names go to "species"
values_to = "abundance" # Values go to "abundance"
)π½οΈ LUNCH: 11:30 - 12:30
Time: 12:30 - 13:45 (75 minutes)
βGarbage In, Garbage Outβ
Taxonomic data is notoriously messy. Common problems:
Your cleaning decisions affect your conclusions. Document what you did!
# Check for problems first
unique(insect_data$habitat) # Look for typos, case issues
# Common cleaning operations
clean_data <- insect_data %>%
mutate(
# Standardize text: lowercase and remove extra spaces
habitat = tolower(trimws(habitat)),
order = tolower(trimws(order)),
# Fix specific typos (if any exist)
habitat = case_when(
habitat == "forrest" ~ "forest", # Common typo
habitat == "grasland" ~ "grassland",
TRUE ~ habitat # Keep all others as-is
)
) %>%
# Remove impossible or missing values
filter(
abundance >= 0, # No negative abundances
!is.na(abundance), # No missing abundances
!is.na(morphospecies) # No missing species IDs
)
# Verify cleaning worked
unique(clean_data$habitat)What is a Community Matrix?
The community matrix is the fundamental data structure in community ecology:
Species_A Species_B Species_C Species_D
Site_1 12 5 0 8
Site_2 0 23 15 2
Site_3 7 0 0 45
This matrix is what ordination and diversity analyses use!
# For this workshop, we'll focus on beetles (Coleoptera)
# We'll learn HOW to choose focal taxa in Session 4
comm_matrix <- insect_data %>%
filter(order == "Coleoptera") %>% # Focus on beetles
group_by(site, morphospecies) %>%
summarise(abundance = sum(abundance), .groups = "drop") %>%
pivot_wider(
names_from = morphospecies,
values_from = abundance,
values_fill = 0 # CRITICAL: Missing species = 0, not NA
) %>%
column_to_rownames("site") # Site names become row names
# Check the result
dim(comm_matrix) # Sites Γ Species
head(comm_matrix[, 1:5]) # First 5 speciesWhy values_fill = 0?
In ecology, if we didnβt record a species at a site, we assume it was absent (abundance = 0), not βunknownβ (NA).
This is an important assumption! It assumes: - Equal sampling effort across sites - Species detection is reliable
If sampling effort varied, you should use rarefaction (Session 5).
# Environmental data: one row per site with habitat info
env_data <- insect_data %>%
select(site, habitat, landscape) %>%
distinct() %>% # One row per site
column_to_rownames("site") # Site names as row names
# CRITICAL: Ensure rows match between matrices!
env_data <- env_data[rownames(comm_matrix), , drop = FALSE]
# rownames(comm_matrix): Extracts the unique IDs (like "S01", "S02") from your insect data.
# env_data[..., ]: Selects and reorders rows in the environmental table based on those IDs
# , (The blank space): Tells R to keep all columns (e.g., habitat, elevation, distance).
# drop = FALSE -- A "defensive" command. It ensures that even if you only have one environmental variable, R keeps the data as a data frame instead of turning it into a simple list (vector)..
# Verify alignment (MUST be TRUE!)
all(rownames(comm_matrix) == rownames(env_data))β BREAK: 13:45 - 14:00
Time: 14:00 - 15:00 (60 minutes)
The Central Question of This Session:
Your pitfall traps caught multiple insect orders (Coleoptera, Hymenoptera, Araneae, etc.). Which groups should you analyze in detail?
Not all groups are equal! Some have ecological stories to tell; others are just noise.
Why Not Analyze Everything?
Good focal taxa have:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β DECISION FLOWCHART: CHOOSING FOCAL INSECT GROUPS β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
For each insect ORDER in your data:
β
βΌ
βββββββββββββββββββββββββββββββββ
β CRITERION 1: ABUNDANCE β
β Total individuals β₯ 50? β
βββββββββββββββββ¬ββββββββββββββββ
β
NO βββββββββ΄ββββββββΊ YES
β β
βΌ βΌ
βββββββββββββββββββ βββββββββββββββββββββββββββββββββ
β EXCLUDE β β CRITERION 2: DIVERSITY β
β Too few data β β Number of morphospecies β₯ 5? β
β for statistics β βββββββββββββββββ¬ββββββββββββββββ
βββββββββββββββββββ β
NO βββββββββ΄ββββββββΊ YES
β β
βΌ βΌ
βββββββββββββββββββ βββββββββββββββββββββββββββββββββ
β POSSIBLE but β β CRITERION 3: OCCURRENCE β
β Limited β β Present in β₯ 50% of sites? β
β (can only β βββββββββββββββββ¬ββββββββββββββββ
β analyze abund- β β
β ance, not β NO βββββββββ΄ββββββββΊ YES
β diversity) β β β
βββββββββββββββββββ βΌ βΌ
βββββββββββββββββββ βββββββββββββββββββββββββββββββββ
β CAUTION β β CRITERION 4: VARIATION β
β May be habitat β β Shows differences between β
β specialist or β β habitats? (CV > 30%) β
β rare β βββββββββββββββββ¬ββββββββββββββββ
βββββββββββββββββββ β
NO βββββββββ΄ββββββββΊ YES
β β
βΌ βΌ
βββββββββββββββββββ βββββββββββββββββββ
β POSSIBLE β β EXCELLENT β
β May be β β FOCAL TAXON! β
β generalist β β β
β (still β β High priority β
β interesting!) β β for analysis β
βββββββββββββββββββ βββββββββββββββββββ
# What do we have?
cat("=== DATASET OVERVIEW ===\n")
cat("Total records:", nrow(insect_data), "\n")
cat("Total individuals:", sum(insect_data$abundance), "\n")
cat("Number of sites:", n_distinct(insect_data$site), "\n")
cat("Number of habitats:", n_distinct(insect_data$habitat), "\n")
cat("Number of orders:", n_distinct(insect_data$order), "\n")
cat("Number of morphospecies:", n_distinct(insect_data$morphospecies), "\n")
# How are individuals distributed across orders?
insect_data %>%
group_by(order) %>%
summarise(
total_abundance = sum(abundance),
percent_of_total = round(sum(abundance) / sum(insect_data$abundance) * 100, 1)
) %>%
arrange(desc(total_abundance))# Calculate all criteria for each order
order_evaluation <- insect_data %>%
group_by(order) %>%
summarise(
# Criterion 1: Abundance
total_abundance = sum(abundance),
# Criterion 2: Diversity
n_morphospecies = n_distinct(morphospecies),
# Criterion 3: Occurrence
n_sites_present = n_distinct(site[abundance > 0]),
pct_sites = round(n_distinct(site[abundance > 0]) / n_distinct(insect_data$site) * 100, 0),
.groups = "drop"
) %>%
arrange(desc(total_abundance))
print(order_evaluation)
# Visualize: Which orders are most abundant?
ggplot(order_evaluation, aes(x = reorder(order, total_abundance),
y = total_abundance)) +
geom_col(fill = "steelblue") +
geom_text(aes(label = paste0("n=", n_morphospecies, " spp")),
hjust = -0.1, size = 3) +
coord_flip() +
labs(
x = "Order",
y = "Total Abundance",
title = "Which Insect Orders Dominate Your Samples?",
subtitle = "Numbers show morphospecies count"
) +
theme_bw()# For Criterion 4: Do abundances vary between habitats?
habitat_variation <- insect_data %>%
group_by(order, habitat) %>%
summarise(abundance = sum(abundance), .groups = "drop") %>%
group_by(order) %>%
summarise(
# Coefficient of Variation: higher = more variation between habitats
cv_abundance = round(sd(abundance) / mean(abundance) * 100, 1),
# Which habitat has most individuals?
dominant_habitat = habitat[which.max(abundance)],
# How much more in best vs worst habitat?
max_min_ratio = round(max(abundance) / (min(abundance) + 1), 1),
.groups = "drop"
)
print(habitat_variation)Interpreting Coefficient of Variation (CV):
Both patterns are ecologically interesting! Generalists tell us habitats are functionally similar for that group; specialists show habitat filtering.
# Combine criteria and score each order
focal_taxa_scores <- order_evaluation %>%
left_join(habitat_variation, by = "order") %>%
mutate(
# Score each criterion (0-3 points each)
score_abundance = case_when(
total_abundance >= 100 ~ 3,
total_abundance >= 50 ~ 2,
total_abundance >= 30 ~ 1,
TRUE ~ 0
),
score_diversity = case_when(
n_morphospecies >= 10 ~ 3,
n_morphospecies >= 5 ~ 2,
n_morphospecies >= 3 ~ 1,
TRUE ~ 0
),
score_occurrence = case_when(
pct_sites >= 80 ~ 3,
pct_sites >= 50 ~ 2,
pct_sites >= 30 ~ 1,
TRUE ~ 0
),
score_variation = case_when(
cv_abundance >= 50 ~ 3,
cv_abundance >= 30 ~ 2,
cv_abundance >= 15 ~ 1,
TRUE ~ 0
),
# Total score (max = 12)
total_score = score_abundance + score_diversity + score_occurrence + score_variation,
# Final recommendation
recommendation = case_when(
total_score >= 10 ~ "β
β
β
Highly recommended",
total_score >= 7 ~ "β
β
Recommended",
total_score >= 4 ~ "β
Possible with caution",
TRUE ~ "β Not recommended"
)
) %>%
arrange(desc(total_score))
# Display decision table
cat("\n========================================\n")
cat(" FOCAL TAXA RECOMMENDATION TABLE\n")
cat("========================================\n\n")
focal_taxa_scores %>%
select(order, total_abundance, n_morphospecies, pct_sites,
cv_abundance, total_score, recommendation) %>%
print(n = Inf)# Create visual summary of recommendations
ggplot(focal_taxa_scores, aes(x = reorder(order, total_score),
y = total_score,
fill = recommendation)) +
geom_col() +
geom_hline(yintercept = c(4, 7, 10), linetype = "dashed", alpha = 0.5) +
coord_flip() +
scale_fill_manual(values = c(
"β
β
β
Highly recommended" = "#27ae60",
"β
β
Recommended" = "#3498db",
"β
Possible with caution" = "#f39c12",
"β Not recommended" = "#95a5a6"
)) +
labs(
x = "Order",
y = "Suitability Score (max = 12)",
title = "Which Insect Groups Should You Analyze?",
subtitle = "Based on abundance, diversity, occurrence, and habitat variation",
fill = "Recommendation"
) +
theme_bw() +
theme(legend.position = "bottom")Common Pitfall Trap Groups and Their Ecological Value:
| Group | Pitfall Suitability | Ecological Role | Indicator Value |
|---|---|---|---|
| Carabidae (ground beetles) | β β β Excellent | Predators, seed eaters | High - respond to disturbance, moisture |
| Formicidae (ants) | β β β Excellent | Ecosystem engineers | High - sensitive to temperature, disturbance |
| Araneae (spiders) | β β Good | Predators | High - indicate prey availability |
| Staphylinidae (rove beetles) | β β Good | Predators, decomposers | Moderate - very diverse |
| Scarabaeidae (dung beetles) | β β Good | Decomposers | High - depend on mammal presence |
| Orthoptera (grasshoppers) | β Moderate | Herbivores | Moderate - vegetation-dependent |
| Diptera (flies) | β Poor | Various | Low - mostly fly away from traps |
| Lepidoptera (moths) | β Poor | Herbivores, pollinators | Low - rarely caught in pitfalls |
# Once you've chosen a focal group (e.g., Coleoptera), preview the patterns
# Create community matrix for chosen group
focal_matrix <- insect_data %>%
filter(order == "Coleoptera") %>% # Replace with your chosen order
group_by(site, morphospecies) %>%
summarise(abundance = sum(abundance), .groups = "drop") %>%
pivot_wider(names_from = morphospecies, values_from = abundance, values_fill = 0) %>%
column_to_rownames("site")
# Get environmental data
focal_env <- insect_data %>%
select(site, habitat, landscape) %>%
distinct() %>%
column_to_rownames("site")
focal_env <- focal_env[rownames(focal_matrix), ]
# Quick NMDS preview (we'll learn this properly tomorrow!)
library(vegan)
set.seed(123)
nmds_preview <- metaMDS(focal_matrix, distance = "bray", k = 2, trymax = 50)
# Plot preview
nmds_scores <- as.data.frame(scores(nmds_preview, display = "sites"))
nmds_scores$habitat <- focal_env$habitat
ggplot(nmds_scores, aes(x = NMDS1, y = NMDS2, color = habitat)) +
geom_point(size = 4) +
stat_ellipse(linetype = 2) +
labs(
title = "Preview: Do Beetle Communities Differ by Habitat?",
subtitle = paste("NMDS stress =", round(nmds_preview$stress, 3),
"- We'll interpret this properly tomorrow!"),
color = "Habitat"
) +
scale_color_brewer(palette = "Set1") +
theme_bw() +
coord_equal()End of Day 1 Checkpoint:
Before tomorrow, you should have:
Objects weβll use tomorrow:
insect_data - The raw datacomm_matrix - Community matrix for your focal
groupenv_data - Environmental data matching the community
matrixTime: 09:00 - 10:00 (60 minutes)
# Load packages
library(tidyverse)
library(vegan)
library(ape)
set.seed(123)
# If starting fresh, recreate from Day 1:
insect_data <- read.csv("data/raw/sample_insect_data.csv")
# Community matrix (using beetles as our focal group)
comm_matrix <- insect_data %>%
filter(order == "Coleoptera") %>%
group_by(site, morphospecies) %>%
summarise(abundance = sum(abundance), .groups = "drop") %>%
pivot_wider(names_from = morphospecies, values_from = abundance, values_fill = 0) %>%
column_to_rownames("site")
# Environmental data
env_data <- insect_data %>%
select(site, habitat, landscape) %>%
distinct() %>%
column_to_rownames("site")
env_data <- env_data[rownames(comm_matrix), , drop = FALSE]
# Verify alignment
stopifnot(all(rownames(comm_matrix) == rownames(env_data)))
cat("β Data loaded successfully!\n")
cat(" Community matrix:", nrow(comm_matrix), "sites Γ", ncol(comm_matrix), "species\n")Why Do We Need Multiple Diversity Metrics?
Consider two beetle communities, each with 100 individuals and 10 species:
Community A (Uneven):
Species 1: 91 individuals ββββββββββββββββββββββββββββββββββββββββ
Species 2: 1 individual β
Species 3: 1 individual β
... (7 more species with 1 each)
Community B (Even):
Species 1: 10 individuals ββββ
Species 2: 10 individuals ββββ
Species 3: 10 individuals ββββ
... (7 more species with 10 each)
Both have richness = 10, but theyβre very different! - Community A is dominated by one species (low evenness) - Community B is balanced (high evenness)
Different diversity indices capture different aspects of this structure.
# SPECIES RICHNESS (S)
# Simply: How many species are present?
richness <- specnumber(comm_matrix)
# SHANNON DIVERSITY (H')
# Combines richness AND evenness
# Higher = more diverse OR more even
shannon <- diversity(comm_matrix, index = "shannon")
# SIMPSON DIVERSITY (1-D)
# Probability that two random individuals are different species
# Higher = more diverse
simpson <- diversity(comm_matrix, index = "simpson")
# EVENNESS (Pielou's J)
# How evenly are individuals distributed among species?
# J = H' / ln(S)
# J = 1 means perfectly even; J β 0 means one species dominates
evenness <- shannon / log(richness)
# Compile results
alpha_diversity <- data.frame(
site = rownames(comm_matrix),
richness = richness,
shannon = round(shannon, 3),
simpson = round(simpson, 3),
evenness = round(evenness, 3)
) %>%
left_join(env_data %>% rownames_to_column("site"), by = "site")
print(alpha_diversity)When to Use Each Index:
| Index | Best For | Interpretation |
|---|---|---|
| Richness | Simple comparisons, presence/absence data | βHow many species?β |
| Shannon | General diversity, most publications | βHow uncertain is species identity?β |
| Simpson | When dominance matters | βWhatβs the chance two individuals differ?β |
| Evenness | Understanding community structure | βIs one species dominating?β |
For most ecological studies, report at least richness AND Shannon.
The Sampling Problem:
You collected 500 beetles at Site A (found 25 species) and 100 beetles at Site B (found 15 species).
Is Site A really more diverse? Or did you just sample more?
Rarefaction asks: βIf I had caught only 100 beetles at Site A, how many species would I expect?β
# Check sample sizes (total individuals per site)
sample_sizes <- rowSums(comm_matrix)
cat("Sample sizes per site:\n")
print(sample_sizes)
cat("\nRange:", min(sample_sizes), "-", max(sample_sizes), "individuals\n")
# If sample sizes differ greatly, rarefaction is important!
if (max(sample_sizes) / min(sample_sizes) > 2) {
cat("β οΈ Sample sizes vary by more than 2x. Rarefaction recommended!\n")
}
# Rarefy to smallest sample size
min_n <- min(sample_sizes)
rarefied_richness <- rarefy(comm_matrix, sample = min_n)
# Compare raw vs rarefied
comparison <- data.frame(
site = rownames(comm_matrix),
raw_richness = richness,
sample_size = sample_sizes,
rarefied_richness = round(rarefied_richness, 1),
difference = round(richness - rarefied_richness, 1)
)
print(comparison)
# Rarefaction curves
rarecurve(comm_matrix,
step = 5,
col = rainbow(nrow(comm_matrix)),
lwd = 2,
xlab = "Number of Individuals Sampled",
ylab = "Species Richness",
main = "Rarefaction Curves: Are Sites Adequately Sampled?")
abline(v = min_n, lty = 2, col = "gray")
legend("bottomright", legend = rownames(comm_matrix),
col = rainbow(nrow(comm_matrix)), lwd = 2, cex = 0.7)Reading Rarefaction Curves:
If curves are still steep at your sample sizes, you probably havenβt captured the full community!
# Visualize diversity by habitat
ggplot(alpha_diversity, aes(x = habitat, y = shannon, fill = habitat)) +
geom_boxplot(alpha = 0.7) +
geom_jitter(width = 0.1, size = 3, alpha = 0.7) +
scale_fill_brewer(palette = "Set2") +
labs(
x = "Habitat",
y = "Shannon Diversity (H')",
title = "Does Beetle Diversity Differ Between Habitats?"
) +
theme_bw() +
theme(legend.position = "none")Why Kruskal-Wallis Test?
We use Kruskal-Wallis instead of ANOVA because:
Kruskal-Wallis tests whether the RANKS differ between groups.
# Statistical test: Do habitats differ in diversity?
kruskal_result <- kruskal.test(shannon ~ habitat, data = alpha_diversity)
print(kruskal_result)
# Interpretation
cat("\n=== INTERPRETATION ===\n")
if (kruskal_result$p.value < 0.05) {
cat("β Significant difference in Shannon diversity between habitats (p =",
round(kruskal_result$p.value, 4), ")\n")
# Post-hoc pairwise tests
cat("\nPairwise comparisons:\n")
pairwise.wilcox.test(alpha_diversity$shannon, alpha_diversity$habitat,
p.adjust.method = "bonferroni")
} else {
cat("β No significant difference detected (p =",
round(kruskal_result$p.value, 4), ")\n")
cat(" This doesn't mean habitats are equal - we may lack statistical power.\n")
}β BREAK: 10:00 - 10:15
Time: 10:15 - 11:15 (60 minutes)
The Challenge of Multivariate Data:
Your community matrix has many species (columns). Each site exists in a high-dimensional βspecies space.β
Ordination reduces these many dimensions to 2-3 that we can visualize, while preserving the important patterns.
Key insight: Sites that are close together in an ordination plot have similar species compositions. Sites far apart are different.
# First, we calculate how DIFFERENT each pair of sites is
# Bray-Curtis dissimilarity is the standard for abundance data
dist_bray <- vegdist(comm_matrix, method = "bray")
# View as matrix (first few sites)
cat("Bray-Curtis Dissimilarity Matrix:\n")
print(round(as.matrix(dist_bray)[1:4, 1:4], 3))Why Bray-Curtis Dissimilarity?
| Dissimilarity | Formula Idea | Best For |
|---|---|---|
| Bray-Curtis | Compares shared vs total abundance | Abundance data (most common) |
| Jaccard | Compares shared vs total species | Presence/absence data |
| Euclidean | Straight-line distance | Continuous environmental data |
Bray-Curtis is preferred because:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β ORDINATION METHOD DECISION TREE β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
What type of data do you have?
β
βββββββββββββββββββββββΌββββββββββββββββββββββ
β β β
βΌ βΌ βΌ
ββββββββββββ ββββββββββββ ββββββββββββ
β Abundanceβ β Presence/β β Environ- β
β data β β Absence β β mental β
ββββββ¬ββββββ ββββββ¬ββββββ ββββββ¬ββββββ
β β β
βΌ βΌ βΌ
Bray-Curtis Jaccard Consider PCA
distance distance (different method)
β β
ββββββββββββ¬ββββββββββ
β
βΌ
What matters most for your question?
β
ββββββββββββββ΄βββββββββββββ
β β
βΌ βΌ
ββββββββββββββββ ββββββββββββββββ
β Preserving β β Knowing % β
β RELATIVE β β VARIANCE β
β positions β β explained β
β (which sites β β by each axis β
β are similar) β β β
ββββββββ¬ββββββββ ββββββββ¬ββββββββ
β β
βΌ βΌ
ββββββββββ ββββββββββ
β NMDS β β PCoA β
ββββββββββ ββββββββββ
β β
βΌ βΌ
"The ecology "Microbiome
standard" standard"
Most published Reports %
ecology studies variance
Quick Decision Guide:
| Feature | NMDS | PCoA |
|---|---|---|
| What it preserves | Rank order of distances | Actual distances |
| Axes meaning | Arbitrary (no % variance) | % variance explained |
| Reproducibility | May vary (use set.seed!) | Always the same |
| Best for | Community ecology | When % variance matters |
| Stress/Quality | Report stress value | Report eigenvalues |
For most ecological studies: Use NMDS. Itβs the community standard and handles ecological data well.
Use PCoA when: You need to report % variance explained, or for microbiome data or when your data mostly consist of singletons or 0.
# Run NMDS
set.seed(123) # For reproducibility - NMDS uses random starts!
nmds_result <- metaMDS(
comm_matrix,
distance = "bray", # Bray-Curtis dissimilarity
k = 2, # 2 dimensions for visualization
trymax = 100 # Try 100 random starting positions
)
# Check the stress value
cat("\n=== NMDS RESULTS ===\n")
cat("Stress value:", round(nmds_result$stress, 4), "\n")Interpreting NMDS Stress:
Stress measures how well the 2D plot represents the actual multi-dimensional distances.
| Stress | Quality | Interpretation |
|---|---|---|
| < 0.05 | Excellent | Almost perfect representation |
| 0.05 - 0.10 | Good | Safe to interpret |
| 0.10 - 0.20 | Acceptable | Interpret with some caution |
| 0.20 - 0.30 | Poor | Distances may be misleading |
| > 0.30 | Very poor | Do not interpret! |
Always report stress in publications!
# Interpret stress value
if (nmds_result$stress < 0.10) {
cat("β Good fit! The 2D plot accurately represents community distances.\n")
} else if (nmds_result$stress < 0.20) {
cat("β Acceptable fit. Interpret patterns with some caution.\n")
} else {
cat("β Poor fit. Consider using 3 dimensions (k=3) or different approach.\n")
}
# Stressplot shows how well ordination distances match original distances
stressplot(nmds_result, main = "Shepard Diagram: How Well Does NMDS Fit?")# Extract site scores
nmds_scores <- as.data.frame(scores(nmds_result, display = "sites"))
nmds_scores$site <- rownames(nmds_scores)
# Add environmental data
nmds_scores <- nmds_scores %>%
left_join(env_data %>% rownames_to_column("site"), by = "site")
# Create plot
p_nmds <- ggplot(nmds_scores, aes(x = NMDS1, y = NMDS2,
color = habitat, shape = landscape)) +
geom_point(size = 4, alpha = 0.8) +
stat_ellipse(aes(group = habitat), linetype = 2, level = 0.95) +
scale_color_brewer(palette = "Set1") +
labs(
title = "NMDS: Beetle Community Composition",
subtitle = paste("Stress =", round(nmds_result$stress, 3)),
color = "Habitat",
shape = "Landscape"
) +
theme_bw() +
coord_equal() # IMPORTANT: Equal scaling on both axes!
print(p_nmds)Reading NMDS Plots:
# Run PCoA
pcoa_result <- pcoa(dist_bray)
# Check variance explained
eigenvalues <- pcoa_result$values$Eigenvalues
var_explained <- eigenvalues / sum(abs(eigenvalues)) * 100
cat("\n=== PCoA VARIANCE EXPLAINED ===\n")
cat("Axis 1:", round(var_explained[1], 1), "%\n")
cat("Axis 2:", round(var_explained[2], 1), "%\n")
cat("Axes 1+2:", round(sum(var_explained[1:2]), 1), "%\n")
# Extract scores and create plot
pcoa_scores <- as.data.frame(pcoa_result$vectors[, 1:2])
colnames(pcoa_scores) <- c("PCoA1", "PCoA2")
pcoa_scores$site <- rownames(pcoa_scores)
pcoa_scores <- pcoa_scores %>%
left_join(env_data %>% rownames_to_column("site"), by = "site")
p_pcoa <- ggplot(pcoa_scores, aes(x = PCoA1, y = PCoA2, color = habitat)) +
geom_point(size = 4, alpha = 0.8) +
stat_ellipse(linetype = 2) +
scale_color_brewer(palette = "Set1") +
labs(
title = "PCoA: Beetle Community Composition",
x = paste0("PCoA1 (", round(var_explained[1], 1), "%)"),
y = paste0("PCoA2 (", round(var_explained[2], 1), "%)"),
color = "Habitat"
) +
theme_bw() +
coord_equal()
print(p_pcoa)π½οΈ LUNCH: 11:15 - 12:15
Time: 12:15 - 13:30 (75 minutes)
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β COMMUNITY ANALYSIS WORKFLOW β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β STEP 1: VISUALIZE (NMDS/PCoA) β
β "Do the groups LOOK different?" β
β If no visible pattern, statistical test will likely β
β be non-significant too. β
ββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β STEP 2: TEST SIGNIFICANCE (PERMANOVA) β
β "Is the difference STATISTICALLY SIGNIFICANT?" β
β p < 0.05 means unlikely to be due to chance. β
ββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β STEP 3: CHECK EFFECT SIZE (RΒ²) β
β "HOW MUCH of the variation does habitat explain?" β
β A significant p-value with tiny RΒ² = trivial effect. β
ββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β STEP 4: CHECK ASSUMPTIONS (betadisper) β
β "Is the test valid?" β
β PERMANOVA assumes groups have similar spread. β
ββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β STEP 5: IDENTIFY KEY SPECIES (SIMPER) β
β "WHICH species drive the differences?" β
β Connects statistics back to biology. β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Why PERMANOVA Instead of ANOVA?
PERMANOVA = Permutational Multivariate Analysis of Variance
It works by:
# PERMANOVA: Do beetle communities differ between habitats?
permanova_result <- adonis2(
comm_matrix ~ habitat * landscape, # Test habitat, landscape, and interaction
data = env_data,
method = "bray", # Bray-Curtis dissimilarity
permutations = 999 # 999 permutations for p-value
)
print(permanova_result)cat("\n=== PERMANOVA INTERPRETATION ===\n\n")
# Extract key values
r2_habitat <- permanova_result$R2[1]
f_habitat <- permanova_result$F[1]
p_habitat <- permanova_result$`Pr(>F)`[1]
cat("HABITAT EFFECT:\n")
cat(" F-value:", round(f_habitat, 2), "\n")
cat(" RΒ² =", round(r2_habitat, 3),
"(", round(r2_habitat * 100, 1), "% of variation explained)\n")
cat(" p-value:", p_habitat, "\n\n")
# Significance interpretation
if (p_habitat < 0.001) {
cat(" β
Highly significant (p < 0.001)\n")
} else if (p_habitat < 0.01) {
cat(" β
Very significant (p < 0.01)\n")
} else if (p_habitat < 0.05) {
cat(" β
Significant (p < 0.05)\n")
} else {
cat(" β Not significant (p β₯ 0.05)\n")
}
# Effect size interpretation
cat("\nEFFECT SIZE INTERPRETATION:\n")
if (r2_habitat < 0.05) {
cat(" Effect is TINY - statistically significant but ecologically trivial\n")
} else if (r2_habitat < 0.10) {
cat(" Effect is SMALL - habitat explains a minor portion\n")
} else if (r2_habitat < 0.25) {
cat(" Effect is MEDIUM - habitat is an important factor\n")
} else {
cat(" Effect is LARGE - habitat is a major driver of community structure\n")
}Effect Size (RΒ²) Guide:
| RΒ² | Interpretation | Example |
|---|---|---|
| < 0.05 | Tiny | Habitat explains less than 5% of variation |
| 0.05-0.10 | Small | Habitat is one of many factors |
| 0.10-0.25 | Medium | Habitat is an important driver |
| > 0.25 | Large | Habitat is a dominant factor |
Always report both p-value AND RΒ²! A significant p-value with tiny RΒ² means the effect is real but not very important.
Why Check Dispersion?
PERMANOVA assumes that groups have similar within-group variation (dispersion).
If one habitat has highly variable communities and another is very consistent, PERMANOVA might detect this as a βdifferenceβ even if the average compositions are identical!
betadisper tests this assumption.
# Test for homogeneity of dispersion
dispersion <- betadisper(dist_bray, env_data$habitat)
dispersion_test <- permutest(dispersion, permutations = 999)
cat("\n=== DISPERSION TEST ===\n")
print(dispersion_test)
# Interpretation
if (dispersion_test$tab$`Pr(>F)`[1] > 0.05) {
cat("\nβ Dispersions are homogeneous (p > 0.05).\n")
cat(" PERMANOVA assumption is met - results are reliable.\n")
} else {
cat("\nβ Dispersions are UNEQUAL (p < 0.05)!\n")
cat(" PERMANOVA results may be influenced by dispersion differences.\n")
cat(" Interpret with caution.\n")
}
# Visualize dispersion
boxplot(dispersion, main = "Distance to Group Centroid by Habitat",
xlab = "Habitat", ylab = "Distance to centroid")SIMPER = Similarity Percentages
PERMANOVA tells us communities differ, but not WHY.
SIMPER identifies which species contribute most to the differences.
This connects statistics back to biology!
# SIMPER analysis
simper_result <- simper(comm_matrix, env_data$habitat, permutations = 999)
cat("\n=== SIMPER RESULTS ===\n")
cat("Top species contributing to habitat differences:\n\n")
summary(simper_result)Reading SIMPER Output:
| Column | Meaning |
|---|---|
| average | Average contribution to dissimilarity |
| sd | Standard deviation of contribution |
| ratio | average/sd - consistency of contribution |
| cumsum | Cumulative sum (% explained so far) |
Key indicators:
β BREAK: 13:30 - 13:45
Time: 13:45 - 14:45 (60 minutes)
For Every Statistical Result, Ask:
Example Finding: Beetle communities differ significantly between habitats (PERMANOVA: RΒ² = 0.32, p = 0.001, stress = 0.12)
1. WHAT did we find?
Beetle species composition differs between forest, grassland, and agricultural habitats. About 32% of the total variation in beetle communities is explained by habitat type alone.
2. HOW BIG is the effect?
This is a large effect (RΒ² > 0.25). Habitat is a major driver of beetle community structure. The NMDS stress of 0.12 indicates the pattern is reliably represented in 2D.
3. WHY might this happen? (Ecological mechanisms)
4. SO WHAT? (Implications)
5. WHAT NEXT? (Limitations and new questions)
Limitations: - Single sampling season (seasonal turnover not captured) - Pitfall traps favor ground-active species (canopy beetles missed) - Morphospecies identification may underestimate true diversity - No data on habitat quality within types
New questions: - Which specific environmental variables (temperature, leaf litter depth, soil moisture) drive beetle distributions? - Does forest fragment size affect beetle community composition? - How quickly do beetle communities recover after habitat restoration? - Are there βspilloverβ effects at habitat edges?
# Create results summary table
results_summary <- data.frame(
Analysis = c(
"Alpha Diversity (Shannon)",
"NMDS Ordination",
"PERMANOVA (Habitat)",
"PERMANOVA (Landscape)",
"Dispersion Test"
),
Statistic = c(
paste("Kruskal ΟΒ² =", round(kruskal_result$statistic, 2)),
paste("Stress =", round(nmds_result$stress, 3)),
paste("F =", round(permanova_result$F[1], 2)),
paste("F =", round(permanova_result$F[2], 2)),
paste("F =", round(dispersion_test$tab$F[1], 2))
),
Effect_Size = c(
"See boxplot",
"Visual separation",
paste("RΒ² =", round(permanova_result$R2[1], 3)),
paste("RΒ² =", round(permanova_result$R2[2], 3)),
"NA"
),
P_value = c(
round(kruskal_result$p.value, 4),
"NA",
permanova_result$`Pr(>F)`[1],
permanova_result$`Pr(>F)`[2],
round(dispersion_test$tab$`Pr(>F)`[1], 4)
),
Interpretation = c(
ifelse(kruskal_result$p.value < 0.05, "Significant", "Not significant"),
ifelse(nmds_result$stress < 0.2, "Good fit", "Poor fit"),
ifelse(permanova_result$`Pr(>F)`[1] < 0.05, "Communities differ", "No difference"),
ifelse(permanova_result$`Pr(>F)`[2] < 0.05, "Effect of landscape", "No effect"),
ifelse(dispersion_test$tab$`Pr(>F)`[1] > 0.05, "Assumption met", "Assumption violated")
)
)
cat("\nβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n")
cat("β RESULTS SUMMARY TABLE β\n")
cat("βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n\n")
print(results_summary, row.names = FALSE)# Create comprehensive figure
p_final <- ggplot(nmds_scores, aes(x = NMDS1, y = NMDS2,
color = habitat, shape = landscape)) +
geom_point(size = 5, alpha = 0.8) +
stat_ellipse(aes(group = habitat), linetype = 2, linewidth = 1) +
scale_color_brewer(palette = "Set1") +
scale_shape_manual(values = c(16, 17)) +
labs(
title = "Beetle Community Composition Across Habitats",
subtitle = sprintf("NMDS stress = %.3f | PERMANOVA: RΒ² = %.2f, p = %.3f",
nmds_result$stress,
permanova_result$R2[1],
permanova_result$`Pr(>F)`[1]),
color = "Habitat",
shape = "Landscape"
) +
theme_bw(base_size = 14) +
theme(
legend.position = "right",
plot.title = element_text(face = "bold"),
plot.subtitle = element_text(size = 10)
) +
coord_equal()
print(p_final)
# Save
ggsave("figures/nmds_final.png", p_final, width = 10, height = 8, dpi = 300)#============================================================================
# COMPLETE ANALYSIS SCRIPT TEMPLATE
# Copy and modify for your own analyses
#============================================================================
# 1. SETUP ----
rm(list = ls())
library(tidyverse)
library(vegan)
library(ape)
set.seed(42)
# 2. IMPORT DATA ----
raw_data <- read.csv("data/raw/sample_insect_data.csv")
# 3. EXPLORE AND CHOOSE FOCAL TAXA ----
# [Run decision framework from Session 4]
# 4. PREPARE DATA ----
comm_matrix <- raw_data %>%
filter(order == "Coleoptera") %>% # Your chosen focal group
group_by(site, morphospecies) %>%
summarise(abundance = sum(abundance), .groups = "drop") %>%
pivot_wider(names_from = morphospecies, values_from = abundance, values_fill = 0) %>%
column_to_rownames("site")
env_data <- raw_data %>%
select(site, habitat, landscape) %>%
distinct() %>%
column_to_rownames("site")
env_data <- env_data[rownames(comm_matrix), ]
# 5. ALPHA DIVERSITY ----
alpha_div <- data.frame(
site = rownames(comm_matrix),
richness = specnumber(comm_matrix),
shannon = diversity(comm_matrix, "shannon")
) %>%
left_join(env_data %>% rownames_to_column("site"), by = "site")
kruskal.test(shannon ~ habitat, data = alpha_div)
# 6. BETA DIVERSITY ----
dist_bray <- vegdist(comm_matrix, method = "bray")
nmds <- metaMDS(comm_matrix, distance = "bray", k = 2, trymax = 100)
cat("NMDS Stress:", nmds$stress, "\n")
# 7. STATISTICAL TESTS ----
permanova <- adonis2(comm_matrix ~ habitat * landscape,
data = env_data, method = "bray", permutations = 999)
print(permanova)
betadisper(dist_bray, env_data$habitat) %>% permutest()
simper(comm_matrix, env_data$habitat)
# 8. SAVE OUTPUTS ----
write.csv(alpha_div, "output/alpha_diversity.csv", row.names = FALSE)
ggsave("figures/nmds_plot.png", width = 8, height = 6, dpi = 300)Time: 14:45 - 15:00 (15 minutes)
Day 1:
Day 2:
Books:
Online:
Packages to explore:
BiodiversityR - More biodiversity analysis toolsindicspecies - Indicator species analysisiNEXT - Diversity extrapolationββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β R FOR INSECT ECOLOGY - QUICK REFERENCE β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ£
β β
β DATA WRANGLING (dplyr) β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β filter(data, condition) Keep rows matching condition β
β select(data, columns) Keep selected columns β
β mutate(data, new_col = ...) Create new columns β
β group_by() %>% summarise() Aggregate by groups β
β pivot_wider() / pivot_longer() Reshape data β
β β
β DIVERSITY METRICS (vegan) β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β specnumber(x) Species richness β
β diversity(x, "shannon") Shannon H' β
β diversity(x, "simpson") Simpson 1-D β
β rarefy(x, sample = n) Rarefied richness β
β β
β ORDINATION β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β vegdist(x, "bray") Bray-Curtis dissimilarity β
β metaMDS(x, k = 2) NMDS (report stress!) β
β pcoa(dist) PCoA (report % variance) β
β β
β STATISTICAL TESTS β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β adonis2(matrix ~ factor) PERMANOVA (report RΒ² AND p!) β
β betadisper(dist, group) Check dispersion assumption β
β simper(matrix, group) Identify key species β
β β
β INTERPRETATION GUIDE β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β NMDS Stress: < 0.10 good | 0.10-0.20 okay | > 0.20 poor β
β PERMANOVA RΒ²: < 0.05 tiny | 0.05-0.10 small | 0.10-0.25 med | > 0.25 large β
β β
β DECISION CHECKLIST FOR FOCAL TAXA β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β‘ Total abundance β₯ 50? β
β β‘ Number of species β₯ 5? β
β β‘ Present in β₯ 50% of sites? β
β β‘ Shows habitat variation (CV > 30%)? β
β β‘ Ecologically relevant for pitfall sampling? β
β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Happy coding and happy bug hunting! π