Workshop Overview

Schedule

╔══════════════════════════════════════════════════════════════════════════════╗
║  DAY 1: DATA FOUNDATIONS                         9:00 - 15:00                ║
╠══════════════════════════════════════════════════════════════════════════════╣
║  09:00 - 10:15  Session 1: Project Setup & R Essentials            75 min    ║
║  10:15 - 10:30  ☕ Break                                            15 min    ║
║  10:30 - 11:30  Session 2: Importing & Exploring Data              60 min    ║
║  11:30 - 12:30  🍽️ LUNCH                                            60 min   ║
║  12:30 - 13:45  Session 3: Data Wrangling with tidyverse           75 min    ║
║  13:45 - 14:00  ☕ Break                                            15 min    ║
║  14:00 - 15:00  Session 4: Data Exploration & Cleaning             60 min    ║
╠══════════════════════════════════════════════════════════════════════════════╣
║  DAY 2: ANALYSIS & INTERPRETATION                9:00 - 15:00                ║
╠══════════════════════════════════════════════════════════════════════════════╣
║  09:00 - 10:00  Session 5: Visualization with ggplot2              60 min    ║
║  10:00 - 10:15  ☕ Break                                            15 min    ║
║  10:15 - 11:15  Session 6: Diversity Metrics                       60 min    ║
║  11:15 - 12:15  🍽️ LUNCH                                            60 min   ║
║  12:15 - 13:30  Session 7: Multivariate Analysis (NMDS)            75 min    ║
║  13:30 - 13:45  ☕ Break                                            15 min    ║
║  13:45 - 14:45  Session 8: Statistical Testing                     60 min    ║
║  14:45 - 15:00  Wrap-up & Next Steps                               15 min    ║
╚══════════════════════════════════════════════════════════════════════════════╝

Prerequisites

Before this workshop, you should have completed:

✅ Installed R and RStudio
✅ Installed required packages (tidyverse, vegan, ape, indicspecies)
✅ Created the workshop project with folder structure
✅ Completed the Pre-Workshop Preparation exercises

The Ecological Context

When we sample insects across different habitats or landscapes, we’re fundamentally asking:

“Do different environments support different insect communities, and why?”

This connects to ecological theory:

Niche theory: Species occur where conditions match their requirements
Habitat filtering: Environmental conditions “filter” which species persist
Dispersal limitation: Species can only occur where they can reach

Our statistical analyses detect these patterns and test hypotheses.

Analysis Workflow Overview

RAW DATA → CLEAN & PREPARE → EXPLORE → ANALYZE → INTERPRET
              │                  │         │
              │                  │         ├── Alpha diversity
              │                  │         ├── NMDS ordination
              │                  │         └── PERMANOVA
              │                  │
              │                  └── Which taxa to focus on?
              │
              └── Community matrix + Environmental data

DAY 1: DATA FOUNDATIONS

1 Session 1: Project Setup & R Essentials (75 min)

1.1 Opening Your Workshop Project

🔑 KEY CONCEPT: Always work within an R Project. Never use setwd().

1.1.1 Starting the Workshop

Open RStudio
Go to File → Open Project
Navigate to your Insect_Ecology_Workshop folder
Select the .Rproj file
Click Open

Verify you’re in the right place:

# Check your working directory
getwd()
# Should show: ".../Insect_Ecology_Workshop"

# List files in your project
list.files()
# Should show: data, scripts, output, figures folders
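
The manual check above can also be done programmatically. This is a minimal sketch that assumes the four top-level folders named in this manual; `check_project_folders()` is a hypothetical helper written for illustration, not a function from any package.

```r
# Hypothetical helper: report which expected project folders exist under `root`
check_project_folders <- function(root = ".",
                                  folders = c("data", "scripts", "output", "figures")) {
  found <- dir.exists(file.path(root, folders))
  names(found) <- folders
  found
}

# Run from the project root; every entry should be TRUE
check_project_folders()
```

Any `FALSE` entry points to a folder that still needs to be created, or to a session that was not started from the .Rproj file.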

1.1.2 If You Haven’t Created the Project Yet

# Create folder structure
dir.create("data")
dir.create("data/raw")
dir.create("data/processed")
dir.create("scripts")
dir.create("output")
dir.create("figures")

# Verify
list.files()

1.2 Project Organization Best Practices

1.2.1 The Golden Structure

Insect_Ecology_Workshop/
│
├── Insect_Ecology_Workshop.Rproj   ← Always open THIS file!
│
├── data/
│   ├── raw/                        ← Original data (NEVER modify!)
│   │   ├── sample_insect_data.csv
│   │   └── pollinator_data.csv
│   └── processed/                  ← Cleaned data goes here
│       └── beetles_clean.csv
│
├── scripts/                        ← Your R scripts
│   ├── 01_data_import.R
│   ├── 02_exploration.R
│   ├── 03_diversity.R
│   └── 04_ordination.R
│
├── output/                         ← Tables, results
│   └── diversity_table.csv
│
└── figures/                        ← Saved plots
    ├── nmds_plot.png
    └── diversity_boxplot.pdf

1.2.2 Why This Matters

# ❌ BAD - Absolute paths that break on other computers:
setwd("C:/Users/John/Desktop/thesis/chapter2/data")
data <- read.csv("beetles.csv")

# ❌ BAD - Files scattered everywhere:
data <- read.csv("C:/Users/John/Downloads/beetles.csv")

# ✅ GOOD - Relative paths from project folder:
data <- read.csv("data/raw/beetles.csv")
# Works on ANY computer with the same folder structure!

# ✅ GOOD - Saving outputs to organized folders:
write.csv(results, "output/diversity_results.csv")
ggsave("figures/nmds_plot.png")

1.2.3 The Raw Data Rule

Never modify files in data/raw/!

Your raw data should remain exactly as you received it. If you need to clean or modify data:

Read from data/raw/
Clean in R
Save to data/processed/

library(tidyverse)

# Read raw data
beetles_raw <- read.csv("data/raw/beetle_survey.csv")

# Clean it: beetles_clean is the new, cleaned object
beetles_clean <- beetles_raw %>%
  # filter() keeps only the rows that meet a condition;
  # here, the rows where abundance is not missing
  filter(!is.na(abundance)) %>%
  # mutate() creates a new column or modifies an existing one;
  # tolower() converts all text in the habitat column to lowercase
  mutate(habitat = tolower(habitat))

# Save the cleaned version to the processed folder
write.csv(beetles_clean, "data/processed/beetles_clean.csv", row.names = FALSE)
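
To try the read, clean, save cycle without touching your real project, the sketch below runs it in a temporary folder. The file contents are made up for illustration, and only dplyr is needed.

```r
library(dplyr)

# Work in a temporary folder so nothing in your real project is touched
demo_dir <- file.path(tempdir(), "raw_rule_demo")
dir.create(file.path(demo_dir, "data", "raw"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo_dir, "data", "processed"), recursive = TRUE, showWarnings = FALSE)

# Made-up raw file with one missing abundance and inconsistent case
raw_path <- file.path(demo_dir, "data", "raw", "beetle_survey.csv")
writeLines(c("site,habitat,abundance",
             "S01,Forest,12",
             "S02,GRASSLAND,NA",
             "S03,forest,7"), raw_path)

beetles_raw <- read.csv(raw_path)

beetles_clean <- beetles_raw %>%
  filter(!is.na(abundance)) %>%
  mutate(habitat = tolower(habitat))

# The cleaned copy goes to processed/; the raw file is never overwritten
clean_path <- file.path(demo_dir, "data", "processed", "beetles_clean.csv")
write.csv(beetles_clean, clean_path, row.names = FALSE)

nrow(beetles_raw)    # 3 rows in the untouched raw file
nrow(beetles_clean)  # 2 rows after dropping the missing abundance
```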

1.3 R Essentials Quick Review

1.3.1 Objects and Assignment

# Store values in objects using <-
n_sites <- 12
study_area <- "West Java"
is_dry_season <- TRUE

# View objects by typing their name
n_sites
study_area

# Keyboard shortcut for <- : Alt + - (Windows) or Option + - (Mac)

1.3.2 Functions

# Functions perform actions: function_name(arguments)
mean(c(10, 20, 30))        # Calculate mean
sum(c(1, 2, 3, 4, 5))      # Calculate sum
length(c(5, 10, 15))       # Count elements

# Functions can have multiple arguments
round(3.14159, digits = 2)  # Round to 2 decimal places

1.3.3 Vectors

# Vectors are collections of values
abundances <- c(12, 45, 23, 8, 67, 34)

# Operations on vectors
sum(abundances)            # 189
mean(abundances)           # 31.5
abundances * 2             # Doubles each element

# Indexing: extract elements with [ ]
abundances[1]              # First element: 12
abundances[c(1, 3, 5)]     # Elements 1, 3, 5
abundances[abundances > 30] # Elements greater than 30

1.3.4 Data Frames

# Data frames are like spreadsheets
beetle_data <- data.frame(
  site = c("A", "B", "C", "D"),
  habitat = c("forest", "forest", "grassland", "grassland"),
  abundance = c(45, 32, 67, 51)
)

# Access columns with $
beetle_data$abundance
beetle_data$habitat

# Access rows with [ row , column ]
beetle_data[1, ]           # First row
beetle_data[, 3]           # Third column
beetle_data[beetle_data$habitat == "forest", ]  # Forest rows only

1.4 Loading Packages

# Load packages at the start of EVERY session
library(tidyverse)    # Data wrangling and visualization
library(vegan)        # Community ecology
library(ape)          # PCoA

# If you get "there is no package called 'x'":
# install.packages("x")
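
If you are unsure which workshop packages are installed, a short check can list the missing ones. This is a sketch only: `find_missing_packages()` is a hypothetical helper, and the install line is left commented out so nothing is downloaded by accident.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed
find_missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

workshop_pkgs <- c("tidyverse", "vegan", "ape", "indicspecies")
missing_pkgs <- find_missing_packages(workshop_pkgs)
missing_pkgs  # character(0) means everything is installed

# Uncomment to install whatever is missing:
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```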

1.5 Creating Your First Script

Create a new script for this workshop:

File → New File → R Script
Save as: scripts/01_day1_analysis.R

Script header template:

#============================================================================
# Insect Community Analysis - Day 1
# Author: [Your Name]
# Date: [Today's Date]
# Description: Import, clean, and explore insect survey data
#============================================================================

# Load packages ----
library(tidyverse)
library(vegan)

# Import data ----

# Data exploration ----

# Analysis ----

1.5.1 💡 Practice Exercise (5 min)

Create a new script called scripts/practice.R
Add a header with your name and today’s date
Load the tidyverse package
Create a vector called my_counts with values 5, 12, 8, 20, 15
Calculate the mean and save it to an object called avg_count
Save your script!

2 Session 2: Importing & Exploring Data (60 min)

2.1 Importing CSV Files

# Make sure your data file is in data/raw/
# Import with read.csv()
insect_data <- read.csv("data/raw/sample_insect_data.csv")

# Common options for messy data:
insect_data <- read.csv(
  "data/raw/sample_insect_data.csv",
  header = TRUE,                    # First row is column names
  stringsAsFactors = FALSE,         # Keep text as text
  na.strings = c("", "NA", "N/A")   # Recognize these as missing
)
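
To see what `na.strings` changes, the sketch below imports the same small file twice. The file is made up for illustration and written to a temporary location.

```r
# Made-up CSV with three different spellings of "missing"
csv_path <- tempfile(fileext = ".csv")
writeLines(c("site,habitat,abundance",
             "S01,forest,12",
             "S02,,N/A",
             "S03,grassland,NA",
             "S04,forest,5"), csv_path)

# Default import: "N/A" is not recognised, so abundance is no longer numeric
default_import <- read.csv(csv_path)
is.numeric(default_import$abundance)

# With na.strings: all three spellings become NA and abundance stays numeric
tidy_import <- read.csv(csv_path, na.strings = c("", "NA", "N/A"))
is.numeric(tidy_import$abundance)
colSums(is.na(tidy_import))
```

A numeric column that arrives as text is often the first sign that a missing-value code was not recognised.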

2.2 First Look at Your Data

Always explore data immediately after importing!

# View the first few rows
head(insect_data)

# View in spreadsheet format (capital V!)
View(insect_data)

# Structure: column types and preview
str(insect_data)

# Dimensions: rows × columns
dim(insect_data)
nrow(insect_data)
ncol(insect_data)

# Column names
names(insect_data)

# Summary statistics
summary(insect_data)

2.3 Checking for Problems

# Missing values
sum(is.na(insect_data))           # Total NAs
colSums(is.na(insect_data))       # NAs per column

# Unique values in categorical columns
unique(insect_data$habitat)
unique(insect_data$order)

# Check for unexpected values
table(insect_data$habitat)        # Frequency table
table(insect_data$order)

# Check numeric ranges
range(insect_data$abundance, na.rm = TRUE)  # na.rm = TRUE ignores missing values
summary(insect_data$abundance)

2.4 Basic Subsetting

# Filter rows by condition
forest_data <- insect_data[insect_data$habitat == "forest", ]
coleoptera <- insect_data[insect_data$order == "Coleoptera", ]

# Select specific columns
selected <- insect_data[, c("site", "habitat", "morphospecies", "abundance")]

# Combine conditions
forest_beetles <- insect_data[
  insect_data$habitat == "forest" & insect_data$order == "Coleoptera", 
]

2.4.1 💡 Practice Exercise (10 min)

Import sample_insect_data.csv
How many rows and columns does it have?
What are the unique habitat types?
How many records are there for each order?
Are there any missing values?

3 Session 3: Data Wrangling with tidyverse (75 min)

3.1 The Pipe Operator

The pipe %>% chains operations together. Read it as “then”.

library(tidyverse)

# Without pipe (nested, hard to read):
round(mean(sqrt(c(1, 4, 9, 16))), 2)

# With pipe (step by step, easy to read):
c(1, 4, 9, 16) %>%
  sqrt() %>%
  mean() %>%
  round(2)

# Read as: "Take 1,4,9,16, THEN sqrt, THEN mean, THEN round"

Keyboard shortcut: Ctrl + Shift + M (Windows) or Cmd + Shift + M (Mac)
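
The pipe passes whatever is on its left as the first argument of the function on its right, so `x %>% f(y)` is evaluated as `f(x, y)`. A small check, assuming only that dplyr is installed (it re-exports `%>%`):

```r
library(dplyr)  # provides %>%

counts <- c(1, 4, 9, 16)

# Piped and nested versions are the same computation
piped  <- counts %>% sqrt() %>% mean() %>% round(2)
nested <- round(mean(sqrt(counts)), 2)

piped                     # 2.5
identical(piped, nested)  # TRUE
```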

3.2 Core dplyr Verbs

3.2.1 filter(): Keep Rows

# Keep only forest sites
insect_data %>%
  filter(habitat == "forest")

# Multiple conditions (AND)
insect_data %>%
  filter(habitat == "forest", abundance > 10)

# Multiple options (OR)
insect_data %>%
  filter(habitat %in% c("forest", "grassland"))

# Exclude
insect_data %>%
  filter(habitat != "agriculture")

3.2.2 select(): Choose Columns

# Keep specific columns
insect_data %>%
  select(site, habitat, morphospecies, abundance)

# Exclude columns
insect_data %>%
  select(-trap_id, -family)

# Select helpers
insect_data %>%
  select(starts_with("site"))

insect_data %>%
  select(contains("species"))

3.2.3 mutate(): Create New Columns

insect_data %>%
  mutate(
    log_abundance = log(abundance + 1),
    abundance_doubled = abundance * 2,
    habitat_upper = toupper(habitat)
  )

3.2.4 arrange(): Sort Rows

# Ascending
insect_data %>%
  arrange(abundance)

# Descending
insect_data %>%
  arrange(desc(abundance))

# Multiple columns
insect_data %>%
  arrange(habitat, desc(abundance))

3.2.5 summarise(): Calculate Summaries

insect_data %>%
  summarise(
    total_abundance = sum(abundance),
    mean_abundance = mean(abundance),
    n_records = n()
  )

3.2.6 group_by() + summarise(): Summaries by Group

This is extremely powerful!

# Summary by habitat
insect_data %>%
  group_by(habitat) %>%
  summarise(
    total_abundance = sum(abundance),
    n_species = n_distinct(morphospecies),
    n_sites = n_distinct(site),
    .groups = "drop"
  )

# Summary by habitat AND order
insect_data %>%
  group_by(habitat, order) %>%
  summarise(
    mean_abundance = mean(abundance),
    n_species = n_distinct(morphospecies),
    .groups = "drop"
  )
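
A worked example on a tiny, made-up data frame makes it easier to check the grouped summaries by hand. The column names follow the workshop data, but the values are invented.

```r
library(dplyr)

toy <- data.frame(
  site          = c("S01", "S01", "S02", "S03", "S03", "S03"),
  habitat       = c("forest", "forest", "forest", "grassland", "grassland", "grassland"),
  morphospecies = c("sp1", "sp2", "sp1", "sp1", "sp3", "sp3"),
  abundance     = c(12, 5, 18, 3, 7, 2)
)

habitat_summary <- toy %>%
  group_by(habitat) %>%
  summarise(
    total_abundance = sum(abundance),
    n_species = n_distinct(morphospecies),
    n_sites = n_distinct(site),
    .groups = "drop"
  )

habitat_summary
# forest:    total_abundance = 35, n_species = 2, n_sites = 2
# grassland: total_abundance = 12, n_species = 2, n_sites = 1
```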

3.3 Reshaping Data

3.3.1 Wide vs Long Format

Long format (one observation per row):

site  | species     | abundance
------|-------------|----------
S01   | Carabus_sp1 | 12
S01   | Carabus_sp2 | 5
S02   | Carabus_sp1 | 18

Wide format (community matrix):

site  | Carabus_sp1 | Carabus_sp2
------|-------------|------------
S01   | 12          | 5
S02   | 18          | 0

3.3.2 pivot_wider(): Long to Wide

# Create community matrix
community_matrix <- insect_data %>%
  group_by(site, morphospecies) %>%
  summarise(abundance = sum(abundance), .groups = "drop") %>%
  pivot_wider(
    names_from = morphospecies,
    values_from = abundance,
    values_fill = 0           # Fill missing with 0
  )

head(community_matrix)

3.3.3 pivot_longer(): Wide to Long

# Convert back to long format
long_data <- community_matrix %>%
  pivot_longer(
    cols = -site,             # All columns except site
    names_to = "species",
    values_to = "abundance"
  )
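
The long and wide tables shown in 3.3.1 can be reproduced with a three-row example. This sketch uses only tidyr and the same made-up values as those tables.

```r
library(tidyr)

long <- data.frame(
  site = c("S01", "S01", "S02"),
  species = c("Carabus_sp1", "Carabus_sp2", "Carabus_sp1"),
  abundance = c(12, 5, 18)
)

# Long to wide: S02 has no Carabus_sp2 record, so values_fill supplies a 0
wide <- pivot_wider(long, names_from = species, values_from = abundance,
                    values_fill = 0)
wide

# Wide back to long: the filled-in zero is now its own row
back_to_long <- pivot_longer(wide, cols = -site,
                             names_to = "species", values_to = "abundance")
nrow(back_to_long)  # 4
```

The round trip does not return the original three rows, because the wide format makes absences explicit.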

3.4 Chaining Multiple Operations

# Complete data preparation pipeline
beetle_summary <- insect_data %>%
  # Filter to beetles only
  filter(order == "Coleoptera") %>%
  # Group by site, habitat, and landscape
  group_by(site, habitat, landscape) %>%
  # Calculate summary statistics
  summarise(
    abundance = sum(abundance),
    richness = n_distinct(morphospecies),
    .groups = "drop"
  ) %>%
  # Add new columns
  mutate(
    log_abundance = log(abundance + 1)
  ) %>%
  # Sort by richness
  arrange(desc(richness))

beetle_summary

3.4.1 💡 Practice Exercise (15 min)

Using insect_data:

Filter to keep only Hymenoptera
Group by habitat and calculate total abundance and species richness
Arrange by total abundance (descending)
Which habitat has the most Hymenoptera?

4 Session 4: Data Exploration & Cleaning (60 min)

4.1 Exploring Your Data: Finding the Story

Data exploration is detective work. Find the patterns before testing them.

4.1.1 Overall Dataset Summary

# Quick overview
cat("=== DATASET OVERVIEW ===\n")
cat("Total records:", nrow(insect_data), "\n")
cat("Total individuals:", sum(insect_data$abundance), "\n")
cat("Number of sites:", n_distinct(insect_data$site), "\n")
cat("Number of habitats:", n_distinct(insect_data$habitat), "\n")
cat("Number of orders:", n_distinct(insect_data$order), "\n")
cat("Number of morphospecies:", n_distinct(insect_data$morphospecies), "\n")

4.1.2 Summary by Taxonomic Group

order_summary <- insect_data %>%
  group_by(order) %>%
  summarise(
    total_abundance = sum(abundance),
    n_species = n_distinct(morphospecies),
    n_sites = n_distinct(site),
    .groups = "drop"
  ) %>%
  arrange(desc(total_abundance))

order_summary

4.2 Choosing Focal Taxa

Not all groups are suitable for analysis. Choose wisely!

4.2.1 Criteria for Focal Taxa

Criterion	Minimum	Why
Total abundance	≥ 50	Statistical power
Species richness	≥ 5	Diversity to analyze
Sites present	≥ 50% of sites	Not just rare occurrence
Habitat variation	CV > 20%	Interesting patterns
Ecological relevance	High	Meaningful interpretation

4.2.2 Evaluating Groups

# Calculate habitat variation
habitat_variation <- insect_data %>%
  group_by(order, habitat) %>%
  summarise(abundance = sum(abundance), .groups = "drop") %>%
  group_by(order) %>%
  summarise(
    cv_abundance = sd(abundance) / mean(abundance) * 100,
    .groups = "drop"
  )

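# Combine with order summary and apply the criteria from the table above
total_sites <- n_distinct(insect_data$site)

order_evaluation <- order_summary %>%
  left_join(habitat_variation, by = "order") %>%
  mutate(
    recommended = total_abundance >= 50 &
                  n_species >= 5 &
                  n_sites >= 0.5 * total_sites &   # Present at half the sites or more
                  cv_abundance > 20
  )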

order_evaluation
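
The coefficient of variation (CV) used above is the standard deviation divided by the mean, expressed as a percentage. A hand-checkable example with made-up habitat totals for a single order:

```r
# Made-up totals for one order across three habitats
abundance_by_habitat <- c(forest = 10, grassland = 20, agriculture = 30)

# sd = 10, mean = 20, so CV = 10 / 20 * 100
cv <- sd(abundance_by_habitat) / mean(abundance_by_habitat) * 100
cv  # 50

# An order that is equally abundant everywhere has a CV of 0
sd(c(20, 20, 20)) / mean(c(20, 20, 20)) * 100  # 0
```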

4.2.3 Ecological Considerations for Pitfall Traps

Group	Pitfall Suitability	Notes
Carabidae (ground beetles)	Excellent	Well-studied indicators
Formicidae (ants)	Excellent	Colonial, sensitive to disturbance
Araneae (spiders)	Good	Predators, hunting guilds
Staphylinidae (rove beetles)	Good	Diverse decomposers
Orthoptera (grasshoppers)	Moderate	Vegetation-dependent
Flying insects	Poor	Undersampled by pitfalls

4.3 Data Cleaning

4.3.1 Common Problems and Solutions

# Check for problems
unique(insect_data$habitat)  # Look for typos, case issues

# Standardize text
clean_data <- insect_data %>%
  mutate(
    habitat = tolower(trimws(habitat)),     # Lowercase, remove spaces
    habitat = case_when(                     # Fix typos
      habitat == "forrest" ~ "forest",
      habitat == "grasland" ~ "grassland",
      TRUE ~ habitat
    )
  )

# Handle impossible values
clean_data <- clean_data %>%
  filter(abundance >= 0) %>%                # Remove negative values
  filter(!is.na(abundance))                 # Remove missing values
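
The same cleaning steps can be tried on a small, deliberately messy data frame. All values below are made up for illustration.

```r
library(dplyr)

messy <- data.frame(
  site = c("S01", "S02", "S03", "S04", "S05"),
  habitat = c("Forest", " forrest", "grasland", "Grassland ", "forest"),
  abundance = c(12, 5, -3, 4, NA)
)

unique(messy$habitat)  # five spellings of two habitats

cleaned <- messy %>%
  mutate(
    habitat = tolower(trimws(habitat)),   # Fix case and stray spaces first
    habitat = case_when(                  # Then fix known typos
      habitat == "forrest" ~ "forest",
      habitat == "grasland" ~ "grassland",
      TRUE ~ habitat
    )
  ) %>%
  filter(!is.na(abundance), abundance >= 0)

table(cleaned$habitat)  # forest: 2, grassland: 1
```

S03 is dropped for its negative count and S05 for its missing count, leaving three usable rows.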

4.4 Creating Analysis-Ready Data

4.4.1 Community Matrix

# For Coleoptera analysis, starting from the cleaned data
comm_matrix <- clean_data %>%
  filter(order == "Coleoptera") %>%
  group_by(site, morphospecies) %>%
  summarise(abundance = sum(abundance), .groups = "drop") %>%
  pivot_wider(
    names_from = morphospecies,
    values_from = abundance,
    values_fill = 0
  ) %>%
  column_to_rownames("site")

head(comm_matrix)

4.4.2 Environmental Data

# Create matching environmental data from the cleaned data
# (habitat typos would otherwise produce duplicate site rows)
env_data <- clean_data %>%
  select(site, habitat, landscape) %>%
  distinct() %>%
  column_to_rownames("site")

# Ensure same order as community matrix
env_data <- env_data[rownames(comm_matrix), , drop = FALSE]

# Verify alignment
all(rownames(comm_matrix) == rownames(env_data))  # Must be TRUE!
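
Why the alignment step matters is easiest to see when the two tables start in different orders. A base R sketch with made-up sites:

```r
# Made-up community matrix and environmental table with sites in different orders
comm_toy <- data.frame(sp1 = c(3, 0, 7), sp2 = c(1, 4, 0),
                       row.names = c("S01", "S02", "S03"))
env_toy <- data.frame(habitat = c("grassland", "forest", "forest"),
                      row.names = c("S03", "S01", "S02"))

identical(rownames(comm_toy), rownames(env_toy))  # FALSE: rows are misaligned

# Index by row name to put env_toy in the community matrix's order
env_toy <- env_toy[rownames(comm_toy), , drop = FALSE]

identical(rownames(comm_toy), rownames(env_toy))  # TRUE
env_toy$habitat  # "forest" "forest" "grassland"
```

Without this step, later analyses would silently pair each community with the wrong habitat label.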

4.4.3 Save Processed Data

# Save for tomorrow
write.csv(comm_matrix, "data/processed/beetle_community_matrix.csv")
write.csv(env_data, "data/processed/environmental_data.csv")
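
When these files are read back in, the site names need to become row names again. A minimal sketch of the round trip, using a temporary file and a made-up matrix:

```r
comm_toy <- data.frame(sp1 = c(3, 0), sp2 = c(1, 4),
                       row.names = c("S01", "S02"))

csv_path <- tempfile(fileext = ".csv")
write.csv(comm_toy, csv_path)  # Site names are written as the first column

# row.names = 1 turns that first column back into row names;
# check.names = FALSE keeps species names exactly as written
comm_reloaded <- read.csv(csv_path, row.names = 1, check.names = FALSE)

rownames(comm_reloaded)  # "S01" "S02"
dim(comm_reloaded)       # 2 sites x 2 species
```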

4.4.4 💡 Practice Exercise (10 min)

Evaluate which orders in your dataset meet the criteria for analysis
Choose one focal order and create:
- A community matrix (sites × species)
- A matching environmental data frame
Save both to data/processed/

END OF DAY 1

Summary

Today you learned:

✅ Project organization and why it matters
✅ Importing and exploring data
✅ Data wrangling with dplyr
✅ Choosing focal taxa
✅ Creating community matrices

Homework (Optional)

Complete any unfinished practice exercises
Try the Day 1 Capstone Exercise (in the Exercises document)
Review the concepts we covered

Tomorrow

Visualization with ggplot2
Diversity metrics
NMDS ordination
PERMANOVA

DAY 2: ANALYSIS & INTERPRETATION

5 Session 5: Visualization with ggplot2 (60 min)

5.1 Grammar of Graphics

Every ggplot has:

Data: What you’re plotting
Aesthetics (aes): How variables map to visual properties
Geometries (geom): What shapes represent the data

library(ggplot2)

# Basic structure:
# ggplot(data, aes(x = var1, y = var2)) + geom_*()

5.2 Essential Plot Types

5.2.1 Scatter Plot

# Prepare site-level data
site_summary <- insect_data %>%
  filter(order == "Coleoptera") %>%
  group_by(site, habitat) %>%
  summarise(
    abundance = sum(abundance),
    richness = n_distinct(morphospecies),
    .groups = "drop"
  )

# Basic scatter
ggplot(site_summary, aes(x = abundance, y = richness)) +
  geom_point()

# Enhanced scatter
ggplot(site_summary, aes(x = abundance, y = richness, color = habitat)) +
  geom_point(size = 4, alpha = 0.8) +
  geom_smooth(method = "lm", se = TRUE) +
  scale_color_brewer(palette = "Set1") +
  labs(
    x = "Total Abundance",
    y = "Species Richness",
    color = "Habitat",
    title = "Beetle Richness vs. Abundance"
  ) +
  theme_bw()

5.2.2 Box Plot

# Basic boxplot
ggplot(site_summary, aes(x = habitat, y = richness, fill = habitat)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(width = 0.1, alpha = 0.5) +
  scale_fill_brewer(palette = "Set2") +
  labs(x = "Habitat", y = "Species Richness") +
  theme_bw() +
  theme(legend.position = "none")

5.2.3 Bar Plot with Error Bars

# Calculate summary statistics
habitat_means <- site_summary %>%
  group_by(habitat) %>%
  summarise(
    mean_richness = mean(richness),
    se_richness = sd(richness) / sqrt(n()),
    .groups = "drop"
  )

# Bar plot
ggplot(habitat_means, aes(x = habitat, y = mean_richness, fill = habitat)) +
  geom_col(alpha = 0.8) +
  geom_errorbar(
    aes(ymin = mean_richness - se_richness,
        ymax = mean_richness + se_richness),
    width = 0.2
  ) +
  scale_fill_brewer(palette = "Set2") +
  labs(x = "Habitat", y = "Mean Richness (± SE)") +
  theme_bw() +
  theme(legend.position = "none")

5.3 Saving Plots

# Create a plot object
p <- ggplot(site_summary, aes(x = habitat, y = richness, fill = habitat)) +
  geom_boxplot() +
  theme_bw()

# Save to figures folder
ggsave("figures/richness_boxplot.png", p, width = 8, height = 6, dpi = 300)
ggsave("figures/richness_boxplot.pdf", p, width = 8, height = 6)

6 Session 6: Diversity Metrics (60 min)

6.1 Understanding Diversity

Alpha diversity = diversity within a site

Index	What it Measures	Sensitive To
Richness (S)	Number of species	Sampling effort, rare species
Shannon (H’)	Richness + evenness	Rare species
Simpson (1-D)	Diversity as 1 minus dominance (D)	Common species
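
The three indices can be computed by hand for a single site, which helps when interpreting the numbers. The counts below are made up. The formulas used are the standard ones, H' = -Σ p ln p and 1 - D = 1 - Σ p², which is also what vegan's `diversity()` computes by default.

```r
library(vegan)

# Made-up counts for one site
counts <- c(sp1 = 10, sp2 = 10, sp3 = 20)
p <- counts / sum(counts)     # Proportional abundances: 0.25, 0.25, 0.50

S  <- sum(counts > 0)         # Richness: 3
H  <- -sum(p * log(p))        # Shannon H' (natural log), about 1.04
D1 <- 1 - sum(p^2)            # Simpson 1 - D: 0.625

# Compare against vegan
all.equal(H,  unname(diversity(counts, index = "shannon")))  # TRUE
all.equal(D1, unname(diversity(counts, index = "simpson")))  # TRUE
```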

6.2 Calculating Diversity

library(vegan)

# Using the community matrix we created
# comm_matrix: rows = sites, columns = species, values = abundance

# Species richness
richness <- specnumber(comm_matrix)

# Shannon diversity
shannon <- diversity(comm_matrix, index = "shannon")

# Simpson diversity
simpson <- diversity(comm_matrix, index = "simpson")

# Evenness (Pielou's J)
# Note: NaN for sites with a single species, because log(1) = 0
evenness <- shannon / log(richness)

# Compile into data frame
alpha_div <- data.frame(
  site = rownames(comm_matrix),
  richness = richness,
  shannon = round(shannon, 3),
  simpson = round(simpson, 3),
  evenness = round(evenness, 3)
)

# Add environmental data
alpha_div <- alpha_div %>%
  left_join(env_data %>% rownames_to_column("site"), by = "site")

print(alpha_div)

6.3 Rarefaction

Problem: Sites with more individuals tend to have more species just by chance.

Solution: Rarefaction standardizes to equal sample size.

# Check sample sizes
sample_sizes <- rowSums(comm_matrix)
print(sample_sizes)

# Rarefy to minimum sample size
min_n <- min(sample_sizes)
rarefied <- rarefy(comm_matrix, sample = min_n)

# Rarefaction curves
rarecurve(comm_matrix, step = 1, col = rainbow(nrow(comm_matrix)),
          xlab = "Individuals", ylab = "Species", main = "Rarefaction Curves")
abline(v = min_n, lty = 2)

6.4 Testing Diversity Differences

# Kruskal-Wallis test (non-parametric)
kruskal.test(shannon ~ habitat, data = alpha_div)

# Visualize
ggplot(alpha_div, aes(x = habitat, y = shannon, fill = habitat)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(width = 0.1, size = 2) +
  scale_fill_brewer(palette = "Set2") +
  labs(x = "Habitat", y = "Shannon Diversity (H')") +
  theme_bw() +
  theme(legend.position = "none")

7 Session 7: Multivariate Analysis - NMDS (75 min)

7.1 Why Multivariate Analysis?

Alpha diversity tells us about individual sites. Beta diversity tells us how communities differ between sites.

NMDS (Non-metric Multidimensional Scaling) visualizes community similarity:
- Sites close together = similar communities
- Sites far apart = different communities

7.2 Distance Matrices

# Bray-Curtis dissimilarity (standard for abundance data)
dist_bray <- vegdist(comm_matrix, method = "bray")

# View as matrix
round(as.matrix(dist_bray), 2)

Bray-Curtis properties:
- Range: 0 (identical) to 1 (completely different)
- Accounts for abundance, not just presence
- Joint absences don’t make sites similar
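
Bray-Curtis dissimilarity between two sites is the sum of absolute differences in abundance divided by the total abundance across both sites. A worked example on two made-up sites, checked against `vegdist()`:

```r
library(vegan)

comm_toy <- rbind(
  S01 = c(sp1 = 12, sp2 = 5, sp3 = 0),
  S02 = c(sp1 = 18, sp2 = 0, sp3 = 3)
)

# |12 - 18| + |5 - 0| + |0 - 3| = 14; total abundance = 38
bc_manual <- sum(abs(comm_toy["S01", ] - comm_toy["S02", ])) / sum(comm_toy)
bc_manual  # 14 / 38, about 0.368

all.equal(bc_manual, as.numeric(vegdist(comm_toy, method = "bray")))  # TRUE
```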

7.3 Running NMDS

set.seed(123)  # For reproducibility!

nmds <- metaMDS(
  comm_matrix,
  distance = "bray",
  k = 2,              # Number of dimensions
  trymax = 100        # Number of attempts
)

# Check stress
cat("Stress:", nmds$stress, "\n")

Stress interpretation:

Stress	Quality
< 0.05	Excellent
0.05 - 0.10	Good
0.10 - 0.20	Acceptable
> 0.20	Poor - interpret with caution

# Shepard diagram: check fit
stressplot(nmds)
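
The stress table above can be turned into a small lookup. `interpret_stress()` is a hypothetical helper written for this manual, not a vegan function. How exact boundary values are assigned (for example, 0.20 counting as "Poor") is an assumption, since the table leaves it open.

```r
# Hypothetical helper mapping stress values to the labels in the stress table
interpret_stress <- function(stress) {
  cut(stress,
      breaks = c(-Inf, 0.05, 0.10, 0.20, Inf),
      labels = c("Excellent", "Good", "Acceptable", "Poor"),
      right = FALSE)  # Intervals are [lower, upper); 0.20 itself counts as "Poor"
}

as.character(interpret_stress(c(0.03, 0.08, 0.15, 0.25)))
# "Excellent" "Good" "Acceptable" "Poor"
```

With a fitted ordination, `interpret_stress(nmds$stress)` gives the label for your own result. Treat these cut-offs as rules of thumb, not as a test.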

7.4 Creating NMDS Plot

# Extract scores
nmds_scores <- as.data.frame(scores(nmds, display = "sites"))
nmds_scores$site <- rownames(nmds_scores)

# Add environmental data
nmds_scores <- nmds_scores %>%
  left_join(env_data %>% rownames_to_column("site"), by = "site")

# Create plot
ggplot(nmds_scores, aes(x = NMDS1, y = NMDS2, color = habitat)) +
  geom_point(size = 4, alpha = 0.8) +
  stat_ellipse(level = 0.95, linetype = 2) +
  scale_color_brewer(palette = "Set1") +
  labs(
    title = "NMDS of Beetle Communities",
    subtitle = paste("Stress =", round(nmds$stress, 3)),
    color = "Habitat"
  ) +
  theme_bw() +
  coord_equal()  # Equal axes for ordination!

# Save
ggsave("figures/nmds_plot.png", width = 8, height = 6, dpi = 300)

7.5 Adding Species

# Extract species scores
species_scores <- as.data.frame(scores(nmds, display = "species"))
species_scores$species <- rownames(species_scores)

# Plot with species
ggplot() +
  geom_point(data = nmds_scores, 
             aes(x = NMDS1, y = NMDS2, color = habitat),
             size = 4) +
  geom_text(data = species_scores,
            aes(x = NMDS1, y = NMDS2, label = species),
            size = 2.5, alpha = 0.7) +
  scale_color_brewer(palette = "Set1") +
  theme_bw() +
  coord_equal()

8 Session 8: Statistical Testing (60 min)

8.1 PERMANOVA

Question: Are communities significantly different between habitats?

PERMANOVA (Permutational Multivariate ANOVA) tests this using the distance matrix.

set.seed(123)  # Permutation p-values vary between runs unless seeded
permanova <- adonis2(
  comm_matrix ~ habitat,
  data = env_data,
  method = "bray",
  permutations = 999
)

print(permanova)

8.1.1 Interpreting Results

Column	Meaning
Df	Degrees of freedom
SumOfSqs	Variation explained
R2	Proportion of variation (effect size!)
F	Pseudo-F statistic
Pr(>F)	P-value

Effect size interpretation (R²):

R²	Effect
< 0.05	Tiny
0.05 - 0.10	Small
0.10 - 0.25	Medium
> 0.25	Large
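
R² in the PERMANOVA table is the sum of squares for a term divided by the total sum of squares. The sketch below checks this on simulated data. The community is random noise generated with `rpois()`, so no real habitat effect is expected, and the object names are illustrative.

```r
library(vegan)

set.seed(42)
# Simulated community: 8 sites x 4 species, with +1 so no site is empty
comm_sim <- matrix(rpois(32, lambda = 10) + 1, nrow = 8,
                   dimnames = list(paste0("S", 1:8), paste0("sp", 1:4)))
env_sim <- data.frame(habitat = rep(c("forest", "grassland"), each = 4),
                      row.names = rownames(comm_sim))

fit <- adonis2(comm_sim ~ habitat, data = env_sim,
               method = "bray", permutations = 199)

# Row 1 is the habitat term, row 2 the residual
ss_habitat  <- fit$SumOfSqs[1]
ss_residual <- fit$SumOfSqs[2]
r2_manual   <- ss_habitat / (ss_habitat + ss_residual)

all.equal(r2_manual, fit$R2[1])  # TRUE
```

Because R² is a proportion of total variation, it can be compared across analyses in a way that the p-value cannot.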

8.2 Check Dispersion

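Assumption: Groups should have similar within-group variation (dispersion). PERMANOVA can return a significant result when groups differ only in dispersion, so if the test below is significant, interpret the PERMANOVA result with caution.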

# Test homogeneity of dispersions
dispersion <- betadisper(dist_bray, env_data$habitat)
permutest(dispersion)

# Visualize
boxplot(dispersion, main = "Distance to Centroid by Habitat")

8.3 SIMPER: Which Species?

Question: Which species contribute most to the differences?

simper_result <- simper(comm_matrix, env_data$habitat, permutations = 999)
summary(simper_result)

Interpreting SIMPER:

Column	Meaning
average	Average contribution to dissimilarity
sd	Standard deviation
ratio	average/sd (>1 = consistent)
ava, avb	Average abundance in each group
cumsum	Cumulative contribution

8.4 Putting It Together

# Complete NMDS figure with results
p_final <- ggplot(nmds_scores, aes(x = NMDS1, y = NMDS2, color = habitat)) +
  geom_point(size = 5, alpha = 0.8) +
  stat_ellipse(level = 0.95, linetype = 2, linewidth = 1) +
  scale_color_brewer(palette = "Set1") +
  labs(
    title = "Beetle Community Composition",
    subtitle = sprintf("NMDS stress = %.3f | PERMANOVA: R² = %.2f, p = %.3f",
                       nmds$stress, permanova$R2[1], permanova$`Pr(>F)`[1]),
    color = "Habitat"
  ) +
  theme_bw(base_size = 14) +
  coord_equal() +
  theme(
    legend.position = "right",
    plot.subtitle = element_text(size = 10)
  )

print(p_final)
ggsave("figures/nmds_final.png", p_final, width = 10, height = 8, dpi = 300)

Wrap-up & Next Steps (15 min)

What You Learned

Day 1:
✅ Project organization (the most important skill!)
✅ Data import and exploration
✅ Data wrangling with tidyverse
✅ Choosing focal taxa
✅ Creating community matrices

Day 2:
✅ Visualization with ggplot2
✅ Diversity metrics (richness, Shannon, Simpson)
✅ NMDS ordination
✅ PERMANOVA and SIMPER

Complete Workflow

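# 1. Setup
library(tidyverse)
library(vegan)

# 2. Import
raw_data <- read.csv("data/raw/my_data.csv")

# 3. Clean & prepare (column names follow the workshop data; adjust to yours)
clean_data <- raw_data %>%
  filter(!is.na(abundance), abundance >= 0) %>%
  mutate(habitat = tolower(trimws(habitat)))

comm_matrix <- clean_data %>%
  group_by(site, morphospecies) %>%
  summarise(abundance = sum(abundance), .groups = "drop") %>%
  pivot_wider(names_from = morphospecies, values_from = abundance,
              values_fill = 0) %>%
  column_to_rownames("site")

env_data <- clean_data %>%
  select(site, habitat) %>%
  distinct() %>%
  column_to_rownames("site")
env_data <- env_data[rownames(comm_matrix), , drop = FALSE]  # Align rows

# 4. Diversity
alpha_div <- data.frame(
  richness = specnumber(comm_matrix),
  shannon = diversity(comm_matrix, "shannon")
)

# 5. Ordination
set.seed(123)
nmds <- metaMDS(comm_matrix, distance = "bray")

# 6. Statistics
permanova <- adonis2(comm_matrix ~ habitat, data = env_data)

# 7. Interpret!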

Resources for Continued Learning

Books:
- Numerical Ecology with R (Borcard, Gillet & Legendre)
- R for Data Science (Wickham & Grolemund), free online

Online:
- GUSTA ME: https://sites.google.com/site/mb3gustame/
- R-bloggers, Stack Overflow

Packages to explore:
- BiodiversityR: comprehensive biodiversity analysis
- indicspecies: indicator species analysis
- iNEXT: diversity interpolation/extrapolation

Quick Reference

Key Functions

Task	Function	Package
Import data	`read.csv()`	base
Filter rows	`filter()`	dplyr
Select columns	`select()`	dplyr
Create columns	`mutate()`	dplyr
Group & summarize	`group_by() %>% summarise()`	dplyr
Reshape to wide	`pivot_wider()`	tidyr
Plot	`ggplot()`	ggplot2
Save plot	`ggsave()`	ggplot2
Richness	`specnumber()`	vegan
Shannon	`diversity(x, "shannon")`	vegan
Distance matrix	`vegdist()`	vegan
NMDS	`metaMDS()`	vegan
PERMANOVA	`adonis2()`	vegan
SIMPER	`simper()`	vegan

Interpretation Guide

NMDS Stress: < 0.05 excellent | 0.05-0.10 good | 0.10-0.20 acceptable | > 0.20 poor

PERMANOVA R²: < 0.05 tiny | 0.05-0.10 small | 0.10-0.25 medium | > 0.25 large

Happy coding and happy bug hunting!

Workshop: Introduction to R Statistics for Insect Ecology

A 2-Day Workshop for Entomology Students

Workshop Manual by Amanda Mawan

11-12 February 2026