╔══════════════════════════════════════════════════════════════════════════════╗
║ DAY 1: DATA FOUNDATIONS 9:00 - 15:00 ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ 09:00 - 10:15 Session 1: Project Setup & R Essentials 75 min ║
║ 10:15 - 10:30 ☕ Break 15 min ║
║ 10:30 - 11:30 Session 2: Importing & Exploring Data 60 min ║
║ 11:30 - 12:30 🍽️ LUNCH 60 min ║
║ 12:30 - 13:45 Session 3: Data Wrangling with tidyverse 75 min ║
║ 13:45 - 14:00 ☕ Break 15 min ║
║ 14:00 - 15:00 Session 4: Data Exploration & Cleaning 60 min ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ DAY 2: ANALYSIS & INTERPRETATION 9:00 - 15:00 ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ 09:00 - 10:00 Session 5: Visualization with ggplot2 60 min ║
║ 10:00 - 10:15 ☕ Break 15 min ║
║ 10:15 - 11:15 Session 6: Diversity Metrics 60 min ║
║ 11:15 - 12:15 🍽️ LUNCH 60 min ║
║ 12:15 - 13:30 Session 7: Multivariate Analysis (NMDS) 75 min ║
║ 13:30 - 13:45 ☕ Break 15 min ║
║ 13:45 - 14:45 Session 8: Statistical Testing 60 min ║
║ 14:45 - 15:00 Wrap-up & Next Steps 15 min ║
╚══════════════════════════════════════════════════════════════════════════════╝
Before this workshop, you should have completed:
When we sample insects across different habitats or landscapes, we’re fundamentally asking:
“Do different environments support different insect communities, and why?”
This connects to ecological theory:
Our statistical analyses detect these patterns and test hypotheses.
RAW DATA → CLEAN & PREPARE → EXPLORE → ANALYZE → INTERPRET
│ │ │
│ │ ├── Alpha diversity
│ │ ├── NMDS ordination
│ │ └── PERMANOVA
│ │
│ └── Which taxa to focus on?
│
└── Community matrix + Environmental data
🔑 KEY CONCEPT: Always work within an R Project. Never use
setwd().
Insect_Ecology_Workshop folder.Rproj fileVerify you’re in the right place:
Insect_Ecology_Workshop/
│
├── Insect_Ecology_Workshop.Rproj ← Always open THIS file!
│
├── data/
│ ├── raw/ ← Original data (NEVER modify!)
│ │ ├── sample_insect_data.csv
│ │ └── pollinator_data.csv
│ └── processed/ ← Cleaned data goes here
│ └── beetles_clean.csv
│
├── scripts/ ← Your R scripts
│ ├── 01_data_import.R
│ ├── 02_exploration.R
│ ├── 03_diversity.R
│ └── 04_ordination.R
│
├── output/ ← Tables, results
│ └── diversity_table.csv
│
└── figures/ ← Saved plots
├── nmds_plot.png
└── diversity_boxplot.pdf
# ❌ BAD - Absolute paths that break on other computers:
setwd("C:/Users/John/Desktop/thesis/chapter2/data")
data <- read.csv("beetles.csv")
# ❌ BAD - Files scattered everywhere:
data <- read.csv("C:/Users/John/Downloads/beetles.csv")
# ✅ GOOD - Relative paths from project folder:
data <- read.csv("data/raw/beetles.csv")
# Works on ANY computer with the same folder structure!
# ✅ GOOD - Saving outputs to organized folders:
write.csv(results, "output/diversity_results.csv")
ggsave("figures/nmds_plot.png")Never modify files in
data/raw/!
Your raw data should remain exactly as you received it. If you need to clean or modify data:
data/raw/data/processed/# Read raw data
beetles_raw <- read.csv("data/raw/beetle_survey.csv")
# Clean it
beetles_clean <- beetles_raw %>% # beetles_clean is the name of the new cleaned object.
filter(!is.na(abundance)) %>%
# This function keeps only the rows that meet a certain condition-keeps only the rows where the abundance is not missing.
mutate(habitat = tolower(habitat))
# This function is used to create a new column or modify an existing one.
# tolower() function converts all text in the `habitat` column to **lowercase**.
# Save cleaned version on processed folder
write.csv(beetles_clean, "data/processed/beetles_clean.csv", row.names = FALSE)# Vectors are collections of values
abundances <- c(12, 45, 23, 8, 67, 34)
# Operations on vectors
sum(abundances) # 189
mean(abundances) # 31.5
abundances * 2 # Doubles each element
# Indexing: extract elements with [ ]
abundances[1] # First element: 12
abundances[c(1, 3, 5)] # Elements 1, 3, 5
abundances[abundances > 30] # Elements greater than 30# Data frames are like spreadsheets
beetle_data <- data.frame(
site = c("A", "B", "C", "D"),
habitat = c("forest", "forest", "grassland", "grassland"),
abundance = c(45, 32, 67, 51)
)
# Access columns with $
beetle_data$abundance
beetle_data$habitat
# Access rows with [ row , column ]
beetle_data[1, ] # First row
beetle_data[, 3] # Third column
beetle_data[beetle_data$habitat == "forest", ] # Forest rows onlyCreate a new script for this workshop:
scripts/01_day1_analysis.RScript header template:
#============================================================================
# Insect Community Analysis - Day 1
# Author: [Your Name]
# Date: [Today's Date]
# Description: Import, clean, and explore insect survey data
#============================================================================
# Load packages ----
library(tidyverse)
library(vegan)
# Import data ----
# Data exploration ----
# Analysis ----scripts/practice.Rmy_counts with values 5, 12, 8,
20, 15avg_count# Make sure your data file is in data/raw/
# Import with read.csv()
insect_data <- read.csv("data/raw/sample_insect_data.csv")
# Common options for messy data:
insect_data <- read.csv(
"data/raw/sample_insect_data.csv",
header = TRUE, # First row is column names
stringsAsFactors = FALSE, # Keep text as text
na.strings = c("", "NA", "N/A") # Recognize these as missing
)Always explore data immediately after importing!
# View the first few rows
head(insect_data)
# View in spreadsheet format (capital V!)
View(insect_data)
# Structure: column types and preview
str(insect_data)
# Dimensions: rows × columns
dim(insect_data)
nrow(insect_data)
ncol(insect_data)
# Column names
names(insect_data)
# Summary statistics
summary(insect_data)# Missing values
sum(is.na(insect_data)) # Total NAs
colSums(is.na(insect_data)) # NAs per column
# Unique values in categorical columns
unique(insect_data$habitat)
unique(insect_data$order)
# Check for unexpected values
table(insect_data$habitat) # Frequency table
table(insect_data$order)
# Check numeric ranges
range(insect_data$abundance)
summary(insect_data$abundance)# Filter rows by condition
forest_data <- insect_data[insect_data$habitat == "forest", ]
coleoptera <- insect_data[insect_data$order == "Coleoptera", ]
# Select specific columns
selected <- insect_data[, c("site", "habitat", "morphospecies", "abundance")]
# Combine conditions
forest_beetles <- insect_data[
insect_data$habitat == "forest" & insect_data$order == "Coleoptera",
]sample_insect_data.csvThe pipe %>% chains operations together. Read it as
“then”.
library(tidyverse)
# Without pipe (nested, hard to read):
round(mean(sqrt(c(1, 4, 9, 16))), 2)
# With pipe (step by step, easy to read):
c(1, 4, 9, 16) %>%
sqrt() %>%
mean() %>%
round(2)
# Read as: "Take 1,4,9,16, THEN sqrt, THEN mean, THEN round"Keyboard shortcut: Ctrl + Shift + M (Windows) or Cmd + Shift + M (Mac)
# Keep only forest sites
insect_data %>%
filter(habitat == "forest")
# Multiple conditions (AND)
insect_data %>%
filter(habitat == "forest", abundance > 10)
# Multiple options (OR)
insect_data %>%
filter(habitat %in% c("forest", "grassland"))
# Exclude
insect_data %>%
filter(habitat != "agriculture")This is extremely powerful!
# Summary by habitat
insect_data %>%
group_by(habitat) %>%
summarise(
total_abundance = sum(abundance),
n_species = n_distinct(morphospecies),
n_sites = n_distinct(site),
.groups = "drop"
)
# Summary by habitat AND order
insect_data %>%
group_by(habitat, order) %>%
summarise(
mean_abundance = mean(abundance),
n_species = n_distinct(morphospecies),
.groups = "drop"
)Long format (one observation per row):
site | species | abundance
------|-------------|----------
S01 | Carabus_sp1 | 12
S01 | Carabus_sp2 | 5
S02 | Carabus_sp1 | 18
Wide format (community matrix):
site | Carabus_sp1 | Carabus_sp2
------|-------------|------------
S01 | 12 | 5
S02 | 18 | 0
# Complete data preparation pipeline
beetle_summary <- insect_data %>%
# Filter to beetles only
filter(order == "Coleoptera") %>%
# Group by site and habitat
group_by(site, habitat, landscape) %>%
# Calculate summary statistics
summarise(
abundance = sum(abundance),
richness = n_distinct(morphospecies),
.groups = "drop"
) %>%
# Add new columns
mutate(
log_abundance = log(abundance + 1)
) %>%
# Sort by richness
arrange(desc(richness))
beetle_summaryUsing insect_data:
Data exploration is detective work. Find the patterns before testing them.
# Quick overview
cat("=== DATASET OVERVIEW ===\n")
cat("Total records:", nrow(insect_data), "\n")
cat("Total individuals:", sum(insect_data$abundance), "\n")
cat("Number of sites:", n_distinct(insect_data$site), "\n")
cat("Number of habitats:", n_distinct(insect_data$habitat), "\n")
cat("Number of orders:", n_distinct(insect_data$order), "\n")
cat("Number of morphospecies:", n_distinct(insect_data$morphospecies), "\n")Not all groups are suitable for analysis. Choose wisely!
| Criterion | Minimum | Why |
|---|---|---|
| Total abundance | ≥ 50 | Statistical power |
| Species richness | ≥ 5 | Diversity to analyze |
| Sites present | ≥ 50% of sites | Not just rare occurrence |
| Habitat variation | CV > 20% | Interesting patterns |
| Ecological relevance | High | Meaningful interpretation |
# Calculate habitat variation
habitat_variation <- insect_data %>%
group_by(order, habitat) %>%
summarise(abundance = sum(abundance), .groups = "drop") %>%
group_by(order) %>%
summarise(
cv_abundance = sd(abundance) / mean(abundance) * 100,
.groups = "drop"
)
# Combine with order summary
order_evaluation <- order_summary %>%
left_join(habitat_variation, by = "order") %>%
mutate(
recommended = total_abundance >= 50 &
n_species >= 5 &
cv_abundance >= 20
)
order_evaluation| Group | Pitfall Suitability | Notes |
|---|---|---|
| Carabidae (ground beetles) | Excellent | Well-studied indicators |
| Formicidae (ants) | Excellent | Colonial, sensitive to disturbance |
| Araneae (spiders) | Good | Predators, hunting guilds |
| Staphylinidae (rove beetles) | Good | Diverse decomposers |
| Orthoptera (grasshoppers) | Moderate | Vegetation-dependent |
| Flying insects | Poor | Undersampled by pitfalls |
# Check for problems
unique(insect_data$habitat) # Look for typos, case issues
# Standardize text
clean_data <- insect_data %>%
mutate(
habitat = tolower(trimws(habitat)), # Lowercase, remove spaces
habitat = case_when( # Fix typos
habitat == "forrest" ~ "forest",
habitat == "grasland" ~ "grassland",
TRUE ~ habitat
)
)
# Handle impossible values
clean_data <- clean_data %>%
filter(abundance >= 0) %>% # Remove negative values
filter(!is.na(abundance)) # Remove missing values# For Coleoptera analysis
comm_matrix <- insect_data %>%
filter(order == "Coleoptera") %>%
group_by(site, morphospecies) %>%
summarise(abundance = sum(abundance), .groups = "drop") %>%
pivot_wider(
names_from = morphospecies,
values_from = abundance,
values_fill = 0
) %>%
column_to_rownames("site")
head(comm_matrix)# Create matching environmental data
env_data <- insect_data %>%
select(site, habitat, landscape) %>%
distinct() %>%
column_to_rownames("site")
# Ensure same order as community matrix
env_data <- env_data[rownames(comm_matrix), , drop = FALSE]
# Verify alignment
all(rownames(comm_matrix) == rownames(env_data)) # Must be TRUE!data/processed/Today you learned:
Every ggplot has:
# Prepare site-level data
site_summary <- insect_data %>%
filter(order == "Coleoptera") %>%
group_by(site, habitat) %>%
summarise(
abundance = sum(abundance),
richness = n_distinct(morphospecies),
.groups = "drop"
)
# Basic scatter
ggplot(site_summary, aes(x = abundance, y = richness)) +
geom_point()
# Enhanced scatter
ggplot(site_summary, aes(x = abundance, y = richness, color = habitat)) +
geom_point(size = 4, alpha = 0.8) +
geom_smooth(method = "lm", se = TRUE) +
scale_color_brewer(palette = "Set1") +
labs(
x = "Total Abundance",
y = "Species Richness",
color = "Habitat",
title = "Beetle Richness vs. Abundance"
) +
theme_bw()# Calculate summary statistics
habitat_means <- site_summary %>%
group_by(habitat) %>%
summarise(
mean_richness = mean(richness),
se_richness = sd(richness) / sqrt(n()),
.groups = "drop"
)
# Bar plot
ggplot(habitat_means, aes(x = habitat, y = mean_richness, fill = habitat)) +
geom_col(alpha = 0.8) +
geom_errorbar(
aes(ymin = mean_richness - se_richness,
ymax = mean_richness + se_richness),
width = 0.2
) +
scale_fill_brewer(palette = "Set2") +
labs(x = "Habitat", y = "Mean Richness (± SE)") +
theme_bw() +
theme(legend.position = "none")# Create a plot object
p <- ggplot(site_summary, aes(x = habitat, y = richness, fill = habitat)) +
geom_boxplot() +
theme_bw()
# Save to figures folder
ggsave("figures/richness_boxplot.png", p, width = 8, height = 6, dpi = 300)
ggsave("figures/richness_boxplot.pdf", p, width = 8, height = 6)Alpha diversity = diversity within a site
| Index | What it Measures | Sensitive To |
|---|---|---|
| Richness (S) | Number of species | Sampling effort, rare species |
| Shannon (H’) | Richness + evenness | Rare species |
| Simpson (1-D) | Dominance | Common species |
library(vegan)
# Using the community matrix we created
# comm_matrix: rows = sites, columns = species, values = abundance
# Species richness
richness <- specnumber(comm_matrix)
# Shannon diversity
shannon <- diversity(comm_matrix, index = "shannon")
# Simpson diversity
simpson <- diversity(comm_matrix, index = "simpson")
# Evenness (Pielou's J)
evenness <- shannon / log(richness)
# Compile into data frame
alpha_div <- data.frame(
site = rownames(comm_matrix),
richness = richness,
shannon = round(shannon, 3),
simpson = round(simpson, 3),
evenness = round(evenness, 3)
)
# Add environmental data
alpha_div <- alpha_div %>%
left_join(env_data %>% rownames_to_column("site"), by = "site")
print(alpha_div)Problem: Sites with more individuals tend to have more species just by chance.
Solution: Rarefaction standardizes to equal sample size.
# Check sample sizes
sample_sizes <- rowSums(comm_matrix)
print(sample_sizes)
# Rarefy to minimum sample size
min_n <- min(sample_sizes)
rarefied <- rarefy(comm_matrix, sample = min_n)
# Rarefaction curves
rarecurve(comm_matrix, step = 1, col = rainbow(nrow(comm_matrix)),
xlab = "Individuals", ylab = "Species", main = "Rarefaction Curves")
abline(v = min_n, lty = 2)# Kruskal-Wallis test (non-parametric)
kruskal.test(shannon ~ habitat, data = alpha_div)
# Visualize
ggplot(alpha_div, aes(x = habitat, y = shannon, fill = habitat)) +
geom_boxplot(alpha = 0.7) +
geom_jitter(width = 0.1, size = 2) +
scale_fill_brewer(palette = "Set2") +
labs(x = "Habitat", y = "Shannon Diversity (H')") +
theme_bw() +
theme(legend.position = "none")Alpha diversity tells us about individual sites. Beta diversity tells us how communities differ between sites.
NMDS (Non-metric Multidimensional Scaling) visualizes community similarity: - Sites close together = similar communities - Sites far apart = different communities
# Bray-Curtis dissimilarity (standard for abundance data)
dist_bray <- vegdist(comm_matrix, method = "bray")
# View as matrix
round(as.matrix(dist_bray), 2)Bray-Curtis properties: - Range: 0 (identical) to 1 (completely different) - Accounts for abundance, not just presence - Joint absences don’t make sites similar
set.seed(123) # For reproducibility!
nmds <- metaMDS(
comm_matrix,
distance = "bray",
k = 2, # Number of dimensions
trymax = 100 # Number of attempts
)
# Check stress
cat("Stress:", nmds$stress, "\n")Stress interpretation:
| Stress | Quality |
|---|---|
| < 0.05 | Excellent |
| 0.05 - 0.10 | Good |
| 0.10 - 0.20 | Acceptable |
| > 0.20 | Poor - interpret with caution |
# Extract scores
nmds_scores <- as.data.frame(scores(nmds, display = "sites"))
nmds_scores$site <- rownames(nmds_scores)
# Add environmental data
nmds_scores <- nmds_scores %>%
left_join(env_data %>% rownames_to_column("site"), by = "site")
# Create plot
ggplot(nmds_scores, aes(x = NMDS1, y = NMDS2, color = habitat)) +
geom_point(size = 4, alpha = 0.8) +
stat_ellipse(level = 0.95, linetype = 2) +
scale_color_brewer(palette = "Set1") +
labs(
title = "NMDS of Beetle Communities",
subtitle = paste("Stress =", round(nmds$stress, 3)),
color = "Habitat"
) +
theme_bw() +
coord_equal() # Equal axes for ordination!
# Save
ggsave("figures/nmds_plot.png", width = 8, height = 6, dpi = 300)# Extract species scores
species_scores <- as.data.frame(scores(nmds, display = "species"))
species_scores$species <- rownames(species_scores)
# Plot with species
ggplot() +
geom_point(data = nmds_scores,
aes(x = NMDS1, y = NMDS2, color = habitat),
size = 4) +
geom_text(data = species_scores,
aes(x = NMDS1, y = NMDS2, label = species),
size = 2.5, alpha = 0.7) +
scale_color_brewer(palette = "Set1") +
theme_bw() +
coord_equal()Question: Are communities significantly different between habitats?
PERMANOVA (Permutational Multivariate ANOVA) tests this using the distance matrix.
permanova <- adonis2(
comm_matrix ~ habitat,
data = env_data,
method = "bray",
permutations = 999
)
print(permanova)| Column | Meaning |
|---|---|
| Df | Degrees of freedom |
| SumOfSqs | Variation explained |
| R2 | Proportion of variation (effect size!) |
| F | Pseudo-F statistic |
| Pr(>F) | P-value |
Effect size interpretation (R²):
| R² | Effect |
|---|---|
| < 0.05 | Tiny |
| 0.05 - 0.10 | Small |
| 0.10 - 0.25 | Medium |
| > 0.25 | Large |
Assumption: Groups should have similar within-group variation.
Question: Which species contribute most to the differences?
Interpreting SIMPER:
| Column | Meaning |
|---|---|
| average | Average contribution to dissimilarity |
| sd | Standard deviation |
| ratio | average/sd (>1 = consistent) |
| ava, avb | Average abundance in each group |
| cumsum | Cumulative contribution |
# Complete NMDS figure with results
p_final <- ggplot(nmds_scores, aes(x = NMDS1, y = NMDS2, color = habitat)) +
geom_point(size = 5, alpha = 0.8) +
stat_ellipse(level = 0.95, linetype = 2, linewidth = 1) +
scale_color_brewer(palette = "Set1") +
labs(
title = "Beetle Community Composition",
subtitle = sprintf("NMDS stress = %.3f | PERMANOVA: R² = %.2f, p = %.3f",
nmds$stress, permanova$R2[1], permanova$`Pr(>F)`[1]),
color = "Habitat"
) +
theme_bw(base_size = 14) +
coord_equal() +
theme(
legend.position = "right",
plot.subtitle = element_text(size = 10)
)
print(p_final)
ggsave("figures/nmds_final.png", p_final, width = 10, height = 8, dpi = 300)Day 1: - ✅ Project organization (the most important skill!) - ✅ Data import and exploration - ✅ Data wrangling with tidyverse - ✅ Choosing focal taxa - ✅ Creating community matrices
Day 2: - ✅ Visualization with ggplot2 - ✅ Diversity metrics (richness, Shannon, Simpson) - ✅ NMDS ordination - ✅ PERMANOVA and SIMPER
# 1. Setup
library(tidyverse)
library(vegan)
# 2. Import
raw_data <- read.csv("data/raw/my_data.csv")
# 3. Clean & prepare
clean_data <- raw_data %>%
filter(...) %>%
mutate(...)
comm_matrix <- clean_data %>%
pivot_wider(...)
env_data <- clean_data %>%
select(site, habitat) %>%
distinct()
# 4. Diversity
alpha_div <- data.frame(
richness = specnumber(comm_matrix),
shannon = diversity(comm_matrix, "shannon")
)
# 5. Ordination
nmds <- metaMDS(comm_matrix, distance = "bray")
# 6. Statistics
permanova <- adonis2(comm_matrix ~ habitat, data = env_data)
# 7. Interpret!Books: - Numerical Ecology with R (Borcard, Gillet & Legendre) - R for Data Science (Wickham & Grolemund) - free online
Online: - GUSTA ME: https://sites.google.com/site/mb3gustame/ - R-bloggers, Stack Overflow
Packages to explore: - BiodiversityR -
comprehensive biodiversity analysis - indicspecies -
indicator species analysis - iNEXT - diversity
interpolation/extrapolation
| Task | Function | Package |
|---|---|---|
| Import data | read.csv() |
base |
| Filter rows | filter() |
dplyr |
| Select columns | select() |
dplyr |
| Create columns | mutate() |
dplyr |
| Group & summarize | group_by() %>% summarise() |
dplyr |
| Reshape to wide | pivot_wider() |
tidyr |
| Plot | ggplot() |
ggplot2 |
| Save plot | ggsave() |
ggplot2 |
| Richness | specnumber() |
vegan |
| Shannon | diversity(x, "shannon") |
vegan |
| Distance matrix | vegdist() |
vegan |
| NMDS | metaMDS() |
vegan |
| PERMANOVA | adonis2() |
vegan |
| SIMPER | simper() |
vegan |
NMDS Stress: < 0.1 good | 0.1-0.2 acceptable | > 0.2 poor
PERMANOVA R²: < 0.05 tiny | 0.05-0.10 small | 0.10-0.25 medium | > 0.25 large
Happy coding and happy bug hunting!