Workshop Overview

Schedule

╔══════════════════════════════════════════════════════════════════════════════╗
║  DAY 1: DATA FOUNDATIONS                         9:00 - 15:00                ║
╠══════════════════════════════════════════════════════════════════════════════╣
║  09:00 - 10:15  Session 1: Project Setup & R Essentials            75 min    ║
║  10:15 - 10:30  ☕ Break                                            15 min    ║
║  10:30 - 11:30  Session 2: Importing & Exploring Data              60 min    ║
║  11:30 - 12:30  🍽️ LUNCH                                            60 min   ║
║  12:30 - 13:45  Session 3: Data Wrangling with tidyverse           75 min    ║
║  13:45 - 14:00  ☕ Break                                            15 min    ║
║  14:00 - 15:00  Session 4: Data Exploration & Cleaning             60 min    ║
╠══════════════════════════════════════════════════════════════════════════════╣
║  DAY 2: ANALYSIS & INTERPRETATION                9:00 - 15:00                ║
╠══════════════════════════════════════════════════════════════════════════════╣
║  09:00 - 10:00  Session 5: Visualization with ggplot2              60 min    ║
║  10:00 - 10:15  ☕ Break                                            15 min    ║
║  10:15 - 11:15  Session 6: Diversity Metrics                       60 min    ║
║  11:15 - 12:15  🍽️ LUNCH                                            60 min   ║
║  12:15 - 13:30  Session 7: Multivariate Analysis (NMDS)            75 min    ║
║  13:30 - 13:45  ☕ Break                                            15 min    ║
║  13:45 - 14:45  Session 8: Statistical Testing                     60 min    ║
║  14:45 - 15:00  Wrap-up & Next Steps                               15 min    ║
╚══════════════════════════════════════════════════════════════════════════════╝

Prerequisites

Before this workshop, you should have completed:

✅ Installed R and RStudio
✅ Installed required packages (tidyverse, vegan, ape, indicspecies)
✅ Created the workshop project with folder structure
✅ Completed the Pre-Workshop Preparation exercises

The Ecological Context

When we sample insects across different habitats or landscapes, we’re fundamentally asking:

“Do different environments support different insect communities, and why?”

This connects to ecological theory:

Niche theory: Species occur where conditions match their requirements
Habitat filtering: Environmental conditions “filter” which species persist
Dispersal limitation: Species can only occur in places they are able to reach

The statistical analyses in this workshop detect these patterns and test hypotheses about them.

Analysis Workflow Overview

RAW DATA → CLEAN & PREPARE → EXPLORE → ANALYZE → INTERPRET
              │                  │         │
              │                  │         ├── Alpha diversity
              │                  │         ├── NMDS ordination
              │                  │         └── PERMANOVA
              │                  │
              │                  └── Which taxa to focus on?
              │
              └── Community matrix + Environmental data

DAY 1: DATA FOUNDATIONS

1 Session 1: Project Setup & R Essentials (75 min)

1.1 Opening Your Workshop Project

🔑 KEY CONCEPT: Always work within an R Project. Never use setwd().

1.1.1 Starting the Workshop

Open RStudio
Go to File → Open Project
Navigate to your Insect_Ecology_Workshop folder
Select the .Rproj file
Click Open

Verify you’re in the right place:

# Check your working directory
getwd()
# Should show: ".../Insect_Ecology_Workshop"

# List files in your project
list.files()
# Should show: data, scripts, output, figures folders

1.1.2 If You Haven’t Created the Project Yet

# Create folder structure
dir.create("data")
dir.create("data/raw")
dir.create("data/processed")
dir.create("scripts")
dir.create("output")
dir.create("figures")

# Verify
list.files()

1.2 Project Organization Best Practices

1.2.1 The Golden Structure

Insect_Ecology_Workshop/
│
├── Insect_Ecology_Workshop.Rproj   ← Always open THIS file!
│
├── data/
│   ├── raw/                        ← Original data (NEVER modify!)
│   │   ├── sample_insect_data.csv
│   │   └── pollinator_data.csv
│   └── processed/                  ← Cleaned data goes here
│       └── beetles_clean.csv
│
├── scripts/                        ← Your R scripts
│   ├── 01_data_import.R
│   ├── 02_exploration.R
│   ├── 03_diversity.R
│   └── 04_ordination.R
│
├── output/                         ← Tables, results
│   └── diversity_table.csv
│
└── figures/                        ← Saved plots
    ├── nmds_plot.png
    └── diversity_boxplot.pdf

1.2.2 Why This Matters

# ❌ BAD - Absolute paths that break on other computers:
setwd("C:/Users/John/Desktop/thesis/chapter2/data")
data <- read.csv("beetles.csv")

# ❌ BAD - Files scattered everywhere:
data <- read.csv("C:/Users/John/Downloads/beetles.csv")

# ✅ GOOD - Relative paths from project folder:
data <- read.csv("data/raw/beetles.csv")
# Works on ANY computer with the same folder structure!

# ✅ GOOD - Saving outputs to organized folders:
write.csv(results, "output/diversity_results.csv")
ggsave("figures/nmds_plot.png")
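
If a relative path fails, the usual cause is that the file is not where you think it is, or that you are not inside the project. A minimal sketch of a defensive read, reusing the beetles.csv example above (raw_path is just an illustrative object name):

```r
# Build a relative path from its parts; file.path() joins them with "/"
raw_path <- file.path("data", "raw", "beetles.csv")
raw_path   # "data/raw/beetles.csv"

# Check that the file is where you expect before trying to read it
if (file.exists(raw_path)) {
  beetles <- read.csv(raw_path)
} else {
  message("File not found: ", raw_path, " (working directory: ", getwd(), ")")
}
```

The message prints the working directory, which quickly shows whether you opened the .Rproj file or not.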

1.2.3 The Raw Data Rule

Never modify files in data/raw/!

Your raw data should remain exactly as you received it. If you need to clean or modify data:

Read from data/raw/
Clean in R
Save to data/processed/

# Load the tidyverse for %>%, filter() and mutate()
library(tidyverse)

# Read raw data
beetles_raw <- read.csv("data/raw/beetle_survey.csv")

# Clean it and store the result in a new object, beetles_clean
beetles_clean <- beetles_raw %>%
  filter(!is.na(abundance)) %>%        # keep only rows where abundance is not missing
  mutate(habitat = tolower(habitat))   # overwrite habitat with its lowercase version

# Save the cleaned version to the processed folder
write.csv(beetles_clean, "data/processed/beetles_clean.csv", row.names = FALSE)

1.3 R Essentials Quick Review

1.3.1 Objects and Assignment

# Store values in objects using <-
n_sites <- 12
study_area <- "West Java"
is_dry_season <- TRUE

# View objects by typing their name
n_sites
study_area

# Keyboard shortcut for <- : Alt + - (Windows) or Option + - (Mac)

1.3.2 Functions

# Functions perform actions: function_name(arguments)
mean(c(10, 20, 30))        # Calculate mean
sum(c(1, 2, 3, 4, 5))      # Calculate sum
length(c(5, 10, 15))       # Count elements

# Functions can have multiple arguments
round(3.14159, digits = 2)  # Round to 2 decimal places

1.3.3 Vectors

# Vectors are collections of values
abundances <- c(12, 45, 23, 8, 67, 34)

# Operations on vectors
sum(abundances)            # 189
mean(abundances)           # 31.5
abundances * 2             # Doubles each element

# Indexing: extract elements with [ ]
abundances[1]              # First element: 12
abundances[c(1, 3, 5)]     # Elements 1, 3, 5
abundances[abundances > 30] # Elements greater than 30
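
The last line above works because the comparison itself produces a vector of TRUE/FALSE values, which is then used as the index. A short sketch that breaks this into steps (is_large is an illustrative name):

```r
abundances <- c(12, 45, 23, 8, 67, 34)

# A comparison returns one TRUE/FALSE per element
is_large <- abundances > 30
is_large                  # FALSE TRUE FALSE FALSE TRUE TRUE

# TRUE counts as 1, so sum() counts how many elements pass the test
sum(is_large)             # 3

# which() gives the positions of the TRUE values
which(is_large)           # 2 5 6

# Using the logical vector as an index keeps only the matching values
abundances[is_large]      # 45 67 34
```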

1.3.4 Data Frames

# Data frames are like spreadsheets
beetle_data <- data.frame(
  site = c("A", "B", "C", "D"),
  habitat = c("forest", "forest", "grassland", "grassland"),
  abundance = c(45, 32, 67, 51)
)

# Access columns with $
beetle_data$abundance
beetle_data$habitat

# Access rows with [ row , column ]
beetle_data[1, ]           # First row
beetle_data[, 3]           # Third column
beetle_data[beetle_data$habitat == "forest", ]  # Forest rows only
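
Two more base R operations are worth seeing before we move to the tidyverse: adding a column and summarising by group. A minimal sketch using the same beetle_data as above (habitat_means is an illustrative name):

```r
beetle_data <- data.frame(
  site = c("A", "B", "C", "D"),
  habitat = c("forest", "forest", "grassland", "grassland"),
  abundance = c(45, 32, 67, 51)
)

# Add a new column by assigning to a name that does not exist yet
beetle_data$log_abundance <- log(beetle_data$abundance + 1)

# Mean abundance per habitat: tapply(values, groups, function)
habitat_means <- tapply(beetle_data$abundance, beetle_data$habitat, mean)
habitat_means   # forest 38.5, grassland 59
```

In Session 3 the same summary is written with group_by() and summarise(), which scales better to several grouping columns.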

1.4 Loading Packages

# Load packages at the start of EVERY session
library(tidyverse)    # Data wrangling and visualization
library(vegan)        # Community ecology
library(ape)          # PCoA

# If you get "there is no package called 'x'":
# install.packages("x")
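
To check all workshop packages in one go instead of waiting for an error, something like the following sketch may help. The package names come from the Prerequisites list; required, installed and missing_pkgs are illustrative object names.

```r
# Packages listed in the Prerequisites section
required <- c("tidyverse", "vegan", "ape", "indicspecies")

# requireNamespace() returns TRUE if a package is installed, without attaching it
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need installing
missing_pkgs <- required[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)   # uncomment to install them
} else {
  message("All workshop packages are installed.")
}
```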

1.5 Creating Your First Script

Create a new script for this workshop:

File → New File → R Script
Save as: scripts/01_day1_analysis.R

Script header template:

#============================================================================
# Insect Community Analysis - Day 1
# Author: [Your Name]
# Date: [Today's Date]
# Description: Import, clean, and explore insect survey data
#============================================================================

# Load packages ----
library(tidyverse)
library(vegan)

# Import data ----

# Data exploration ----

# Analysis ----

1.5.1 💡 Practice Exercise (5 min)

Create a new script called scripts/practice.R
Add a header with your name and today’s date
Load the tidyverse package
Create a vector called my_counts with values 5, 12, 8, 20, 15
Calculate the mean and save it to an object called avg_count
Save your script!

2 Session 2: Importing & Exploring Data (60 min)

2.1 Importing CSV Files

# Make sure your data file is in data/raw/
# Import with read.csv()
insect_data <- read.csv("data/raw/sample_insect_data.csv")

# Common options for messy data:
insect_data <- read.csv(
  "data/raw/sample_insect_data.csv",
  header = TRUE,                    # First row is column names
  stringsAsFactors = FALSE,         # Keep text as text
  na.strings = c("", "NA", "N/A")   # Recognize these as missing
)
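
To see what na.strings actually changes, here is a small sketch that reads a tiny CSV written inline. The values are made up for illustration; the text = argument of read.csv() lets the example run without a file.

```r
# A tiny CSV written inline (made-up values)
csv_text <- "site,habitat,abundance
S01,forest,12
S02,,N/A
S03,grassland,NA"

# Default import: "N/A" is not recognised as missing, so abundance is read as text
default_import <- read.csv(text = csv_text)
class(default_import$abundance)   # "character" (in R 4.0 or later)

# With na.strings, blanks and "N/A" become NA and abundance stays numeric
clean_import <- read.csv(
  text = csv_text,
  stringsAsFactors = FALSE,
  na.strings = c("", "NA", "N/A")
)
class(clean_import$abundance)     # "integer"
colSums(is.na(clean_import))      # site 0, habitat 1, abundance 2
```

A numeric column that shows up as text in str() is often a sign that a missing-value code was not listed in na.strings.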

2.2 First Look at Your Data

Always explore data immediately after importing!

# View the first few rows
head(insect_data)

# View in spreadsheet format (capital V!)
View(insect_data)

# Structure: column types and preview
str(insect_data)

# Dimensions: rows × columns
dim(insect_data)
nrow(insect_data)
ncol(insect_data)

# Column names
names(insect_data)

# Summary statistics
summary(insect_data)

2.3 Checking for Problems

# Missing values
sum(is.na(insect_data))           # Total NAs
colSums(is.na(insect_data))       # NAs per column

# Unique values in categorical columns
unique(insect_data$habitat)
unique(insect_data$order)

# Check for unexpected values
table(insect_data$habitat)        # Frequency table
table(insect_data$order)

# Check numeric ranges
range(insect_data$abundance, na.rm = TRUE)   # na.rm = TRUE ignores missing values
summary(insect_data$abundance)
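
Missing values change how numeric checks behave, so it helps to see the effect on a small vector first. A sketch with made-up abundances:

```r
# Made-up abundances with one missing value
abundance <- c(12, 45, NA, 8)

range(abundance)                 # NA NA: a single NA hides the real range
range(abundance, na.rm = TRUE)   # 8 45
mean(abundance, na.rm = TRUE)    # 21.66667

# Negative counts are impossible; which() skips the NA comparison
which(abundance < 0)             # integer(0), meaning none found
```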

2.4 Basic Subsetting

# Filter rows by condition
forest_data <- insect_data[insect_data$habitat == "forest", ]
coleoptera <- insect_data[insect_data$order == "Coleoptera", ]

# Select specific columns
selected <- insect_data[, c("site", "habitat", "morphospecies", "abundance")]

# Combine conditions
forest_beetles <- insect_data[
  insect_data$habitat == "forest" & insect_data$order == "Coleoptera", 
]
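
One thing to watch with bracket subsetting: if the column you test contains NA, the result includes extra rows filled with NA. A sketch with a made-up data frame (toy is an illustrative name) showing the problem and two common ways around it:

```r
# Toy data (made-up values) with one missing habitat
toy <- data.frame(
  site = c("S01", "S02", "S03"),
  habitat = c("forest", NA, "grassland"),
  abundance = c(12, 5, 18)
)

# Logical indexing keeps an all-NA row where the comparison is NA
toy[toy$habitat == "forest", ]           # 2 rows: S01 plus a row of NAs

# which() drops the NA comparisons, leaving only true matches
toy[which(toy$habitat == "forest"), ]    # 1 row: S01

# subset() also treats NA as "no match"
subset(toy, habitat == "forest")         # 1 row: S01
```

The filter() function introduced in Session 3 behaves like subset() here and drops rows where the condition is NA.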

2.4.1 💡 Practice Exercise (10 min)

Import sample_insect_data.csv
How many rows and columns does it have?
What are the unique habitat types?
How many records are there for each order?
Are there any missing values?

3 Session 3: Data Wrangling with tidyverse (75 min)

3.1 The Pipe Operator

The pipe %>% chains operations together. Read it as “then”.

library(tidyverse)

# Without pipe (nested, hard to read):
round(mean(sqrt(c(1, 4, 9, 16))), 2)

# With pipe (step by step, easy to read):
c(1, 4, 9, 16) %>%
  sqrt() %>%
  mean() %>%
  round(2)

# Read as: "Take 1,4,9,16, THEN sqrt, THEN mean, THEN round"

Keyboard shortcut: Ctrl + Shift + M (Windows) or Cmd + Shift + M (Mac)
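
You may also see |> in other people's code. This is the native pipe built into R since version 4.1, and for simple chains like the one above it works the same way without loading any package. A minimal sketch:

```r
# Same calculation with the native pipe (requires R 4.1 or later)
result <- c(1, 4, 9, 16) |>
  sqrt() |>
  mean() |>
  round(2)

result   # 2.5
```

This manual uses %>% throughout, so stick with it during the workshop to match the examples.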

3.2 Core dplyr Verbs

3.2.1 filter(): Keep Rows

# Keep only forest sites
insect_data %>%
  filter(habitat == "forest")

# Multiple conditions (AND)
insect_data %>%
  filter(habitat == "forest", abundance > 10)

# Multiple options (OR)
insect_data %>%
  filter(habitat %in% c("forest", "grassland"))

# Exclude
insect_data %>%
  filter(habitat != "agriculture")

3.2.2 select(): Choose Columns

# Keep specific columns
insect_data %>%
  select(site, habitat, morphospecies, abundance)

# Exclude columns
insect_data %>%
  select(-trap_id, -family)

# Select helpers
insect_data %>%
  select(starts_with("site"))

insect_data %>%
  select(contains("species"))

3.2.3 mutate(): Create New Columns

insect_data %>%
  mutate(
    log_abundance = log(abundance + 1),
    abundance_doubled = abundance * 2,
    habitat_upper = toupper(habitat)
  )
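
The + 1 in log(abundance + 1) is there because count data often contain zeros, and the log of zero is not a usable number. A short worked example with made-up counts:

```r
abundance <- c(0, 9, 99)

log(abundance)        # -Inf for the zero count
log(abundance + 1)    # 0 for the zero count
log1p(abundance)      # same values as log(abundance + 1)
```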

3.2.4 arrange(): Sort Rows

# Ascending
insect_data %>%
  arrange(abundance)

# Descending
insect_data %>%
  arrange(desc(abundance))

# Multiple columns
insect_data %>%
  arrange(habitat, desc(abundance))

3.2.5 summarise(): Calculate Summaries

insect_data %>%
  summarise(
    total_abundance = sum(abundance),
    mean_abundance = mean(abundance),
    n_records = n()
  )

3.2.6 group_by() + summarise(): Summaries by Group

Combining group_by() with summarise() returns one summary row per group. You will use this pattern constantly.

# Summary by habitat
insect_data %>%
  group_by(habitat) %>%
  summarise(
    total_abundance = sum(abundance),
    n_species = n_distinct(morphospecies),
    n_sites = n_distinct(site),
    .groups = "drop"
  )

# Summary by habitat AND order
insect_data %>%
  group_by(habitat, order) %>%
  summarise(
    mean_abundance = mean(abundance),
    n_species = n_distinct(morphospecies),
    .groups = "drop"
  )
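
To check that you understand what each summary counts, it helps to run the pattern on data small enough to verify by eye. A sketch with made-up values (toy and habitat_summary are illustrative names; the columns mirror insect_data):

```r
library(dplyr)

# Toy data (made-up values) with the same column names as insect_data
toy <- tibble(
  site = c("S01", "S01", "S02", "S03"),
  habitat = c("forest", "forest", "forest", "grassland"),
  morphospecies = c("sp1", "sp2", "sp1", "sp3"),
  abundance = c(12, 5, 18, 7)
)

habitat_summary <- toy %>%
  group_by(habitat) %>%
  summarise(
    total_abundance = sum(abundance),          # forest: 12 + 5 + 18 = 35
    n_species = n_distinct(morphospecies),     # forest: sp1, sp2 = 2
    n_sites = n_distinct(site),                # forest: S01, S02 = 2
    .groups = "drop"
  )

habitat_summary
```

Note that n_species is 2 for forest, not 3: n_distinct() counts each morphospecies once even though sp1 appears at two sites.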

3.3 Reshaping Data

3.3.1 Wide vs Long Format

Long format (one observation per row):

site  | species     | abundance
------|-------------|----------
S01   | Carabus_sp1 | 12
S01   | Carabus_sp2 | 5
S02   | Carabus_sp1 | 18

Wide format (community matrix):

site  | Carabus_sp1 | Carabus_sp2
------|-------------|------------
S01   | 12          | 5
S02   | 18          | 0

3.3.2 pivot_wider(): Long to Wide

# Create community matrix
community_matrix <- insect_data %>%
  group_by(site, morphospecies) %>%
  summarise(abundance = sum(abundance), .groups = "drop") %>%
  pivot_wider(
    names_from = morphospecies,
    values_from = abundance,
    values_fill = 0           # Fill missing with 0
  )

head(community_matrix)

3.3.3 pivot_longer(): Wide to Long

# Convert back to long format
long_data <- community_matrix %>%
  pivot_longer(
    cols = -site,             # All columns except site
    names_to = "morphospecies",   # Same column name as in insect_data
    values_to = "abundance"
  )
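
A round trip from long to wide and back does not return exactly the original table, because the zeros filled in by values_fill become explicit rows. A sketch using the three-row example table from section 3.3.1 (long_toy, wide_toy and back_to_long are illustrative names):

```r
library(dplyr)
library(tidyr)

# Long-format toy data matching the example table in 3.3.1
long_toy <- tibble(
  site = c("S01", "S01", "S02"),
  species = c("Carabus_sp1", "Carabus_sp2", "Carabus_sp1"),
  abundance = c(12, 5, 18)
)

# Long to wide: the missing S02 / Carabus_sp2 combination becomes 0
wide_toy <- long_toy %>%
  pivot_wider(names_from = species, values_from = abundance, values_fill = 0)

# Wide to long: the explicit zero is kept, so there are now 4 rows, not 3
back_to_long <- wide_toy %>%
  pivot_longer(cols = -site, names_to = "species", values_to = "abundance")

nrow(long_toy)       # 3
nrow(back_to_long)   # 4
```

Keep this in mind when counting records or species from a table that has been through pivot_wider(): filter out the zeros first.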

3.4 Chaining Multiple Operations

# Complete data preparation pipeline
beetle_summary <- insect_data %>%
  # Filter to beetles only
  filter(order == "Coleoptera") %>%
  # Group by site, habitat, and landscape
  group_by(site, habitat, landscape) %>%
  # Calculate summary statistics
  summarise(
    abundance = sum(abundance),
    richness = n_distinct(morphospecies),
    .groups = "drop"
  ) %>%
  # Add new columns
  mutate(
    log_abundance = log(abundance + 1)
  ) %>%
  # Sort by richness
  arrange(desc(richness))

beetle_summary

3.4.1 💡 Practice Exercise (15 min)

Using insect_data:

Filter to keep only Hymenoptera
Group by habitat and calculate total abundance and species richness
Arrange by total abundance (descending)
Which habitat has the most Hymenoptera?

3.5 Session 4: Data Exploration & Cleaning (60 min)

3.6 Exploring Your Data: Finding the Story

Data exploration is detective work. Find the patterns before testing them.

3.6.1 Overall Dataset Summary

# Quick overview
cat("=== DATASET OVERVIEW ===\n")
cat("Total records:", nrow(insect_data), "\n")
cat("Total individuals:", sum(insect_data$abundance), "\n")
cat("Number of sites:", n_distinct(insect_data$site), "\n")
cat("Number of habitats:", n_distinct(insect_data$habitat), "\n")
cat("Number of orders:", n_distinct(insect_data$order), "\n")
cat("Number of morphospecies:", n_distinct(insect_data$morphospecies), "\n")

3.6.2 Summary by Taxonomic Group

order_summary <- insect_data %>%
  group_by(order) %>%
  summarise(
    total_abundance = sum(abundance),
    n_species = n_distinct(morphospecies),
    n_sites = n_distinct(site),
    .groups = "drop"
  ) %>%
  arrange(desc(total_abundance))

order_summary

3.7 Choosing Focal Taxa

Not all groups are suitable for analysis. Choose wisely!

3.7.1 Criteria for Focal Taxa

Criterion            | Minimum        | Why
---------------------|----------------|--------------------------
Total abundance      | ≥ 50           | Statistical power
Species richness     | ≥ 5            | Diversity to analyze
Sites present        | ≥ 50% of sites | Not just rare occurrence
Habitat variation    | CV > 20%       | Interesting patterns
Ecological relevance | High           | Meaningful interpretation

3.7.2 Evaluating Groups

# Calculate habitat variation
habitat_variation <- insect_data %>%
  group_by(order, habitat) %>%
  summarise(abundance = sum(abundance), .groups = "drop") %>%
  group_by(order) %>%
  summarise(
    cv_abundance = sd(abundance) / mean(abundance) * 100,
    .groups = "drop"
  )

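# Combine with order summary and apply the numeric criteria from the table
# (ecological relevance cannot be computed; judge that yourself)
n_sites_total <- n_distinct(insect_data$site)

order_evaluation <- order_summary %>%
  left_join(habitat_variation, by = "order") %>%
  mutate(
    recommended = total_abundance >= 50 &
                  n_species >= 5 &
                  n_sites >= 0.5 * n_sites_total &
                  cv_abundance > 20
  )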

order_evaluation
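
The coefficient of variation (CV) used above is the standard deviation divided by the mean, times 100. A worked example with made-up habitat totals for two hypothetical orders shows what passing and failing the 20% criterion looks like:

```r
# Made-up habitat totals for two hypothetical orders
variable_order <- c(forest = 40, grassland = 60, agriculture = 20)
even_order     <- c(forest = 38, grassland = 40, agriculture = 42)

# CV = standard deviation / mean, expressed as a percentage
cv_variable <- sd(variable_order) / mean(variable_order) * 100
cv_even     <- sd(even_order) / mean(even_order) * 100

cv_variable   # 50: well above the 20% threshold
cv_even       # 5: abundance is nearly the same in every habitat
```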

3.7.3 Ecological Considerations for Pitfall Traps

Group                        | Pitfall Suitability | Notes
-----------------------------|---------------------|-----------------------------------
Carabidae (ground beetles)   | Excellent           | Well-studied indicators
Formicidae (ants)            | Excellent           | Colonial, sensitive to disturbance
Araneae (spiders)            | Good                | Predators, hunting guilds
Staphylinidae (rove beetles) | Good                | Diverse decomposers
Orthoptera (grasshoppers)    | Moderate            | Vegetation-dependent
Flying insects               | Poor                | Undersampled by pitfalls

3.8 Data Cleaning

3.8.1 Common Problems and Solutions

# Check for problems
unique(insect_data$habitat)  # Look for typos, case issues

# Standardize text
clean_data <- insect_data %>%
  mutate(
    habitat = tolower(trimws(habitat)),     # Lowercase, trim leading/trailing spaces
    habitat = case_when(                     # Fix typos
      habitat == "forrest" ~ "forest",
      habitat == "grasland" ~ "grassland",
      TRUE ~ habitat
    )
  )

# Handle impossible values
clean_data <- clean_data %>%
  filter(abundance >= 0) %>%                # Remove negative values
  filter(!is.na(abundance))                 # Remove missing values
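
To see these cleaning steps at work, you can run them on a small data frame that contains each problem once. A sketch with made-up values (messy and cleaned are illustrative names):

```r
library(dplyr)

# Toy data (made-up values) with typical data-entry problems
messy <- tibble(
  site = c("S01", "S02", "S03", "S04", "S05"),
  habitat = c("Forest", " forest", "forrest", "grasland", "grassland"),
  abundance = c(12, 5, -3, NA, 18)
)

cleaned <- messy %>%
  mutate(
    habitat = tolower(trimws(habitat)),      # fixes "Forest" and " forest"
    habitat = case_when(                     # fixes the two typos
      habitat == "forrest" ~ "forest",
      habitat == "grasland" ~ "grassland",
      TRUE ~ habitat
    )
  ) %>%
  filter(!is.na(abundance), abundance >= 0)  # drops S04 (NA) and S03 (negative)

table(cleaned$habitat)   # forest 2, grassland 1
```

Always rerun unique() or table() after cleaning to confirm that only the expected categories remain.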

3.9 Creating Analysis-Ready Data

3.9.1 Community Matrix

# For Coleoptera analysis
comm_matrix <- insect_data %>%
  filter(order == "Coleoptera") %>%
  group_by(site, morphospecies) %>%
  summarise(abundance = sum(abundance), .groups = "drop") %>%
  pivot_wider(
    names_from = morphospecies,
    values_from = abundance,
    values_fill = 0
  ) %>%
  column_to_rownames("site")

head(comm_matrix)

3.9.2 Environmental Data

# Create matching environmental data
env_data <- insect_data %>%
  select(site, habitat, landscape) %>%
  distinct() %>%
  column_to_rownames("site")

# Ensure same order as community matrix
env_data <- env_data[rownames(comm_matrix), , drop = FALSE]

# Verify alignment
all(rownames(comm_matrix) == rownames(env_data))  # Must be TRUE!
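
Misaligned rows are easy to miss because nothing fails visibly: the analysis runs, but each site gets the wrong habitat. A sketch with made-up values (comm_toy and env_toy are illustrative names) showing the reordering step and a check that stops the script if alignment fails:

```r
# Toy community matrix and environmental table (made-up values)
comm_toy <- data.frame(
  sp1 = c(12, 18, 0),
  sp2 = c(5, 0, 7),
  row.names = c("S01", "S02", "S03")
)
env_toy <- data.frame(
  habitat = c("grassland", "forest", "forest"),
  row.names = c("S03", "S01", "S02")   # deliberately in a different order
)

# Same sites, wrong order: the check fails
all(rownames(comm_toy) == rownames(env_toy))   # FALSE

# Reorder env_toy by the community matrix row names
env_toy <- env_toy[rownames(comm_toy), , drop = FALSE]

# stopifnot() halts the script if the tables are still misaligned
stopifnot(identical(rownames(comm_toy), rownames(env_toy)))
env_toy$habitat   # "forest" "forest" "grassland"
```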

3.9.3 Save Processed Data

# Save for tomorrow
write.csv(comm_matrix, "data/processed/beetle_community_matrix.csv")
write.csv(env_data, "data/processed/environmental_data.csv")
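
Because row.names = FALSE is not set here, write.csv() stores the site names as the first column of each file. When you read the files back, you need to tell R to turn that column into row names again. A sketch using a temporary file in place of the real path (comm_toy, tmp and comm_back are illustrative names):

```r
# Toy matrix with site names as row names (made-up values)
comm_toy <- data.frame(
  sp1 = c(12, 18),
  sp2 = c(5, 0),
  row.names = c("S01", "S02")
)

# A temporary file stands in for data/processed/beetle_community_matrix.csv
tmp <- tempfile(fileext = ".csv")
write.csv(comm_toy, tmp)   # row names are written as the first column

# row.names = 1 turns that first column back into row names
comm_back <- read.csv(tmp, row.names = 1)

rownames(comm_back)          # "S01" "S02"
all(comm_back == comm_toy)   # TRUE
```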

3.9.4 💡 Practice Exercise (10 min)

Evaluate which orders in your dataset meet the criteria for analysis
Choose one focal order and create:
- A community matrix (sites × species)
- A matching environmental data frame
Save both to data/processed/

END OF DAY 1

Summary

Today you learned:

✅ Project organization and why it matters
✅ Importing and exploring data
✅ Data wrangling with dplyr
✅ Choosing focal taxa
✅ Creating community matrices

Homework (Optional)

Complete any unfinished practice exercises
Try the Day 1 Capstone Exercise (in the Exercises document)
Review the concepts we covered

Tomorrow

Visualization with ggplot2
Diversity metrics
NMDS ordination
PERMANOVA

Happy coding and happy bug hunting!

Workshop: Introduction to R Statistics for Insect Ecology

A 2-Day Workshop for Entomology Students

Workshop Manual by Amanda Mawan

11-12 February 2026