Workshop Overview

Schedule

╔══════════════════════════════════════════════════════════════════════════════╗
║  DAY 1: DATA FOUNDATIONS                         9:00 - 15:00                ║
╠══════════════════════════════════════════════════════════════════════════════╣
║  09:00 - 10:15  Session 1: Project Setup & R Essentials            75 min    ║
║  10:15 - 10:30  ☕ Break                                            15 min    ║
║  10:30 - 11:30  Session 2: Importing & Exploring Data              60 min    ║
║  11:30 - 12:30  🍽️ LUNCH                                            60 min   ║
║  12:30 - 13:45  Session 3: Data Wrangling with tidyverse           75 min    ║
║  13:45 - 14:00  ☕ Break                                            15 min    ║
║  14:00 - 15:00  Session 4: Data Exploration & Cleaning             60 min    ║
╠══════════════════════════════════════════════════════════════════════════════╣
║  DAY 2: ANALYSIS & INTERPRETATION                9:00 - 15:00                ║
╠══════════════════════════════════════════════════════════════════════════════╣
║  09:00 - 10:00  Session 5: Visualization with ggplot2              60 min    ║
║  10:00 - 10:15  ☕ Break                                            15 min    ║
║  10:15 - 11:15  Session 6: Diversity Metrics                       60 min    ║
║  11:15 - 12:15  🍽️ LUNCH                                            60 min   ║
║  12:15 - 13:30  Session 7: Multivariate Analysis (NMDS)            75 min    ║
║  13:30 - 13:45  ☕ Break                                            15 min    ║
║  13:45 - 14:45  Session 8: Statistical Testing                     60 min    ║
║  14:45 - 15:00  Wrap-up & Next Steps                               15 min    ║
╚══════════════════════════════════════════════════════════════════════════════╝

Prerequisites

Before this workshop, you should have completed:

  • ✅ Installed R and RStudio
  • ✅ Installed required packages (tidyverse, vegan, ape, indicspecies)
  • ✅ Created the workshop project with folder structure
  • ✅ Completed the Pre-Workshop Preparation exercises

The Ecological Context

When we sample insects across different habitats or landscapes, we’re fundamentally asking:

“Do different environments support different insect communities, and why?”

This connects to ecological theory:

  • Niche theory: Species occur where conditions match their requirements
  • Habitat filtering: Environmental conditions “filter” which species persist
  • Dispersal limitation: Species can only occur where they can reach

Our statistical analyses detect these patterns and test hypotheses.

Analysis Workflow Overview

RAW DATA → CLEAN & PREPARE → EXPLORE → ANALYZE → INTERPRET
              │                  │         │
              │                  │         ├── Alpha diversity
              │                  │         ├── NMDS ordination
              │                  │         └── PERMANOVA
              │                  │
              │                  └── Which taxa to focus on?
              │
              └── Community matrix + Environmental data

DAY 1: DATA FOUNDATIONS


1 Session 1: Project Setup & R Essentials (75 min)

1.1 Opening Your Workshop Project

🔑 KEY CONCEPT: Always work within an R Project. Never use setwd().

1.1.1 Starting the Workshop

  1. Open RStudio
  2. Go to File → Open Project
  3. Navigate to your Insect_Ecology_Workshop folder
  4. Select the .Rproj file
  5. Click Open

Verify you’re in the right place:

# Check your working directory
getwd()
# Should show: ".../Insect_Ecology_Workshop"

# List files in your project
list.files()
# Should show: data, scripts, output, figures folders

1.1.2 If You Haven’t Created the Project Yet

# Create folder structure
dir.create("data")
dir.create("data/raw")
dir.create("data/processed")
dir.create("scripts")
dir.create("output")
dir.create("figures")

# Verify
list.files()
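
The dir.create() calls above warn if a folder already exists. A minimal sketch of a repeatable version follows, assuming the same folder names. `project_root` is a hypothetical variable that points at a temporary directory so the example can be run safely anywhere; inside your own project you would use "." instead.

```r
# Hypothetical project root: a temp directory, for demonstration only
project_root <- file.path(tempdir(), "Insect_Ecology_Workshop_demo")

folders <- c("data/raw", "data/processed", "scripts", "output", "figures")

for (folder in folders) {
  path <- file.path(project_root, folder)
  # Only create what is missing; recursive = TRUE also creates parents such as "data"
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
}

# Confirm that every folder now exists
all(dir.exists(file.path(project_root, folders)))
```

Running the block a second time changes nothing, which makes it safe to keep at the top of a setup script.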

1.2 Project Organization Best Practices

1.2.1 The Golden Structure

Insect_Ecology_Workshop/
│
├── Insect_Ecology_Workshop.Rproj   ← Always open THIS file!
│
├── data/
│   ├── raw/                        ← Original data (NEVER modify!)
│   │   ├── sample_insect_data.csv
│   │   └── pollinator_data.csv
│   └── processed/                  ← Cleaned data goes here
│       └── beetles_clean.csv
│
├── scripts/                        ← Your R scripts
│   ├── 01_data_import.R
│   ├── 02_exploration.R
│   ├── 03_diversity.R
│   └── 04_ordination.R
│
├── output/                         ← Tables, results
│   └── diversity_table.csv
│
└── figures/                        ← Saved plots
    ├── nmds_plot.png
    └── diversity_boxplot.pdf

1.2.2 Why This Matters

# ❌ BAD - Absolute paths that break on other computers:
setwd("C:/Users/John/Desktop/thesis/chapter2/data")
data <- read.csv("beetles.csv")

# ❌ BAD - Files scattered everywhere:
data <- read.csv("C:/Users/John/Downloads/beetles.csv")

# ✅ GOOD - Relative paths from project folder:
data <- read.csv("data/raw/beetles.csv")
# Works on ANY computer with the same folder structure!

# ✅ GOOD - Saving outputs to organized folders:
write.csv(results, "output/diversity_results.csv")
ggsave("figures/nmds_plot.png")
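
Relative paths can also be built with base R's file.path(), which joins its arguments with a forward slash on every operating system. The sketch below only constructs and checks paths; the file names are illustrative and are not assumed to exist.

```r
# Build relative paths from their components
raw_path <- file.path("data", "raw", "beetles.csv")
out_path <- file.path("output", "diversity_results.csv")

raw_path
out_path

# file.exists() lets a script check for an input file before trying to read it
file.exists(raw_path)
```

Storing paths in objects near the top of a script means a renamed file needs changing in only one place.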

1.2.3 The Raw Data Rule

Never modify files in data/raw/!

Your raw data should remain exactly as you received it. If you need to clean or modify data:

  1. Read from data/raw/
  2. Clean in R
  3. Save to data/processed/

# Load the tidyverse for %>%, filter(), and mutate()
library(tidyverse)

# Read raw data
beetles_raw <- read.csv("data/raw/beetle_survey.csv")

# Clean it and store the result in a new object, beetles_clean
beetles_clean <- beetles_raw %>%
  # filter() keeps only the rows that meet a condition:
  # here, rows where abundance is not missing
  filter(!is.na(abundance)) %>%
  # mutate() creates a new column or modifies an existing one:
  # here, tolower() converts the habitat column to lowercase
  mutate(habitat = tolower(habitat))

# Save the cleaned version to the processed folder
write.csv(beetles_clean, "data/processed/beetles_clean.csv", row.names = FALSE)
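
To see the raw data rule in action without touching real files, here is a self-contained sketch that uses a temporary folder and made-up survey rows. It uses base R instead of dplyr so it runs without extra packages; `demo_root` and the three data rows are assumptions for illustration only. The checksum from tools::md5sum() confirms the raw file is unchanged after cleaning.

```r
# Hypothetical demo folders inside a temp directory
demo_root <- file.path(tempdir(), "raw_rule_demo")
dir.create(file.path(demo_root, "data", "raw"),
           recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo_root, "data", "processed"),
           recursive = TRUE, showWarnings = FALSE)

raw_file   <- file.path(demo_root, "data", "raw", "beetle_survey.csv")
clean_file <- file.path(demo_root, "data", "processed", "beetles_clean.csv")

# Made-up raw data: one missing abundance, inconsistent capitalisation
writeLines(c("site,habitat,abundance",
             "S01,Forest,12",
             "S02,GRASSLAND,NA",
             "S03,forest,7"), raw_file)
checksum_before <- tools::md5sum(raw_file)

# 1. Read from raw, 2. clean in R, 3. save to processed
beetles_raw <- read.csv(raw_file)
beetles_clean <- beetles_raw[!is.na(beetles_raw$abundance), ]
beetles_clean$habitat <- tolower(beetles_clean$habitat)
write.csv(beetles_clean, clean_file, row.names = FALSE)

# The raw file is byte-for-byte identical to what we started with
identical(unname(checksum_before), unname(tools::md5sum(raw_file)))
```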

1.3 R Essentials Quick Review

1.3.1 Objects and Assignment

# Store values in objects using <-
n_sites <- 12
study_area <- "West Java"
is_dry_season <- TRUE

# View objects by typing their name
n_sites
study_area

# Keyboard shortcut for <- : Alt + - (Windows) or Option + - (Mac)

1.3.2 Functions

# Functions perform actions: function_name(arguments)
mean(c(10, 20, 30))        # Calculate mean
sum(c(1, 2, 3, 4, 5))      # Calculate sum
length(c(5, 10, 15))       # Count elements

# Functions can have multiple arguments
round(3.14159, digits = 2)  # Round to 2 decimal places

1.3.3 Vectors

# Vectors are collections of values
abundances <- c(12, 45, 23, 8, 67, 34)

# Operations on vectors
sum(abundances)            # 189
mean(abundances)           # 31.5
abundances * 2             # Doubles each element

# Indexing: extract elements with [ ]
abundances[1]              # First element: 12
abundances[c(1, 3, 5)]     # Elements 1, 3, 5
abundances[abundances > 30] # Elements greater than 30

1.3.4 Data Frames

# Data frames are like spreadsheets
beetle_data <- data.frame(
  site = c("A", "B", "C", "D"),
  habitat = c("forest", "forest", "grassland", "grassland"),
  abundance = c(45, 32, 67, 51)
)

# Access columns with $
beetle_data$abundance
beetle_data$habitat

# Access rows with [ row , column ]
beetle_data[1, ]           # First row
beetle_data[, 3]           # Third column
beetle_data[beetle_data$habitat == "forest", ]  # Forest rows only

1.4 Loading Packages

# Load packages at the start of EVERY session
library(tidyverse)    # Data wrangling and visualization
library(vegan)        # Community ecology
library(ape)          # PCoA

# If you get "there is no package called 'x'":
# install.packages("x")
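
One way to find missing packages before the session starts is to test each one with requireNamespace(), which returns TRUE or FALSE instead of stopping with an error. The sketch below assumes the package list from the Prerequisites section; it reports what is missing and leaves installation to you.

```r
required_pkgs <- c("tidyverse", "vegan", "ape", "indicspecies")

# TRUE for each package that can be loaded, FALSE otherwise
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install them
} else {
  message("All required packages are installed.")
}
```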

1.5 Creating Your First Script

Create a new script for this workshop:

  1. File → New File → R Script
  2. Save as: scripts/01_day1_analysis.R

Script header template:

#============================================================================
# Insect Community Analysis - Day 1
# Author: [Your Name]
# Date: [Today's Date]
# Description: Import, clean, and explore insect survey data
#============================================================================

# Load packages ----
library(tidyverse)
library(vegan)

# Import data ----

# Data exploration ----

# Analysis ----

1.5.1 💡 Practice Exercise (5 min)

  1. Create a new script called scripts/practice.R
  2. Add a header with your name and today’s date
  3. Load the tidyverse package
  4. Create a vector called my_counts with values 5, 12, 8, 20, 15
  5. Calculate the mean and save it to an object called avg_count
  6. Save your script!

2 Session 2: Importing & Exploring Data (60 min)

2.1 Importing CSV Files

# Make sure your data file is in data/raw/
# Import with read.csv()
insect_data <- read.csv("data/raw/sample_insect_data.csv")

# Common options for messy data:
insect_data <- read.csv(
  "data/raw/sample_insect_data.csv",
  header = TRUE,                    # First row is column names
  stringsAsFactors = FALSE,         # Keep text as text
  na.strings = c("", "NA", "N/A")   # Recognize these as missing
)
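
The effect of na.strings is easiest to see on a small file. The sketch below writes three made-up rows to a temporary CSV (an assumption for illustration, not the workshop data) and imports it twice. The comments on column types assume R 4.0 or later, where text columns are read as character by default.

```r
# Write a tiny, made-up CSV with a blank habitat and an "N/A" abundance
csv_path <- tempfile(fileext = ".csv")
writeLines(c("site,habitat,abundance",
             "S01,forest,12",
             "S02,,N/A",
             "S03,grassland,7"), csv_path)

default_import <- read.csv(csv_path)
strict_import  <- read.csv(csv_path, na.strings = c("", "NA", "N/A"))

# With the defaults, "N/A" is ordinary text, so the whole column becomes text
str(default_import$abundance)

# With na.strings set, the column stays numeric and the gap is a real NA
str(strict_import$abundance)

# NAs per column
colSums(is.na(strict_import))
```

A numeric column that unexpectedly imports as text is often a sign of an unrecognised missing-value code.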

2.2 First Look at Your Data

Always explore data immediately after importing!

# View the first few rows
head(insect_data)

# View in spreadsheet format (capital V!)
View(insect_data)

# Structure: column types and preview
str(insect_data)

# Dimensions: rows × columns
dim(insect_data)
nrow(insect_data)
ncol(insect_data)

# Column names
names(insect_data)

# Summary statistics
summary(insect_data)

2.3 Checking for Problems

# Missing values
sum(is.na(insect_data))           # Total NAs
colSums(is.na(insect_data))       # NAs per column

# Unique values in categorical columns
unique(insect_data$habitat)
unique(insect_data$order)

# Check for unexpected values
table(insect_data$habitat)        # Frequency table
table(insect_data$order)

# Check numeric ranges
range(insect_data$abundance, na.rm = TRUE)  # na.rm = TRUE ignores missing values
summary(insect_data$abundance)
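
To show what these checks look like when something is actually wrong, here is a sketch on a small made-up data frame. The `survey` object and the `allowed` habitat list are assumptions for illustration. Note that table() hides NAs unless useNA is set.

```r
# Made-up records containing a case issue, a typo, an NA, and a negative count
survey <- data.frame(
  habitat = c("forest", "Forest", "forrest", "grassland", NA),
  abundance = c(12, 5, -3, 40, 8)
)

# Frequency table that also counts missing values
table(survey$habitat, useNA = "ifany")

# Compare observed categories against the ones you expect
allowed <- c("forest", "grassland")
unexpected <- setdiff(unique(survey$habitat), c(allowed, NA))
unexpected

# Rows with impossible (negative) abundances
survey[survey$abundance < 0, ]
```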

2.4 Basic Subsetting

# Filter rows by condition
forest_data <- insect_data[insect_data$habitat == "forest", ]
coleoptera <- insect_data[insect_data$order == "Coleoptera", ]

# Select specific columns
selected <- insect_data[, c("site", "habitat", "morphospecies", "abundance")]

# Combine conditions
forest_beetles <- insect_data[
  insect_data$habitat == "forest" & insect_data$order == "Coleoptera", 
]

2.4.1 💡 Practice Exercise (10 min)

  1. Import sample_insect_data.csv
  2. How many rows and columns does it have?
  3. What are the unique habitat types?
  4. How many records are there for each order?
  5. Are there any missing values?

3 Session 3: Data Wrangling with tidyverse (75 min)

3.1 The Pipe Operator

The pipe %>% chains operations together. Read it as “then”.

library(tidyverse)

# Without pipe (nested, hard to read):
round(mean(sqrt(c(1, 4, 9, 16))), 2)

# With pipe (step by step, easy to read):
c(1, 4, 9, 16) %>%
  sqrt() %>%
  mean() %>%
  round(2)

# Read as: "Take 1,4,9,16, THEN sqrt, THEN mean, THEN round"

Keyboard shortcut: Ctrl + Shift + M (Windows) or Cmd + Shift + M (Mac)
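
As a quick check that piping changes how code reads but not what it computes, the sketch below compares the nested and piped forms. It uses base R's native pipe |>, available in R 4.1 and later, so it runs without loading any package; this is a substitution for illustration, and %>% from the tidyverse gives the same result here.

```r
values <- c(1, 4, 9, 16)

# Nested form: read from the inside out
nested <- round(mean(sqrt(values)), 2)

# Piped form: read from left to right
piped <- values |> sqrt() |> mean() |> round(2)

nested
piped
identical(nested, piped)
```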

3.2 Core dplyr Verbs

3.2.1 filter(): Keep Rows

# Keep only forest sites
insect_data %>%
  filter(habitat == "forest")

# Multiple conditions (AND)
insect_data %>%
  filter(habitat == "forest", abundance > 10)

# Multiple options (OR)
insect_data %>%
  filter(habitat %in% c("forest", "grassland"))

# Exclude
insect_data %>%
  filter(habitat != "agriculture")
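
The difference between AND and OR is easy to mix up, so here is a small sketch on made-up rows (the `toy` data frame is an assumption for illustration). Inside filter(), a comma between conditions means AND; the | operator means OR across different columns.

```r
library(dplyr)

toy <- data.frame(
  site = c("S01", "S02", "S03", "S04"),
  habitat = c("forest", "forest", "grassland", "agriculture"),
  abundance = c(25, 4, 18, 9)
)

# AND: both conditions must hold (same as separating them with a comma)
and_rows <- toy %>%
  filter(habitat == "forest" & abundance > 10)

# OR: at least one condition must hold
or_rows <- toy %>%
  filter(habitat == "forest" | abundance > 10)

and_rows$site   # "S01"
or_rows$site    # "S01" "S02" "S03"
```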

3.2.2 select(): Choose Columns

# Keep specific columns
insect_data %>%
  select(site, habitat, morphospecies, abundance)

# Exclude columns
insect_data %>%
  select(-trap_id, -family)

# Select helpers
insect_data %>%
  select(starts_with("site"))

insect_data %>%
  select(contains("species"))

3.2.3 mutate(): Create New Columns

insect_data %>%
  mutate(
    log_abundance = log(abundance + 1),
    abundance_doubled = abundance * 2,
    habitat_upper = toupper(habitat)
  )

3.2.4 arrange(): Sort Rows

# Ascending
insect_data %>%
  arrange(abundance)

# Descending
insect_data %>%
  arrange(desc(abundance))

# Multiple columns
insect_data %>%
  arrange(habitat, desc(abundance))

3.2.5 summarise(): Calculate Summaries

insect_data %>%
  summarise(
    total_abundance = sum(abundance),
    mean_abundance = mean(abundance),
    n_records = n()
  )

3.2.6 group_by() + summarise(): Summaries by Group

This is extremely powerful!

# Summary by habitat
insect_data %>%
  group_by(habitat) %>%
  summarise(
    total_abundance = sum(abundance),
    n_species = n_distinct(morphospecies),
    n_sites = n_distinct(site),
    .groups = "drop"
  )

# Summary by habitat AND order
insect_data %>%
  group_by(habitat, order) %>%
  summarise(
    mean_abundance = mean(abundance),
    n_species = n_distinct(morphospecies),
    .groups = "drop"
  )
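
To make the grouped summary concrete, the sketch below runs the habitat summary on five made-up records (the `toy` data frame is an assumption for illustration), small enough to verify by hand.

```r
library(dplyr)

toy <- data.frame(
  site = c("S01", "S01", "S02", "S03", "S03"),
  habitat = c("forest", "forest", "forest", "grassland", "grassland"),
  morphospecies = c("sp1", "sp2", "sp1", "sp1", "sp3"),
  abundance = c(12, 5, 18, 7, 3)
)

habitat_summary <- toy %>%
  group_by(habitat) %>%
  summarise(
    total_abundance = sum(abundance),        # individuals per habitat
    n_species = n_distinct(morphospecies),   # unique morphospecies per habitat
    n_sites = n_distinct(site),              # unique sites per habitat
    .groups = "drop"
  )

habitat_summary
```

Forest has 35 individuals of 2 morphospecies across 2 sites; grassland has 10 individuals of 2 morphospecies at 1 site. Note that sp1 is counted once per habitat even though it appears in several records.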

3.3 Reshaping Data

3.3.1 Wide vs Long Format

Long format (one observation per row):

site  | species     | abundance
------|-------------|----------
S01   | Carabus_sp1 | 12
S01   | Carabus_sp2 | 5
S02   | Carabus_sp1 | 18

Wide format (community matrix):

site  | Carabus_sp1 | Carabus_sp2
------|-------------|------------
S01   | 12          | 5
S02   | 18          | 0

3.3.2 pivot_wider(): Long to Wide

# Create community matrix
community_matrix <- insect_data %>%
  group_by(site, morphospecies) %>%
  summarise(abundance = sum(abundance), .groups = "drop") %>%
  pivot_wider(
    names_from = morphospecies,
    values_from = abundance,
    values_fill = 0           # Fill missing with 0
  )

head(community_matrix)
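
The role of values_fill is clearest on the three-row example from the Wide vs Long Format section. In the sketch below, `long_toy` recreates that table; Carabus_sp2 was never recorded at S02, so the wide table needs a value for that cell.

```r
library(tidyr)

long_toy <- data.frame(
  site = c("S01", "S01", "S02"),
  species = c("Carabus_sp1", "Carabus_sp2", "Carabus_sp1"),
  abundance = c(12, 5, 18)
)

wide_toy <- pivot_wider(
  long_toy,
  names_from = species,
  values_from = abundance,
  values_fill = 0   # a species not recorded at a site counts as 0, not NA
)

wide_toy
```

Without values_fill, the S02 / Carabus_sp2 cell would be NA, which many community analyses cannot handle.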

3.3.3 pivot_longer(): Wide to Long

# Convert back to long format
long_data <- community_matrix %>%
  pivot_longer(
    cols = -site,             # All columns except site
    names_to = "species",
    values_to = "abundance"
  )

3.4 Chaining Multiple Operations

# Complete data preparation pipeline
beetle_summary <- insect_data %>%
  # Filter to beetles only
  filter(order == "Coleoptera") %>%
  # Group by site and habitat
  group_by(site, habitat, landscape) %>%
  # Calculate summary statistics
  summarise(
    abundance = sum(abundance),
    richness = n_distinct(morphospecies),
    .groups = "drop"
  ) %>%
  # Add new columns
  mutate(
    log_abundance = log(abundance + 1)
  ) %>%
  # Sort by richness
  arrange(desc(richness))

beetle_summary

3.4.1 💡 Practice Exercise (15 min)

Using insect_data:

  1. Filter to keep only Hymenoptera
  2. Group by habitat and calculate total abundance and species richness
  3. Arrange by total abundance (descending)
  4. Which habitat has the most Hymenoptera?

3.5 Session 4: Data Exploration & Cleaning (60 min)

3.6 Exploring Your Data: Finding the Story

Data exploration is detective work. Find the patterns before testing them.

3.6.1 Overall Dataset Summary

# Quick overview
cat("=== DATASET OVERVIEW ===\n")
cat("Total records:", nrow(insect_data), "\n")
cat("Total individuals:", sum(insect_data$abundance), "\n")
cat("Number of sites:", n_distinct(insect_data$site), "\n")
cat("Number of habitats:", n_distinct(insect_data$habitat), "\n")
cat("Number of orders:", n_distinct(insect_data$order), "\n")
cat("Number of morphospecies:", n_distinct(insect_data$morphospecies), "\n")

3.6.2 Summary by Taxonomic Group

order_summary <- insect_data %>%
  group_by(order) %>%
  summarise(
    total_abundance = sum(abundance),
    n_species = n_distinct(morphospecies),
    n_sites = n_distinct(site),
    .groups = "drop"
  ) %>%
  arrange(desc(total_abundance))

order_summary

3.7 Choosing Focal Taxa

Not all groups are suitable for analysis. Choose wisely!

3.7.1 Criteria for Focal Taxa

Criterion            | Minimum        | Why
---------------------|----------------|--------------------------
Total abundance      | ≥ 50           | Statistical power
Species richness     | ≥ 5            | Diversity to analyze
Sites present        | ≥ 50% of sites | Not just rare occurrence
Habitat variation    | CV > 20%       | Interesting patterns
Ecological relevance | High           | Meaningful interpretation

3.7.2 Evaluating Groups

# Calculate habitat variation
habitat_variation <- insect_data %>%
  group_by(order, habitat) %>%
  summarise(abundance = sum(abundance), .groups = "drop") %>%
  group_by(order) %>%
  summarise(
    cv_abundance = sd(abundance) / mean(abundance) * 100,
    .groups = "drop"
  )

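# Combine with order summary
order_evaluation <- order_summary %>%
  left_join(habitat_variation, by = "order") %>%
  mutate(
    # Apply the numeric criteria from the table in 3.7.1
    recommended = total_abundance >= 50 &
                  n_species >= 5 &
                  n_sites >= 0.5 * n_distinct(insect_data$site) &
                  cv_abundance > 20
  )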

order_evaluation
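
The coefficient of variation (CV) used above is the standard deviation divided by the mean, expressed as a percentage. A minimal worked sketch with made-up habitat totals (assumed values, not workshop data):

```r
# Made-up total abundance of one order in three habitats
habitat_totals <- c(forest = 30, grassland = 50, agriculture = 70)

# sd = 20 and mean = 50, so CV = 20 / 50 * 100 = 40%
cv_percent <- sd(habitat_totals) / mean(habitat_totals) * 100
cv_percent

# With only one habitat, sd() is NA, so the CV is NA as well
sd(c(forest = 50))
```

An order recorded in a single habitat therefore gets an NA for cv_abundance, and its recommended value will not be TRUE; treat such groups as needing a manual look.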

3.7.3 Ecological Considerations for Pitfall Traps

Group                        | Pitfall Suitability | Notes
-----------------------------|---------------------|-----------------------------------
Carabidae (ground beetles)   | Excellent           | Well-studied indicators
Formicidae (ants)            | Excellent           | Colonial, sensitive to disturbance
Araneae (spiders)            | Good                | Predators, hunting guilds
Staphylinidae (rove beetles) | Good                | Diverse decomposers
Orthoptera (grasshoppers)    | Moderate            | Vegetation-dependent
Flying insects               | Poor                | Undersampled by pitfalls

3.8 Data Cleaning

3.8.1 Common Problems and Solutions

# Check for problems
unique(insect_data$habitat)  # Look for typos, case issues

# Standardize text
clean_data <- insect_data %>%
  mutate(
    habitat = tolower(trimws(habitat)),     # Lowercase, remove spaces
    habitat = case_when(                     # Fix typos
      habitat == "forrest" ~ "forest",
      habitat == "grasland" ~ "grassland",
      TRUE ~ habitat
    )
  )

# Handle impossible values
clean_data <- clean_data %>%
  filter(abundance >= 0) %>%                # Remove negative values
  filter(!is.na(abundance))                 # Remove missing values

3.9 Creating Analysis-Ready Data

3.9.1 Community Matrix

# For Coleoptera analysis
comm_matrix <- insect_data %>%
  filter(order == "Coleoptera") %>%
  group_by(site, morphospecies) %>%
  summarise(abundance = sum(abundance), .groups = "drop") %>%
  pivot_wider(
    names_from = morphospecies,
    values_from = abundance,
    values_fill = 0
  ) %>%
  column_to_rownames("site")

head(comm_matrix)

3.9.2 Environmental Data

# Create matching environmental data
env_data <- insect_data %>%
  select(site, habitat, landscape) %>%
  distinct() %>%
  column_to_rownames("site")

# Ensure same order as community matrix
env_data <- env_data[rownames(comm_matrix), , drop = FALSE]

# Verify alignment
all(rownames(comm_matrix) == rownames(env_data))  # Must be TRUE!
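
Here is a small sketch of why the reordering step matters, using two made-up tables whose sites are listed in different orders (`comm_toy` and `env_toy` are assumptions for illustration). It also swaps the printed check for stopifnot(), which halts a script when the alignment is wrong.

```r
comm_toy <- data.frame(
  sp1 = c(12, 18, 7),
  sp2 = c(5, 0, 3),
  row.names = c("S01", "S02", "S03")
)
env_toy <- data.frame(
  habitat = c("grassland", "forest", "forest"),
  row.names = c("S03", "S01", "S02")
)

# Sites are in a different order, so the two tables are not aligned yet
identical(rownames(comm_toy), rownames(env_toy))

# Reorder env_toy by site name; drop = FALSE keeps it a data frame
env_toy <- env_toy[rownames(comm_toy), , drop = FALSE]

# Stop with an error if the rows still do not match
stopifnot(identical(rownames(comm_toy), rownames(env_toy)))
env_toy
```

Misaligned rows produce no error in later analyses, only wrong results, so a hard stop here is a cheap safeguard.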

3.9.3 Save Processed Data

# Save for tomorrow
write.csv(comm_matrix, "data/processed/beetle_community_matrix.csv")
write.csv(env_data, "data/processed/environmental_data.csv")
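
Because write.csv() stores the site names as an unnamed first column, they need to be restored as row names when the file is read back. A minimal round-trip sketch on a made-up matrix and a temporary file (both assumptions for illustration):

```r
comm_toy <- data.frame(
  sp1 = c(12, 18),
  sp2 = c(5, 0),
  row.names = c("S01", "S02")
)

toy_path <- tempfile(fileext = ".csv")
write.csv(comm_toy, toy_path)

# row.names = 1 turns the first column back into row names;
# check.names = FALSE keeps species names exactly as written
comm_back <- read.csv(toy_path, row.names = 1, check.names = FALSE)

identical(rownames(comm_back), rownames(comm_toy))
comm_back
```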

3.9.4 💡 Practice Exercise (10 min)

  1. Evaluate which orders in your dataset meet the criteria for analysis
  2. Choose one focal order and create:
    • A community matrix (sites × species)
    • A matching environmental data frame
  3. Save both to data/processed/

END OF DAY 1

Summary

Today you learned:

  • ✅ Project organization and why it matters
  • ✅ Importing and exploring data
  • ✅ Data wrangling with dplyr
  • ✅ Choosing focal taxa
  • ✅ Creating community matrices

Homework (Optional)

  1. Complete any unfinished practice exercises
  2. Try the Day 1 Capstone Exercise (in the Exercises document)
  3. Review the concepts we covered

Tomorrow

  • Visualization with ggplot2
  • Diversity metrics
  • NMDS ordination
  • PERMANOVA

Happy coding and happy bug hunting!