1 — Setup

Install and load packages

# Run once — skip if already installed
install.packages(c("tidyverse", "palmerpenguins"))
library(tidyverse)
library(palmerpenguins)
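
If you would rather not think about whether the packages are already installed, one option is to install only what is missing. This is a minimal sketch, assuming the same two packages as above; needed and missing_pkgs are illustrative names, not part of the workshop code.

```r
# Install only the packages that are not already available on this machine
needed <- c("tidyverse", "palmerpenguins")

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise
missing_pkgs <- needed[!vapply(needed, requireNamespace, logical(1), quietly = TRUE)]

if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

After this, the two library() calls above work the same way on a fresh or an existing setup.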

Get to know the data

glimpse() is the first thing you should run with any new dataset. It shows every column name, its type, and sample values all at once.

glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
# First 6 rows in table form
head(penguins)
# Min / max / mean for numerics; counts for factors
summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2

Check for missing values

# Count the NAs in every column (one row out, one count per column)
penguins |>
  summarize(across(everything(), ~ sum(is.na(.x))))

What to notice before we move on:

  • 344 rows × 8 columns — manageable
  • species, island, sex are <fct> (factors) — not plain character strings. This matters for grouping and plotting.
  • 11 NAs in sex, 2 NAs each in the bill, flipper, and mass columns — real data is messy. We’ll handle these as we go.
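
The per-column counts above do not say how many rows are affected, because one penguin can be missing several values at once. A small sketch of one way to check, using if_any(); the object name incomplete is just for illustration.

```r
library(tidyverse)
library(palmerpenguins)

# Keep the rows that have a missing value in at least one column
incomplete <- penguins |>
  filter(if_any(everything(), is.na))

nrow(incomplete)                     # rows with any NA
nrow(penguins) - nrow(incomplete)    # fully complete rows
```

If the first number is smaller than the sum of the per-column NA counts, some rows are missing more than one value.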

2 — filter() & select()

filter() keeps rows that match a condition. select() keeps (or drops) columns. Together they let you carve out exactly the slice of data you need.

Single condition

penguins |>
  filter(species == "Adelie")

Multiple conditions — AND

A comma between conditions means both must be true (equivalent to &).

penguins |>
  filter(species == "Adelie", island == "Dream")
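
To convince yourself that the comma and & really do the same thing, you can run both and compare. The sketch below also shows |, which is the only way to say OR, since a comma always means AND. The object names are illustrative.

```r
library(tidyverse)
library(palmerpenguins)

# AND, written two ways: both should return the same rows
with_comma <- penguins |> filter(species == "Adelie", island == "Dream")
with_amp   <- penguins |> filter(species == "Adelie" & island == "Dream")
all.equal(with_comma, with_amp)

# OR needs an explicit |: Adelie penguins anywhere, plus any penguin on Dream
either <- penguins |> filter(species == "Adelie" | island == "Dream")
nrow(either)
```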

Matching a list with %in%

Much cleaner than writing species == "A" | species == "B".

penguins |>
  filter(species %in% c("Adelie", "Chinstrap"))
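
A common slip is to write == where %in% was meant. R does not complain here: it recycles the two-element vector down the rows, comparing row 1 with "Adelie", row 2 with "Chinstrap", row 3 with "Adelie", and so on. A short sketch of the difference, with illustrative object names:

```r
library(tidyverse)
library(palmerpenguins)

# Correct: keep a row if its species is anywhere in the list
right <- penguins |> filter(species %in% c("Adelie", "Chinstrap"))

# Silent bug: == recycles the vector, so rows are only checked against
# whichever name happens to line up with their position
wrong <- penguins |> filter(species == c("Adelie", "Chinstrap"))

nrow(right)
nrow(wrong)   # fewer rows than `right`, with no warning
```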

Dropping NAs

# The ! means NOT — so !is.na() means "keep rows where sex is NOT missing"
penguins |>
  filter(!is.na(sex))
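
Two related points are worth a quick sketch. First, testing with == NA never works, because any comparison with NA is itself NA, and filter() drops rows where the condition is NA. Second, tidyr (loaded with the tidyverse) offers drop_na() as another way to write the same filter. Object names below are illustrative.

```r
library(tidyverse)
library(palmerpenguins)

# == NA matches nothing, so this returns zero rows
via_equals <- penguins |> filter(sex == NA)
nrow(via_equals)

# Two ways to keep only rows where sex is known
via_filter  <- penguins |> filter(!is.na(sex))
via_drop_na <- penguins |> drop_na(sex)
nrow(via_filter) == nrow(via_drop_na)
```

Calling drop_na() with no column names removes rows with an NA in any column, which is often more than you want.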

select() — keep or drop columns

penguins |>
  select(species, island, body_mass_g, sex)
# Prefix with - to drop
penguins |>
  select(-year, -island)
# starts_with(), ends_with(), contains() select by name pattern
penguins |>
  select(species, starts_with("bill"))
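
Besides matching on names, select() can also pick columns by type with where(). This is a small sketch, not something the rest of the workshop depends on; numeric_cols is an illustrative name.

```r
library(tidyverse)
library(palmerpenguins)

# Keep species plus every numeric column (doubles and integers both count)
numeric_cols <- penguins |>
  select(species, where(is.numeric)) |>
  names()

numeric_cols
```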

Putting it together — a full pipeline

Build the pipe step by step. Read it top to bottom like a recipe.

penguins |>
  filter(species == "Gentoo", !is.na(sex)) |>
  select(species, island, body_mass_g, sex) |>
  arrange(desc(body_mass_g))   # heaviest first; arrange() is covered in section 3

Try it

How many Adélie penguins were observed on Torgersen island? (The data spells the species "Adelie", without the accent, so filter on that.)
Then: select only the bill columns using starts_with("bill").

# Your code here

3 — mutate() & arrange()

mutate() adds new columns and, by default, keeps all the existing ones (reusing an existing column name overwrites that column). You can reference a column you just created within the same mutate() call.
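
Here is a minimal sketch of using a freshly created column straight away. The mass_lb column and the converted object are made up for illustration, and 2.20462 is the approximate number of pounds in a kilogram.

```r
library(tidyverse)
library(palmerpenguins)

converted <- penguins |>
  filter(!is.na(body_mass_g)) |>
  mutate(
    mass_kg = body_mass_g / 1000,
    mass_lb = mass_kg * 2.20462   # uses mass_kg, created one line earlier
  ) |>
  select(species, body_mass_g, mass_kg, mass_lb)

head(converted)
```

The order inside mutate() matters: swapping the two lines would fail, because mass_kg would not exist yet.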

Unit conversion and derived variables

penguins |>
  filter(!is.na(body_mass_g)) |>
  mutate(
    mass_kg    = body_mass_g / 1000,
    bill_ratio = bill_length_mm / bill_depth_mm  # how "pointed" is the bill?
  ) |>
  select(species, body_mass_g, mass_kg, bill_ratio)

Logical flags

penguins |>
  filter(!is.na(body_mass_g)) |>
  mutate(big_bird = body_mass_g > 4500) |>
  select(species, body_mass_g, big_bird)

Multi-level categories with case_when()

case_when() is a vectorised if/else. Conditions are checked top to bottom, and the first match wins. TRUE at the end is the catch-all “everything else”; in dplyr 1.1.0 and later, the .default argument does the same job.

penguins |>
  filter(!is.na(body_mass_g), !is.na(sex)) |>
  mutate(
    mass_kg  = body_mass_g / 1000,
    big_bird = body_mass_g > 4500,
    size     = case_when(
      body_mass_g < 3500 ~ "small",
      body_mass_g < 4500 ~ "medium",
      TRUE               ~ "large"    # TRUE = "everything else"
    )
  ) |>
  select(species, sex, mass_kg, big_bird, size) |>
  arrange(desc(mass_kg))
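
The pipeline above filters out missing masses before calling case_when(), and that step matters. A comparison with NA is never TRUE, so a missing value falls all the way through to the catch-all. The sketch below uses a tiny made-up vector, mass, to show the effect and one way to guard against it.

```r
library(dplyr)

# Toy data, not from the penguins dataset
mass <- c(3000, 4000, 5000, NA)

# Without a guard, the NA lands in the catch-all and is labelled "large"
naive <- case_when(
  mass < 3500 ~ "small",
  mass < 4500 ~ "medium",
  TRUE        ~ "large"
)
naive

# Handling NA first keeps the missing value missing
guarded <- case_when(
  is.na(mass) ~ NA_character_,
  mass < 3500 ~ "small",
  mass < 4500 ~ "medium",
  TRUE        ~ "large"
)
guarded
```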

arrange() — sorting rows

penguins |>
  filter(!is.na(body_mass_g)) |>
  mutate(mass_kg = body_mass_g / 1000) |>
  arrange(species, desc(mass_kg)) |>   # sort by species A-Z, then heaviest first within each
  select(species, island, sex, mass_kg)
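
The examples here filter out missing masses first, but it helps to know what arrange() does if you forget: missing values are sorted to the end whether you sort ascending or descending. A quick sketch, with illustrative object names:

```r
library(tidyverse)
library(palmerpenguins)

heaviest_first <- penguins |> arrange(desc(body_mass_g))
lightest_first <- penguins |> arrange(body_mass_g)

# The two rows with a missing mass sit at the bottom in both cases
tail(heaviest_first$body_mass_g, 2)
tail(lightest_first$body_mass_g, 2)
```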

Try it

Add a column called long_flipper that is TRUE when flipper_length_mm > 200.
Which species has the most long-flippered penguins?

# Your code here

4 — group_by() + summarize()

This is the workhorse combination of data analysis. group_by() invisibly splits the data into groups. summarize() collapses each group down to one summary row.

Summary stats by species

penguins |>
  filter(!is.na(body_mass_g)) |>
  group_by(species) |>
  summarize(
    n          = n(),
    mean_mass  = mean(body_mass_g),
    sd_mass    = sd(body_mass_g),
    mean_bill  = mean(bill_length_mm, na.rm = TRUE),
    pct_female = 100 * mean(sex == "female", na.rm = TRUE)  # % female among birds with known sex
  )

Discussion: Which species is heaviest on average? Which has the most variation?
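
The summary above handles missing values in two ways: a filter() for body mass and na.rm = TRUE for the other columns. To see why one of them is needed, here is a sketch of what happens without either. A single NA in a group turns that group's mean into NA. The column and object names are illustrative.

```r
library(tidyverse)
library(palmerpenguins)

mass_by_species <- penguins |>
  group_by(species) |>
  summarize(
    mean_no_rm = mean(body_mass_g),                # NA if the group has any NA
    mean_rm    = mean(body_mass_g, na.rm = TRUE)   # NAs ignored
  )

mass_by_species
```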

Group by two variables — species × island

penguins |>
  filter(!is.na(body_mass_g)) |>
  group_by(species, island) |>
  summarize(
    n         = n(),
    mean_mass = mean(body_mass_g),
    .groups   = "drop"   # always drop grouping after summarize — avoids surprises later
  ) |>
  arrange(desc(mean_mass))

Notice: Not every species–island combination appears. In this dataset Gentoo were observed only on Biscoe and Chinstrap only on Dream, while Adelie appear on all three islands. That reflects where the birds were found, not a data quality issue.
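
The .groups = "drop" comment in the code above is easier to appreciate once you see the default. When you group by two variables, summarize() removes only the last one, so the result is still grouped by the first. A small sketch using group_vars() to inspect this; the object names are illustrative.

```r
library(tidyverse)
library(palmerpenguins)

# Default behaviour: the result stays grouped by species
# (dplyr also prints a message saying so)
still_grouped <- penguins |>
  filter(!is.na(body_mass_g)) |>
  group_by(species, island) |>
  summarize(mean_mass = mean(body_mass_g))

group_vars(still_grouped)

# With .groups = "drop" the result is a plain, ungrouped tibble
dropped <- penguins |>
  filter(!is.na(body_mass_g)) |>
  group_by(species, island) |>
  summarize(mean_mass = mean(body_mass_g), .groups = "drop")

group_vars(dropped)
```

Leftover grouping changes how later steps such as mutate() or slice() behave, which is the kind of surprise the comment refers to.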

Species × sex — who’s heaviest?

penguins |>
  filter(!is.na(body_mass_g), !is.na(sex)) |>
  group_by(species, sex) |>
  summarize(
    n         = n(),
    mean_mass = mean(body_mass_g),
    min_mass  = min(body_mass_g),
    max_mass  = max(body_mass_g),
    .groups   = "drop"
  ) |>
  arrange(species, sex)

count() — a handy shortcut

count(x) is shorthand for group_by(x) |> summarize(n = n()).

penguins |>
  count(species, island, sort = TRUE)
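
To check the shorthand claim for yourself, you can build the same table both ways and compare. A minimal sketch with illustrative object names:

```r
library(tidyverse)
library(palmerpenguins)

# The long way
by_hand <- penguins |>
  group_by(species) |>
  summarize(n = n())

# The shortcut
shortcut <- penguins |>
  count(species)

by_hand
shortcut
```

Both results have one row per species and an n column with the same counts.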

Try it

What is the average flipper length per species per year?
Did it change over 2007–2009?

# Your code here

5 — pivot_longer() + ggplot2

We want to compare bill_length_mm and bill_depth_mm side by side across species. To draw both in one faceted plot, ggplot2 needs the data in long format: one row per individual measurement. I won’t cover ggplot2 beyond this example, but other workshops will.

Why we need to reshape

Right now the data is wide — each bill measurement is its own column.

penguins |>
  filter(!is.na(bill_length_mm)) |>
  select(species, bill_length_mm, bill_depth_mm) |>
  head(6)

pivot_longer() — wide to long

penguins_long <- penguins |>
  filter(!is.na(bill_length_mm), !is.na(bill_depth_mm)) |>
  select(species, sex, bill_length_mm, bill_depth_mm) |>
  pivot_longer(
    cols      = starts_with("bill"),  # which columns to stack
    names_to  = "measurement",        # new col: old column names go here
    values_to = "value_mm"            # new col: values go here
  )

head(penguins_long, 10)
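
Two things are worth checking after a reshape: the long table should have one row per measurement (so twice as many rows here), and nothing should be lost along the way. The sketch below adds a row identifier so the long data can be turned back into wide form with pivot_wider(). The penguin_id column and the bills objects are assumptions for illustration and are not used elsewhere in the workshop.

```r
library(tidyverse)
library(palmerpenguins)

# Wide data with an explicit id, so each penguin can be reassembled later
bills <- penguins |>
  filter(!is.na(bill_length_mm), !is.na(bill_depth_mm)) |>
  mutate(penguin_id = row_number()) |>
  select(penguin_id, species, bill_length_mm, bill_depth_mm)

# Wide to long: two rows per penguin
bills_long <- bills |>
  pivot_longer(
    cols      = starts_with("bill"),
    names_to  = "measurement",
    values_to = "value_mm"
  )

nrow(bills)
nrow(bills_long)   # twice as many rows as `bills`

# Long back to wide: the reverse operation
bills_wide <- bills_long |>
  pivot_wider(names_from = measurement, values_from = value_mm)

all.equal(bills, bills_wide)
```

Without an id column, pivot_wider() cannot tell which length belongs with which depth, because several penguins share the same species.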

Plot the long data with ggplot2

ggplot(penguins_long, aes(x = species, y = value_mm, fill = species)) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~ measurement, scales = "free_y") +
  labs(
    title    = "Bill measurements by species",
    subtitle = "Length and depth tell different stories",
    x        = NULL,
    y        = "mm"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

What to look for:
Chinstrap and Adélie have very similar bill depth but very different bill length. Gentoo is clearly separated on depth. This is why both dimensions matter for species identification.


6 — The Full Pipeline

Here’s everything we covered today in one connected chain — no intermediate objects, reads top to bottom.

penguins |>
  # 1. Filter — complete cases, two species only
  filter(
    !is.na(body_mass_g),
    !is.na(sex),
    species %in% c("Adelie", "Gentoo")
  ) |>
  # 2. Mutate — derived columns
  mutate(
    mass_kg = body_mass_g / 1000,
    size    = case_when(
      body_mass_g < 3500 ~ "small",
      body_mass_g < 4500 ~ "medium",
      TRUE               ~ "large"
    )
  ) |>
  # 3. Group and summarize
  group_by(species, sex, size) |>
  summarize(
    n         = n(),
    mean_mass = mean(mass_kg),
    .groups   = "drop"
  ) |>
  # 4. Sort to read the pattern
  arrange(species, sex, desc(mean_mass))

Bonus Challenges

If you finish early or want to practice at home:

1. Yearly trends
Has the average body mass of each species changed across 2007, 2008, and 2009?
Try group_by(species, year) |> summarize(...) and then a line plot with geom_line().

2. Bill ratio by sex
Calculate bill_ratio = bill_length_mm / bill_depth_mm. Do males and females differ on this ratio within the same species?

3. Island breakdown
Which island has the heaviest penguins on average? Does the answer change when you control for species?