# Run once; skip if already installed
install.packages(c("tidyverse", "palmerpenguins"))

library(tidyverse)
library(palmerpenguins)

glimpse() is the first thing you should run with any new
dataset. It shows every column name, its type, and sample values all at
once.

glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
# First 6 rows in table form
head(penguins)

# Min / max / mean for numerics; counts for factors
summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30
##                                  Mean   :43.92   Mean   :17.15
##                                  3rd Qu.:48.50   3rd Qu.:18.70
##                                  Max.   :59.60   Max.   :21.50
##                                  NA's   :2       NA's   :2
##  flipper_length_mm  body_mass_g       sex           year
##  Min.   :172.0      Min.   :2700   female:165   Min.   :2007
##  1st Qu.:190.0      1st Qu.:3550   male  :168   1st Qu.:2007
##  Median :197.0      Median :4050   NA's  : 11   Median :2008
##  Mean   :200.9      Mean   :4202                Mean   :2008
##  3rd Qu.:213.0      3rd Qu.:4750                3rd Qu.:2009
##  Max.   :231.0      Max.   :6300                Max.   :2009
##  NA's   :2          NA's   :2
# Count the missing values in every column
penguins |>
  summarize(across(everything(), ~ sum(is.na(.))))

What to notice before we move on:
- 344 rows × 8 columns: manageable
- species, island, and sex are <fct> (factors), not plain character strings. This matters for grouping and plotting.
- 11 NAs in sex, 2 NAs each in the bill, flipper, and mass columns. Real data is messy. We’ll handle these as we go.
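
Why the factor point matters can be sketched with a made-up vector (species_chr and species_fct are illustrative names, not part of the dataset). A factor remembers its full set of levels, so a level with no remaining rows still shows up in counts, while a character vector forgets it. This uses base R only.

```r
# Hypothetical toy data, not taken from penguins
species_chr <- c("Gentoo", "Adelie", "Gentoo", "Chinstrap")
species_fct <- factor(species_chr)

# Levels are stored once, sorted alphabetically by default
levels(species_fct)

# Drop the Chinstrap observation from both versions
kept_chr <- species_chr[species_chr != "Chinstrap"]
kept_fct <- species_fct[species_fct != "Chinstrap"]

# The character version has no memory of Chinstrap ...
counts_chr <- table(kept_chr)
# ... but the factor still reports it, with a count of zero
counts_fct <- table(kept_fct)

counts_chr
counts_fct
```

The same behaviour is why grouped summaries and plot legends can treat factor columns differently from character columns.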
filter() keeps rows that match a condition.
select() keeps (or drops) columns. Together they let you
carve out exactly the slice of data you need.
penguins |>
  filter(species == "Adelie")

A comma between conditions means both must be true
(equivalent to &).
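
To check that equivalence, here is a minimal sketch on a made-up tibble (birds and its values are hypothetical stand-ins for penguins). It filters once with a comma and once with &, then compares the results.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins
birds <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Adelie"),
  island  = c("Dream", "Biscoe", "Biscoe", "Dream")
)

with_comma <- birds |> filter(species == "Adelie", island == "Dream")
with_and   <- birds |> filter(species == "Adelie" & island == "Dream")

# Both spellings keep the same two rows (rows 1 and 4 of birds)
nrow(with_comma)
all.equal(with_comma, with_and)
```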
penguins |>
  filter(species == "Adelie", island == "Dream")

%in% is much cleaner than writing
species == "A" | species == "B".
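
As a rough illustration on made-up data (the birds tibble is hypothetical), the two spellings select the same rows inside filter(). They differ only in how they treat a missing value before filter() sees the result.

```r
library(dplyr)

# Hypothetical toy data with one missing species
birds <- tibble(species = c("Adelie", "Chinstrap", "Gentoo", NA, "Adelie"))

with_or <- birds |> filter(species == "Adelie" | species == "Chinstrap")
with_in <- birds |> filter(species %in% c("Adelie", "Chinstrap"))

nrow(with_or)
nrow(with_in)

# == propagates the missing value, %in% answers FALSE;
# filter() drops the row in both cases
eq_on_na <- NA == "Adelie"
in_on_na <- NA %in% c("Adelie", "Chinstrap")
eq_on_na
in_on_na
```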
penguins |>
  filter(species %in% c("Adelie", "Chinstrap"))

# The ! means NOT, so !is.na() means "keep rows where sex is NOT missing"
penguins |>
  filter(!is.na(sex))

penguins |>
  select(species, island, body_mass_g, sex)

# Prefix with - to drop
penguins |>
  select(-year, -island)

# starts_with(), ends_with(), contains() select by name pattern
penguins |>
  select(species, starts_with("bill"))

Build the pipe step by step. Read it top to bottom like a recipe.
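
Mechanically, the native pipe passes whatever is on its left as the first argument of the call on its right. A small base R sketch (R 4.1 or later is assumed for |>; the numbers are arbitrary) compares a nested call with its piped equivalent.

```r
# Nested call: read from the inside out
nested <- sqrt(sum(c(1, 4, 4)))

# Same computation with the native pipe: read top to bottom
piped <- c(1, 4, 4) |>
  sum() |>
  sqrt()

identical(nested, piped)
```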
penguins |>
  filter(species == "Gentoo", !is.na(sex)) |>
  select(species, island, body_mass_g, sex) |>
  arrange(desc(body_mass_g))

How many Adélie penguins were observed on
Torgersen island?
Then: select only the bill columns using
starts_with("bill").
# Your code here

mutate() adds new columns and leaves the existing ones in place
(reusing an existing name overwrites that column). You can reference
a column you just created within the same mutate() call.
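
A minimal sketch of that last point, on a made-up two-row tibble: mass_lb is a hypothetical column added only for illustration, and 2.20462 is the approximate number of pounds per kilogram.

```r
library(dplyr)

# Hypothetical toy data
birds <- tibble(body_mass_g = c(3750, 5000))

converted <- birds |>
  mutate(
    mass_kg = body_mass_g / 1000,   # created first ...
    mass_lb = mass_kg * 2.20462     # ... and reused on the very next line
  )

# The original column is still there, followed by the two new ones
names(converted)
```

Order matters inside the call: swapping the two lines would fail, because mass_kg would not exist yet when mass_lb is computed.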
penguins |>
  filter(!is.na(body_mass_g)) |>
  mutate(
    mass_kg = body_mass_g / 1000,
    bill_ratio = bill_length_mm / bill_depth_mm # how "pointed" is the bill?
  ) |>
  select(species, body_mass_g, mass_kg, bill_ratio)

penguins |>
  filter(!is.na(body_mass_g)) |>
  mutate(big_bird = body_mass_g > 4500) |>
  select(species, body_mass_g, big_bird)

case_when() is a vectorised if/else. Conditions are
checked top to bottom, and the first match wins.
TRUE at the end is the catch-all “everything else”.
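
The "first match wins" rule is easiest to see by getting the order wrong on purpose. This sketch uses a made-up vector of three masses; the thresholds mirror the ones used in this workshop.

```r
library(dplyr)

# Hypothetical masses in grams
mass <- c(3000, 4000, 5000)

# Narrowest condition first: each value lands in the intended bucket
ordered <- case_when(
  mass < 3500 ~ "small",
  mass < 4500 ~ "medium",
  TRUE        ~ "large"
)

# Conditions swapped: 3000 already satisfies mass < 4500,
# so the "small" branch is never reached
swapped <- case_when(
  mass < 4500 ~ "medium",
  mass < 3500 ~ "small",
  TRUE        ~ "large"
)

# A missing value fails every comparison and falls through to TRUE
missing_mass <- case_when(
  NA_real_ < 3500 ~ "small",
  TRUE            ~ "large"
)

ordered
swapped
missing_mass
```

Because a missing mass ends up in the catch-all branch, filtering out NAs before calling case_when() avoids mislabelled rows.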
penguins |>
  filter(!is.na(body_mass_g), !is.na(sex)) |>
  mutate(
    mass_kg = body_mass_g / 1000,
    big_bird = body_mass_g > 4500,
    size = case_when(
      body_mass_g < 3500 ~ "small",
      body_mass_g < 4500 ~ "medium",
      TRUE ~ "large" # TRUE = "everything else"
    )
  ) |>
  select(species, sex, mass_kg, big_bird, size) |>
  arrange(desc(mass_kg))

penguins |>
  filter(!is.na(body_mass_g)) |>
  mutate(mass_kg = body_mass_g / 1000) |>
  arrange(species, desc(mass_kg)) |> # sort by species A-Z, then heaviest first within each
  select(species, island, sex, mass_kg)

Add a column called long_flipper that is
TRUE when flipper_length_mm > 200.
Which species has the most long-flippered penguins?
# Your code here

This is the workhorse combination of data analysis.
group_by() invisibly splits the data into groups.
summarize() collapses each group down to one
summary row.
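
As a small sketch of the split-then-collapse idea, on a made-up five-row tibble (values are hypothetical): five input rows in two groups become two summary rows. It also shows that n() counts every row, while mean() needs na.rm = TRUE to skip a missing value.

```r
library(dplyr)

# Hypothetical toy data with one missing mass
birds <- tibble(
  species     = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"),
  body_mass_g = c(3500, 3700, 5000, 5200, NA)
)

by_species <- birds |>
  group_by(species) |>
  summarize(
    n         = n(),                              # rows per group, NAs included
    mean_mass = mean(body_mass_g, na.rm = TRUE)   # NAs skipped inside mean()
  )

by_species
```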
penguins |>
  filter(!is.na(body_mass_g)) |>
  group_by(species) |>
  summarize(
    n = n(),
    mean_mass = mean(body_mass_g),
    sd_mass = sd(body_mass_g),
    mean_bill = mean(bill_length_mm, na.rm = TRUE),
    pct_female = mean(sex == "female", na.rm = TRUE) # a proportion between 0 and 1
  )

Discussion: Which species is heaviest on average? Which has the most variation?
penguins |>
  filter(!is.na(body_mass_g)) |>
  group_by(species, island) |>
  summarize(
    n = n(),
    mean_mass = mean(body_mass_g),
    .groups = "drop" # always drop grouping after summarize to avoid surprises later
  ) |>
  arrange(desc(mean_mass))

Notice: Not every species–island combination appears. Some species only live on certain islands. That’s a real biological fact, not a data quality issue.
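
The .groups = "drop" argument in the pipeline above is worth a closer look. In this sketch on made-up data, "drop_last" is spelled out explicitly; as far as I know it is what summarize() does by default when each group yields one row, and it leaves the result still grouped by the first variable.

```r
library(dplyr)

# Hypothetical toy data
birds <- tibble(
  species     = c("Adelie", "Adelie", "Gentoo"),
  island      = c("Dream", "Biscoe", "Biscoe"),
  body_mass_g = c(3500, 3700, 5000)
)

# Only the last grouping variable (island) is peeled off
still_grouped <- birds |>
  group_by(species, island) |>
  summarize(mean_mass = mean(body_mass_g), .groups = "drop_last")

# Explicit drop: the result is a plain, ungrouped tibble
ungrouped <- birds |>
  group_by(species, island) |>
  summarize(mean_mass = mean(body_mass_g), .groups = "drop")

group_vars(still_grouped)
group_vars(ungrouped)
```

A leftover grouping silently changes what later verbs such as mutate() or slice() do, which is the kind of surprise the comment warns about.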
penguins |>
  filter(!is.na(body_mass_g), !is.na(sex)) |>
  group_by(species, sex) |>
  summarize(
    n = n(),
    mean_mass = mean(body_mass_g),
    min_mass = min(body_mass_g),
    max_mass = max(body_mass_g),
    .groups = "drop"
  ) |>
  arrange(species, sex)

count(x) is shorthand for
group_by(x) |> summarize(n = n()).
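
A quick way to convince yourself of the shorthand is to run both forms on a made-up tibble (hypothetical values) and compare the results.

```r
library(dplyr)

# Hypothetical toy data
birds <- tibble(species = c("Gentoo", "Adelie", "Gentoo", "Gentoo", "Adelie"))

long_form  <- birds |> group_by(species) |> summarize(n = n())
short_form <- birds |> count(species)

# Compare as plain data frames to ignore any class differences
all.equal(as.data.frame(long_form), as.data.frame(short_form))

# sort = TRUE puts the largest group first
sorted <- birds |> count(species, sort = TRUE)
sorted
```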
penguins |>
  count(species, island, sort = TRUE)

What is the average flipper length per species per
year?
Did it change over 2007–2009?
# Your code here

We want to compare bill_length_mm and
bill_depth_mm side by side across species.
ggplot2 needs the data in long format to
do this: one row per individual measurement. I won’t cover how to use
ggplot2 here beyond this example code, but other workshops
will.
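
The wide-to-long idea can be sketched on a made-up two-bird tibble before applying it to penguins. The id column is a hypothetical addition so that pivot_wider() can rebuild the original rows and show that no information is lost.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per bird, one column per measurement
wide <- tibble(
  id             = 1:2,
  bill_length_mm = c(39.1, 46.5),
  bill_depth_mm  = c(18.7, 17.9)
)

# Long: one row per bird per measurement
long <- wide |>
  pivot_longer(
    cols      = starts_with("bill"),
    names_to  = "measurement",
    values_to = "value_mm"
  )

# pivot_wider() reverses the reshape
wide_again <- long |>
  pivot_wider(names_from = measurement, values_from = value_mm)

dim(wide)   # 2 rows, 3 columns
dim(long)   # 4 rows, 3 columns
```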
Right now the data is wide: each bill measurement is its own column.

penguins |>
  filter(!is.na(bill_length_mm)) |>
  select(species, bill_length_mm, bill_depth_mm) |>
  head(6)

penguins_long <- penguins |>
  filter(!is.na(bill_length_mm), !is.na(bill_depth_mm)) |>
  select(species, sex, bill_length_mm, bill_depth_mm) |>
  pivot_longer(
    cols = starts_with("bill"), # which columns to stack
    names_to = "measurement", # new col: old column names go here
    values_to = "value_mm" # new col: values go here
  )
head(penguins_long, 10)

ggplot(penguins_long, aes(x = species, y = value_mm, fill = species)) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~ measurement, scales = "free_y") +
  labs(
    title = "Bill measurements by species",
    subtitle = "Length and depth tell different stories",
    x = NULL,
    y = "mm"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

What to look for:
Chinstrap and Adélie have very similar bill depth but very different bill length. Gentoo is clearly separated on depth. This is why both dimensions matter for species identification.
Here’s everything we covered today in one connected chain — no intermediate objects, reads top to bottom.
penguins |>
  # 1. Filter: complete cases, two species only
  filter(
    !is.na(body_mass_g),
    !is.na(sex),
    species %in% c("Adelie", "Gentoo")
  ) |>
  # 2. Mutate: derived columns
  mutate(
    mass_kg = body_mass_g / 1000,
    size = case_when(
      body_mass_g < 3500 ~ "small",
      body_mass_g < 4500 ~ "medium",
      TRUE ~ "large"
    )
  ) |>
  # 3. Group and summarize
  group_by(species, sex, size) |>
  summarize(
    n = n(),
    mean_mass = mean(mass_kg),
    .groups = "drop"
  ) |>
  # 4. Sort to read the pattern
  arrange(species, sex, desc(mean_mass))

If you finish early or want to practice at home:
1. Yearly trends
Has the average body mass of each species changed across 2007, 2008, and
2009?
Try group_by(species, year) |> summarize(...) and then a
line plot with geom_line().
2. Bill ratio by sex
Calculate bill_ratio = bill_length_mm / bill_depth_mm. Do
males and females differ on this ratio within the same
species?
3. Island breakdown
Which island has the heaviest penguins on average? Does the answer
change when you control for species?