Lesson 03: Tidy Data Principles

Data Preparation — tidyplots Course

Author: tidyplots Course

Published: May 7, 2026

1 What is Tidy Data?

Tidy data has three rules (Wickham, 2014):

  1. Each variable is a column
  2. Each observation is a row
  3. Each value is a cell
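
As a small illustration of rule 3, the sketch below starts from a hypothetical table (`untidy`, invented for this example) in which one cell holds two values, and splits it into two columns. It assumes tidyr 1.3.0 or later, where `separate_wider_delim()` is available.

```r
library(tidyr)
library(dplyr)

# Hypothetical untidy table: "bill" packs length and depth into one cell.
untidy <- tibble(
  id   = 1:3,
  bill = c("39.1/18.7", "39.5/17.4", "40.3/18.0")
)

tidy <- untidy |>
  # Split the cell at "/" into one column per value.
  separate_wider_delim(
    bill,
    delim = "/",
    names = c("bill_length_mm", "bill_depth_mm")
  ) |>
  # The split pieces are character strings, so convert them to numbers.
  mutate(across(c(bill_length_mm, bill_depth_mm), as.numeric))

tidy
```

After the split, each of the two measurements can be mapped to its own plot axis.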

Why It Matters for Plotting

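tidyplots is built on ggplot2 and, like most tidyverse tools, expects tidy data: every variable you want to map to x, y, or color must be a single column. Wide data usually has to be pivoted to long format before plotting.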

2 The palmerpenguins Dataset

library(tidyplots)
library(palmerpenguins)
library(dplyr)
library(tidyr)

head(penguins)
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

3 Inspecting the Data

dplyr::glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

4 Wide vs Long Format

# Wide — each measurement is its own column
wide_eg <- data.frame(
  id = 1:3,
  before = c(10, 12, 9),
  after  = c(14, 15, 13)
)
wide_eg
  id before after
1  1     10    14
2  2     12    15
3  3      9    13
# Long: one column for the time point, one for the score
long_eg <- wide_eg |>
  pivot_longer(cols = c(before, after),
               names_to  = "time",
               values_to = "score")
long_eg
# A tibble: 6 × 3
     id time   score
  <int> <chr>  <dbl>
1     1 before    10
2     1 after     14
3     2 before    12
4     2 after     15
5     3 before     9
6     3 after     13

5 pivot_longer()

long_eg |>
  tidyplot(x = time, y = score, color = time) |>
  add_mean_bar(alpha = 0.6) |>
  add_data_points()

After pivoting to long format, time and score map directly to x and y
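
The same verb can be applied to the penguins data. The sketch below gathers the three length measurements into name/value pairs; the column names `measurement` and `value_mm` are chosen for this example and are not part of the dataset.

```r
library(palmerpenguins)
library(tidyr)

# Gather every column whose name ends in "_mm" (bill length, bill depth,
# flipper length) into one name column and one value column.
penguins_long <- penguins |>
  pivot_longer(
    cols      = ends_with("_mm"),
    names_to  = "measurement",   # hypothetical column name
    values_to = "value_mm"       # hypothetical column name
  )

nrow(penguins_long)                 # 3 measurement columns x 344 penguins
unique(penguins_long$measurement)
```

In this layout `measurement` is an ordinary variable, so it could be used for an axis, a color, or a panel split in the same way as `species` or `island`.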

6 pivot_wider()

long_eg |>
  pivot_wider(names_from  = time,
              values_from = score)
# A tibble: 3 × 3
     id before after
  <int>  <dbl> <dbl>
1     1     10    14
2     2     12    15
3     3      9    13
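
Because `pivot_wider()` reverses `pivot_longer()` here, a round trip is a quick sanity check that no values were lost or duplicated along the way. A minimal sketch, rebuilding `wide_eg` so that it runs on its own:

```r
library(tidyr)

wide_eg <- data.frame(
  id     = 1:3,
  before = c(10, 12, 9),
  after  = c(14, 15, 13)
)

# Wide -> long -> wide again.
round_trip <- wide_eg |>
  pivot_longer(cols = c(before, after),
               names_to  = "time",
               values_to = "score") |>
  pivot_wider(names_from  = time,
              values_from = score)

# pivot_wider() returns a tibble, so compare as plain data frames.
all.equal(as.data.frame(round_trip), wide_eg)
```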

7 Handling Missing Values

sum(is.na(penguins))       # total NAs
[1] 19
colSums(is.na(penguins))   # NAs per column
          species            island    bill_length_mm     bill_depth_mm 
                0                 0                 2                 2 
flipper_length_mm       body_mass_g               sex              year 
                2                 2                11                 0 
penguins_clean <- penguins |> tidyr::drop_na()
nrow(penguins_clean)
[1] 333
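
Calling `drop_na()` with no arguments removes every row that has a missing value in any column. When only some columns matter for a plot, a narrower approach may keep more data. The sketch below shows two options; the object name `mass_complete` is chosen for this example.

```r
library(palmerpenguins)
library(tidyr)

# Drop only the rows where body_mass_g is missing. Rows with a missing
# sex but a recorded body mass are kept.
mass_complete <- penguins |> drop_na(body_mass_g)
nrow(mass_complete)

# Alternatively, keep every row and skip NAs inside the calculation.
mean(penguins$body_mass_g, na.rm = TRUE)
```

Given the per-column counts above, only the 2 rows without a body mass are removed here, compared with 11 rows for the unrestricted `drop_na()`.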

8 Filtering with dplyr

gentoo <- penguins_clean |> filter(species == "Gentoo")
nrow(gentoo)
[1] 119
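
`filter()` keeps only the rows where the condition is `TRUE`, so rows where the condition evaluates to `NA` are dropped without a warning. The sketch below uses the uncleaned `penguins` table to make this visible; the object names are chosen for this example.

```r
library(palmerpenguins)
library(dplyr)

males   <- penguins |> filter(sex == "male")
females <- penguins |> filter(sex == "female")

# The two subsets do not add up to the full table: rows with a missing
# sex appear in neither of them.
nrow(penguins) - (nrow(males) + nrow(females))

# To keep rows with a missing value, ask for them explicitly.
unknown_sex <- penguins |> filter(is.na(sex))
nrow(unknown_sex)
```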

9 Mutating New Columns

penguins_clean <- penguins_clean |>
  mutate(bill_ratio = bill_length_mm / bill_depth_mm)
head(penguins_clean$bill_ratio, 5)
[1] 2.090909 2.270115 2.238889 1.901554 1.907767

10 Grouping and Summarising

penguins_clean |>
  group_by(species) |>
  summarise(mean_mass = mean(body_mass_g),
            sd_mass   = sd(body_mass_g),
            n         = n())
# A tibble: 3 × 4
  species   mean_mass sd_mass     n
  <fct>         <dbl>   <dbl> <int>
1 Adelie        3706.    459.   146
2 Chinstrap     3733.    384.    68
3 Gentoo        5092.    501.   119

11 Plotting Tidy Penguins Data

penguins_clean |>
  tidyplot(x = bill_length_mm, y = bill_depth_mm, color = species) |>
  add_data_points(size = 1.5, alpha = 0.7)

Bill length vs depth by species

12 Facets from Tidy Data

penguins_clean |>
  tidyplot(x = bill_length_mm, y = body_mass_g, color = species) |>
  add_data_points(size = 1) |>
  split_plot(by = island)

Faceted by island, which works because island is a single column in the tidy data

13 Renaming and Recoding

penguins_clean |>
  mutate(size_class = case_when(
    body_mass_g > 4500 ~ "Large",
    body_mass_g > 3500 ~ "Medium",
    TRUE               ~ "Small"
  )) |>
  count(size_class)
# A tibble: 3 × 2
  size_class     n
  <chr>      <int>
1 Large        112
2 Medium       146
3 Small         75
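
The example above recodes a variable but does not rename one. The sketch below adds a `rename()` step and stores the recoded variable as a factor with an explicit level order; the names `mass_g` and `penguins_sized` are chosen for this example.

```r
library(palmerpenguins)
library(dplyr)
library(tidyr)

penguins_sized <- penguins |>
  drop_na() |>
  # rename() takes new_name = old_name.
  rename(mass_g = body_mass_g) |>
  mutate(
    size_class = case_when(
      mass_g > 4500 ~ "Large",
      mass_g > 3500 ~ "Medium",
      TRUE          ~ "Small"
    ),
    # Set the level order to Small < Medium < Large instead of alphabetical.
    size_class = factor(size_class, levels = c("Small", "Medium", "Large"))
  )

levels(penguins_sized$size_class)
```

With a factor, `count(size_class)` lists the classes in level order, and plotting tools built on ggplot2 generally follow the same order on a discrete axis.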

14 Key Take-aways

Lesson 03 Summary
  1. Tidy data: each variable = column, each observation = row, each value = cell
  2. Use pivot_longer() to convert wide → long
  3. drop_na() removes incomplete rows before plotting
  4. dplyr verbs (filter, mutate, summarise) prepare tidy data

15 Practice Exercise

penguins_clean |>
  tidyplot(x = sex, y = body_mass_g, color = species) |>
  add_mean_bar(alpha = 0.5) |>
  add_sem_errorbar() |>
  add_data_points_jitter(size = 1)

Practice: mean body mass by sex and species, with SEM error bars and jittered points
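
To check the exercise numerically, the group means and error-bar heights can be computed with dplyr. This is a sketch that assumes the usual definition of the standard error of the mean, sd / sqrt(n); it is not taken from the tidyplots source, so small differences from the plotted bars are possible. The name `mass_summary` is chosen for this example.

```r
library(palmerpenguins)
library(dplyr)
library(tidyr)

mass_summary <- penguins |>
  drop_na() |>
  group_by(species, sex) |>
  summarise(
    mean_mass = mean(body_mass_g),
    sem_mass  = sd(body_mass_g) / sqrt(n()),  # standard error of the mean
    n         = n(),
    .groups   = "drop"                        # return an ungrouped tibble
  )

mass_summary
```

The result has one row per species and sex combination, matching the bars in the practice plot.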