In spreadsheet software like Excel, Pivot Tables are
the go-to tool for summarizing large datasets. In R, we achieve the same
(and much more) using the tidyverse ecosystem, specifically
the dplyr package for aggregation and the
tidyr package for reshaping.
We will use the penguins dataset. It contains
measurements for penguin species, islands in the Palmer Archipelago,
size (flipper length, body mass), and sex.
# Load and view the first few rows
data("penguins")
head(penguins) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover"))| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
| Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
| Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
| Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
| Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
| Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
In R, a pivot table summary is created by combining
group_by() and summarize().
Let’s find the average body mass and count of penguins per species.
pivot_summary <- penguins %>%
filter(!is.na(sex)) %>% # Remove rows with missing sex
group_by(species, island) %>%
summarize(
avg_mass_g = mean(body_mass_g, na.rm = TRUE),
count = n()
) %>%
ungroup()
pivot_summary %>%
kable(caption = "Summary Pivot Table") %>%
kable_styling(full_width = F)| species | island | avg_mass_g | count |
|---|---|---|---|
| Adelie | Biscoe | 3709.659 | 44 |
| Adelie | Dream | 3701.364 | 55 |
| Adelie | Torgersen | 3708.511 | 47 |
| Chinstrap | Dream | 3733.088 | 68 |
| Gentoo | Biscoe | 5092.437 | 119 |
Pivot tables are often precursors to visualizations.
ggplot(pivot_summary, aes(x = species, y = avg_mass_g, fill = island)) +
geom_bar(stat = "identity", position = "dodge") +
theme_minimal() +
labs(title = "Average Penguin Body Mass", y = "Mass (g)", x = "Species")Figure 1: Average Body Mass by Species and Island
Often, “pivoting” refers to changing the layout of the data from Long to Wide or vice-versa.
pivot_wider(): Making it ReadableUse pivot_wider() when you want to transform categories
into columns. This is common for creating comparison reports.
Task: Create a table where species are rows, islands are columns, and the values are the average body mass.
wide_table <- pivot_summary %>%
select(species, island, avg_mass_g) %>%
pivot_wider(
names_from = island,
values_from = avg_mass_g
)
wide_table %>%
kable(caption = "Wide Format Table (Human Readable)") %>%
kable_styling(full_width = F)| species | Biscoe | Dream | Torgersen |
|---|---|---|---|
| Adelie | 3709.659 | 3701.364 | 3708.511 |
| Chinstrap | NA | 3733.088 | NA |
| Gentoo | 5092.437 | NA | NA |
pivot_longer(): Making it Analysis-ReadyMost R functions (like ggplot2) require Long
Format data. Let’s assume we have data where “Year” is spread
across columns and we need to collapse it.
Example:
# Creating a dummy 'wide' dataset
untidy_data <- data.frame(
Country = c("USA", "Canada", "Mexico"),
`2020` = c(100, 150, 120),
`2021` = c(120, 160, 130),
check.names = FALSE
)
# Pivoting Longer
tidy_data <- untidy_data %>%
pivot_longer(
cols = c(`2020`, `2021`),
names_to = "Year",
values_to = "GDP_Growth"
)
tidy_data %>% kable() %>% kable_styling(full_width = F)| Country | Year | GDP_Growth |
|---|---|---|
| USA | 2020 | 100 |
| USA | 2021 | 120 |
| Canada | 2020 | 150 |
| Canada | 2021 | 160 |
| Mexico | 2020 | 120 |
| Mexico | 2021 | 130 |
You can pivot multiple value columns at once. Let’s get both the
mean and the max body mass into a wide
format.
complex_pivot <- penguins %>%
filter(!is.na(sex)) %>%
group_by(species, sex) %>%
summarize(
avg_mass = mean(body_mass_g, na.rm = TRUE),
max_mass = max(body_mass_g, na.rm = TRUE)
) %>%
pivot_wider(
names_from = sex,
values_from = c(avg_mass, max_mass)
)
complex_pivot %>% kable() %>% kable_styling(bootstrap_options = "condensed")| species | avg_mass_female | avg_mass_male | max_mass_female | max_mass_male |
|---|---|---|---|---|
| Adelie | 3368.836 | 4043.493 | 3900 | 4775 |
| Chinstrap | 3527.206 | 3938.971 | 4150 | 4800 |
| Gentoo | 4679.741 | 5484.836 | 5200 | 6300 |
| Goal | Function |
|---|---|
| Group Data | group_by(column_name) |
| Calculate Stats | summarize(new_var = mean(value)) |
| Rows to Columns | pivot_wider(names_from = ..., values_from = ...) |
| Columns to Rows | pivot_longer(cols = ..., names_to = ..., values_to = ...) |
penguins dataset, create a pivot table
showing the median flipper length for each
species and sex.NA values.sex (male/female) are the
column headers.# Try this yourself!
penguins %>%
filter(!is.na(sex)) %>%
group_by(species, sex) %>%
summarize(med_flipper = median(flipper_length_mm)) %>%
pivot_wider(names_from = sex, values_from = med_flipper)End of Lecture Notes. ```
Pivot_Table_Lecture.Rmd.palmerpenguins
dataset, which is a modern replacement for the Iris dataset.ggplot2 bar
chart to show how pivot results are used in plotting.kable and
kableExtra to make the tables look like professional
lecture handouts.