In spreadsheet software like Excel, a Pivot Table is a tool used to summarize, sort, reorganize, group, count, total, or average data stored in a table.
In R, “pivoting” refers to two distinct but related tasks: 1. Reshaping: Moving data between “Wide” and “Long” formats. 2. Aggregating: Summarizing data (Calculating means, sums, etc.) across categories.
We primarily use the tidyr and dplyr
packages (part of the tidyverse).
Before we summarize data, we must understand the shape of it.
Each subject has one row, and variables are spread across multiple columns. (Great for data entry).
Each row is a single observation. (Great for analysis and plotting).
pivot_longer)Use this when you have column headers that are actually values, not variable names.
We will use a built-in dataset of song rankings.
## # A tibble: 6 × 10
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby Don'… 2000-02-26 87 82 72 77 87 94 99
## 2 2Ge+her The Harde… 2000-09-02 91 87 92 NA NA NA NA
## 3 3 Doors Down Kryptonite 2000-04-08 81 70 68 67 66 57 54
## 4 3 Doors Down Loser 2000-10-21 76 76 72 69 67 65 55
## 5 504 Boyz Wobble Wo… 2000-04-15 57 34 25 17 17 31 36
## 6 98^0 Give Me J… 2000-08-19 51 39 34 26 26 19 2
The columns wk1, wk2, etc., are “values” of
time. To make this “Tidy”, we pivot:
billboard_long <- billboard %>%
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
)
head(billboard_long)## # A tibble: 6 × 5
## artist track date.entered week rank
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
pivot_wider)Use this to create a summary table or a “report-style” view where categories become column headers.
# Let's take the long data and turn it back to wide
billboard_wide_again <- billboard_long %>%
pivot_wider(
names_from = week,
values_from = rank
)
head(billboard_wide_again[, 1:7])## # A tibble: 6 × 7
## artist track date.entered wk1 wk2 wk3 wk4
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 87 82 72 77
## 2 2Ge+her The Hardest Part Of ... 2000-09-02 91 87 92 NA
## 3 3 Doors Down Kryptonite 2000-04-08 81 70 68 67
## 4 3 Doors Down Loser 2000-10-21 76 76 72 69
## 5 504 Boyz Wobble Wobble 2000-04-15 57 34 25 17
## 6 98^0 Give Me Just One Nig... 2000-08-19 51 39 34 26
In Excel, you drag “Region” to Rows and “Sales” to Values. In R, we
use group_by() and summarize().
The gapminder dataset contains life expectancy, GDP per
capita, and population by country over time.
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
This is exactly what an Excel Pivot Table does.
gapminder_pivot <- gapminder %>%
group_by(continent, year) %>%
summarize(avg_lifeExp = mean(lifeExp)) %>%
pivot_wider(names_from = year, values_from = avg_lifeExp)
knitr::kable(gapminder_pivot, caption = "Pivot Table: Avg Life Expectancy by Continent")| continent | 1952 | 1957 | 1962 | 1967 | 1972 | 1977 | 1982 | 1987 | 1992 | 1997 | 2002 | 2007 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Africa | 39.13550 | 41.26635 | 43.31944 | 45.33454 | 47.45094 | 49.58042 | 51.59287 | 53.34479 | 53.62958 | 53.59827 | 53.32523 | 54.80604 |
| Americas | 53.27984 | 55.96028 | 58.39876 | 60.41092 | 62.39492 | 64.39156 | 66.22884 | 68.09072 | 69.56836 | 71.15048 | 72.42204 | 73.60812 |
| Asia | 46.31439 | 49.31854 | 51.56322 | 54.66364 | 57.31927 | 59.61056 | 62.61794 | 64.85118 | 66.53721 | 68.02052 | 69.23388 | 70.72848 |
| Europe | 64.40850 | 66.70307 | 68.53923 | 69.73760 | 70.77503 | 71.93777 | 72.80640 | 73.64217 | 74.44010 | 75.50517 | 76.70060 | 77.64860 |
| Oceania | 69.25500 | 70.29500 | 71.08500 | 71.31000 | 71.91000 | 72.85500 | 74.29000 | 75.32000 | 76.94500 | 78.19000 | 79.74000 | 80.71950 |
The reason we prefer Long Format in R is because
ggplot2 requires it to map aesthetics (colors, facets) to
variables.
# We use the LONG version of our summary for plotting
gapminder %>%
group_by(continent, year) %>%
summarize(avg_lifeExp = mean(lifeExp)) %>%
ggplot(aes(x = year, y = avg_lifeExp, color = continent)) +
geom_line(size = 1) +
geom_point() +
labs(title = "Life Expectancy over Time",
subtitle = "Summarized using pivot-style grouping",
y = "Average Life Expectancy",
x = "Year") +
theme_minimal()What if you want to calculate the Mean AND the Median of life expectancy?
complex_pivot <- gapminder %>%
filter(year == 2007) %>%
group_by(continent) %>%
summarize(
Mean_LifeExp = mean(lifeExp),
Max_LifeExp = max(lifeExp)
)
complex_pivot## # A tibble: 5 × 3
## continent Mean_LifeExp Max_LifeExp
## <fct> <dbl> <dbl>
## 1 Africa 54.8 76.4
## 2 Americas 73.6 80.7
## 3 Asia 70.7 82.6
## 4 Europe 77.6 81.8
## 5 Oceania 80.7 81.2
| Feature | Excel Pivot | R (Tidyverse) |
|---|---|---|
| Rows | Rows Field | group_by() |
| Values | Values Field (Sum, Avg) | summarize(mean(...)) |
| Columns | Columns Field | pivot_wider() |
| Filter | Filters | filter() |
pivot_longer(): Make data “tall” (Wide \(\rightarrow\) Long).pivot_wider(): Make data “wide” (Long \(\rightarrow\) Wide).group_by() %>% summarize(): The engine that
calculates the numbers.Dataset: msleep (Built-in R dataset
about mammal sleep patterns).
sleep_total for each
vore (carnivore, herbivore, etc.).vore types are column
headers.# Try it yourself!
msleep %>%
drop_na(vore) %>%
group_by(vore) %>%
summarize(avg_sleep = mean(sleep_total)) %>%
pivot_wider(names_from = vore, values_from = avg_sleep)## # A tibble: 1 × 4
## carni herbi insecti omni
## <dbl> <dbl> <dbl> <dbl>
## 1 10.4 9.51 14.9 10.9
```
install.packages(c("tidyverse", "gapminder")) in your
console.gapminder is a
standard in data science education because it’s clean but reflects
real-world trends (increasing life expectancy).