library(tidyverse)
library(gapminder)
HDS 5.6
Begin by loading the tidyverse and gapminder packages in the code chunk above and adding your name as the author.
The dplyr functions group_by() and summarize() (or summarise()) let you aggregate your data with respect to a categorical variable by producing summary statistics for each level of that variable. We will again use the gapminder data to illustrate. Each code chunk below should start with the original gapminder data frame, followed by a sequence of functions that creates the new data frame, connected by the pipe operator, |>.
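As a minimal sketch of the group_by() and summarize() pattern described above, the following uses a small made-up tibble rather than gapminder. The names scores, group, value, and group_summary are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical toy data: two groups with three values each
scores <- tibble(
  group = c("a", "a", "a", "b", "b", "b"),
  value = c(1, 2, 3, 10, 20, 30)
)

# One row per level of `group`, one column per summary statistic
group_summary <- scores |>
  group_by(group) |>
  summarize(mean_value = mean(value, na.rm = TRUE),
            sd_value = sd(value, na.rm = TRUE))

group_summary
```

The result has one row for each level of the grouping variable, which is the same shape the gapminder exercises below produce.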
Aggregating Data
Let’s start by creating a data frame consisting of the mean and standard deviation of gdpPercap for each country. Modify this code by filling in the ______ to do so:
gapminder |>
  group_by(country) |>
  summarize(mean_gdpPercap = mean(gdpPercap, na.rm = TRUE),
            sd_gdpPercap = sd(gdpPercap, na.rm = TRUE))
# A tibble: 142 × 3
country mean_gdpPercap sd_gdpPercap
<fct> <dbl> <dbl>
1 Afghanistan 803. 108.
2 Albania 3255. 1192.
3 Algeria 4426. 1310.
4 Angola 3607. 1166.
5 Argentina 8956. 1863.
6 Australia 19981. 7815.
7 Austria 20412. 9655.
8 Bahrain 18078. 5415.
9 Bangladesh 818. 235.
10 Belgium 19901. 8391.
# ℹ 132 more rows
Now create a data frame that consists of the mean lifeExp, pop, and gdpPercap for each continent in each year:
gapminder |>
  group_by(continent, year) |>
  summarize(mean_lifeExp = mean(lifeExp, na.rm = TRUE),
            mean_pop = mean(pop, na.rm = TRUE),
            mean_gdpPercap = mean(gdpPercap, na.rm = TRUE))
`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by continent and year.
ℹ Output is grouped by continent.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(continent, year))` for per-operation grouping
(`?dplyr::dplyr_by`) instead.
# A tibble: 60 × 5
# Groups: continent [5]
continent year mean_lifeExp mean_pop mean_gdpPercap
<fct> <int> <dbl> <dbl> <dbl>
1 Africa 1952 39.1 4570010. 1253.
2 Africa 1957 41.3 5093033. 1385.
3 Africa 1962 43.3 5702247. 1598.
4 Africa 1967 45.3 6447875. 2050.
5 Africa 1972 47.5 7305376. 2340.
6 Africa 1977 49.6 8328097. 2586.
7 Africa 1982 51.6 9602857. 2482.
8 Africa 1987 53.3 11054502. 2283.
9 Africa 1992 53.6 12674645. 2282.
10 Africa 1997 53.6 14304480. 2379.
# ℹ 50 more rows
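The message printed with this output says the result is still grouped by continent and names two ways to control that. The sketch below tries both on a small made-up tibble. The names sales, region, year, and amount are hypothetical, and the .by argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical toy data with two grouping variables
sales <- tibble(
  region = c("east", "east", "east", "west", "west", "west"),
  year   = c(2001, 2001, 2002, 2001, 2002, 2002),
  amount = c(10, 20, 30, 40, 50, 60)
)

# Same grouping as the default, but stated explicitly so no message is printed:
# the last grouping variable (year) is dropped, region is kept
by_default <- sales |>
  group_by(region, year) |>
  summarize(mean_amount = mean(amount), .groups = "drop_last")

# .groups = "drop" returns a tibble with no grouping at all
ungrouped <- sales |>
  group_by(region, year) |>
  summarize(mean_amount = mean(amount), .groups = "drop")

# .by groups for this one call only and returns an ungrouped tibble
per_operation <- sales |>
  summarize(mean_amount = mean(amount), .by = c(region, year))

group_vars(by_default)  # "region"
group_vars(ungrouped)   # character(0)
```

Leftover grouping matters when more verbs follow in the pipeline, because later functions then operate within each remaining group.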
Now create the exact same data frame, but using the across() function:
gapminder |>
  group_by(continent, year) |>
  summarize(across(c(lifeExp, pop, gdpPercap),
                   ~ mean(.x, na.rm = TRUE),
                   .names = "mean_{.col}"))
`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by continent and year.
ℹ Output is grouped by continent.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(continent, year))` for per-operation grouping
(`?dplyr::dplyr_by`) instead.
# A tibble: 60 × 5
# Groups: continent [5]
continent year mean_lifeExp mean_pop mean_gdpPercap
<fct> <int> <dbl> <dbl> <dbl>
1 Africa 1952 39.1 4570010. 1253.
2 Africa 1957 41.3 5093033. 1385.
3 Africa 1962 43.3 5702247. 1598.
4 Africa 1967 45.3 6447875. 2050.
5 Africa 1972 47.5 7305376. 2340.
6 Africa 1977 49.6 8328097. 2586.
7 Africa 1982 51.6 9602857. 2482.
8 Africa 1987 53.3 11054502. 2283.
9 Africa 1992 53.6 12674645. 2282.
10 Africa 1997 53.6 14304480. 2379.
# ℹ 50 more rows
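To see why the two approaches give the same data frame, it can help to compare them on a small example. The sketch below uses a made-up tibble; measurements, height, and weight are hypothetical names. In the .names argument, {.col} is replaced by the name of each selected column.

```r
library(dplyr)

# Hypothetical toy data
measurements <- tibble(
  group  = c("a", "a", "b", "b"),
  height = c(1, 3, 5, 7),
  weight = c(10, 30, 50, 70)
)

# Spelling out each summary by hand
explicit <- measurements |>
  group_by(group) |>
  summarize(mean_height = mean(height, na.rm = TRUE),
            mean_weight = mean(weight, na.rm = TRUE))

# The same summaries with across(): one function applied to each selected column
with_across <- measurements |>
  group_by(group) |>
  summarize(across(c(height, weight),
                   ~ mean(.x, na.rm = TRUE),
                   .names = "mean_{.col}"))

names(explicit)
names(with_across)
```

The across() version saves more typing as the number of columns grows, since only the column selection changes.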
Create a data frame that consists of the mean pop for each continent in 2007, and add a variable that gives the sample size for each mean:
gapminder |>
  filter(year == 2007) |>
  group_by(continent) |>
  summarize(mean_pop = mean(pop, na.rm = TRUE),
            n = n())
# A tibble: 5 × 3
continent mean_pop n
<fct> <dbl> <int>
1 Africa 17875763. 52
2 Americas 35954847. 25
3 Asia 115513752. 33
4 Europe 19536618. 30
5 Oceania 12274974. 2
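One detail worth checking on a small example: n() counts the rows in each group, including rows where the summarized variable is missing. The sketch below uses a made-up tibble with one NA to show the difference. The names pets, species, weight, and n_non_missing are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with one missing weight
pets <- tibble(
  species = c("cat", "cat", "cat", "dog", "dog"),
  weight  = c(4, NA, 6, 20, 30)
)

pet_summary <- pets |>
  group_by(species) |>
  summarize(mean_weight = mean(weight, na.rm = TRUE),
            n = n(),                            # rows in the group
            n_non_missing = sum(!is.na(weight)))  # values actually averaged

pet_summary
```

When a variable has no missing values in the rows being summarized, the two counts agree and n() alone is enough.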
Create a data frame that consists of the maximum lifeExp for each country:
gapminder |>
  group_by(country) |>
  summarize(max_lifeExp = max(lifeExp, na.rm = TRUE))
# A tibble: 142 × 2
country max_lifeExp
<fct> <dbl>
1 Afghanistan 43.8
2 Albania 76.4
3 Algeria 72.3
4 Angola 42.7
5 Argentina 75.3
6 Australia 81.2
7 Austria 79.8
8 Bahrain 75.6
9 Bangladesh 64.1
10 Belgium 79.4
# ℹ 132 more rows
It’s now time to learn a new dplyr function. The arrange() function sorts the rows of a data frame by the variable or variables passed to it. For example, this code sorts the data by pop, from smallest to largest:
gapminder |>
  arrange(pop)
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Sao Tome and Principe Africa 1952 46.5 60011 880.
2 Sao Tome and Principe Africa 1957 48.9 61325 861.
3 Djibouti Africa 1952 34.8 63149 2670.
4 Sao Tome and Principe Africa 1962 51.9 65345 1072.
5 Sao Tome and Principe Africa 1967 54.4 70787 1385.
6 Djibouti Africa 1957 37.3 71851 2865.
7 Sao Tome and Principe Africa 1972 56.5 76595 1533.
8 Sao Tome and Principe Africa 1977 58.6 86796 1738.
9 Djibouti Africa 1962 39.7 89898 3021.
10 Sao Tome and Principe Africa 1982 60.4 98593 1890.
# ℹ 1,694 more rows
If you want to sort from largest to smallest, use the desc() function inside arrange():
gapminder |>
  arrange(desc(pop))
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 China Asia 2007 73.0 1318683096 4959.
2 China Asia 2002 72.0 1280400000 3119.
3 China Asia 1997 70.4 1230075000 2289.
4 China Asia 1992 68.7 1164970000 1656.
5 India Asia 2007 64.7 1110396331 2452.
6 China Asia 1987 67.3 1084035000 1379.
7 India Asia 2002 62.9 1034172547 1747.
8 China Asia 1982 65.5 1000281000 962.
9 India Asia 1997 61.8 959000000 1459.
10 China Asia 1977 64.0 943455000 741.
# ℹ 1,694 more rows
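The two sort orders above, along with sorting on more than one variable, can be tried on a small example. The sketch below uses a made-up tibble; cities, city, region, and pop are hypothetical names and values.

```r
library(dplyr)

# Hypothetical toy data
cities <- tibble(
  city   = c("A", "B", "C", "D"),
  region = c("north", "south", "north", "south"),
  pop    = c(300, 100, 200, 400)
)

# Smallest to largest
ascending <- cities |>
  arrange(pop)

# Largest to smallest
descending <- cities |>
  arrange(desc(pop))

# Several sort keys: region alphabetically, then pop from largest
# to smallest within each region
by_region <- cities |>
  arrange(region, desc(pop))

ascending$city   # "B" "C" "A" "D"
descending$city  # "D" "A" "C" "B"
by_region$city   # "A" "C" "D" "B"
```

Later variables in arrange() only break ties among rows that share the same values of the earlier ones.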
Now, create a data frame that consists of the maximum lifeExp for each country and then sort the countries from largest maximum lifeExp to the smallest:
gapminder |>
  group_by(country) |>
  summarize(max_lifeExp = max(lifeExp, na.rm = TRUE)) |>
  arrange(desc(max_lifeExp))
# A tibble: 142 × 2
country max_lifeExp
<fct> <dbl>
1 Japan 82.6
2 Hong Kong, China 82.2
3 Iceland 81.8
4 Switzerland 81.7
5 Australia 81.2
6 Spain 80.9
7 Sweden 80.9
8 Israel 80.7
9 France 80.7
10 Canada 80.7
# ℹ 132 more rows
Lastly, create a data frame that consists of the mean lifeExp for each continent by year and make a line plot of the mean lifeExp with different colors for each continent:
gapminder |>
  group_by(continent, year) |>
  summarize(mean_lifeExp = mean(lifeExp, na.rm = TRUE)) |>
  ggplot(aes(x = year, y = mean_lifeExp, color = continent)) +
  geom_line() +
  labs(title = "Mean Life Expectancy by Continent",
       x = "Year",
       y = "Mean Life Expectancy")
`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by continent and year.
ℹ Output is grouped by continent.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(continent, year))` for per-operation grouping
(`?dplyr::dplyr_by`) instead.