HDS 5.6

Author

Ben Lopez

library(tidyverse)
library(gapminder)

Begin by loading the tidyverse and gapminder packages in the code chunk above and adding your name as the author.

The dplyr functions group_by() and summarize() (or summarise()) let you aggregate your data with respect to a categorical variable, producing summary statistics for each level of that variable. We will again use the gapminder data to illustrate. Each code chunk below should start with the original gapminder data frame, followed by a sequence of functions that create the new data frame, connected by the pipe operator, |>.

Aggregating Data

Let’s start by creating a data frame consisting of the mean and standard deviation of gdpPercap for each country. Modify this code by filling in the ______ to do so:

gapminder |>
  group_by(country) |>
  summarize(gdp_mean = mean(gdpPercap, na.rm = TRUE),
            gdp_sd = sd(gdpPercap, na.rm = TRUE))

# A tibble: 142 × 3
   country     gdp_mean gdp_sd
   <fct>          <dbl>  <dbl>
 1 Afghanistan     803.   108.
 2 Albania        3255.  1192.
 3 Algeria        4426.  1310.
 4 Angola         3607.  1166.
 5 Argentina      8956.  1863.
 6 Australia     19981.  7815.
 7 Austria       20412.  9655.
 8 Bahrain       18078.  5415.
 9 Bangladesh      818.   235.
10 Belgium       19901.  8391.
# ℹ 132 more rows
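To see what group_by() and summarize() do on data small enough to check by hand, here is a minimal sketch. The toy data frame and its column names (group, value) are made up for illustration and are not part of gapminder.

```r
library(dplyr)

# Hypothetical toy data: two groups with values that are easy to verify by hand
toy <- tibble(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, 10, 20)
)

# One output row per level of `group`, one column per summary statistic
toy_summary <- toy |>
  group_by(group) |>
  summarize(value_mean = mean(value, na.rm = TRUE),
            value_sd = sd(value, na.rm = TRUE))

toy_summary
```

Group a has mean (1 + 3) / 2 = 2 and group b has mean (10 + 20) / 2 = 15, so the result has two rows, just as the gapminder result has one row per country.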

Now create a data frame that consists of the mean lifeExp, pop, and gdpPercap for each continent in each year:

# Group by continent and year (not gdpPercap), so there is one row per
# continent-year pair: 5 continents x 12 years = 60 rows
gapminder |>
  group_by(continent, year) |>
  summarize(lifeExp_mean = mean(lifeExp, na.rm = TRUE),
            pop_mean = mean(pop, na.rm = TRUE),
            gdpPercap_mean = mean(gdpPercap, na.rm = TRUE),
            .groups = "drop")

Now create the exact same data frame, but using the across() function:

# Same grouping as before; the anonymous function replaces the deprecated
# across(..., mean, na.rm = TRUE) form, and .names reproduces the *_mean names
gapminder |>
  group_by(continent, year) |>
  summarize(across(c(lifeExp, pop, gdpPercap),
                   \(x) mean(x, na.rm = TRUE),
                   .names = "{.col}_mean"),
            .groups = "drop")
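The following sketch shows, on a small made-up data frame, how across() applies one function to several columns and how its .names argument controls the output column names. The data and the names toy, x, and y are hypothetical; the anonymous-function form is the one recommended since dplyr 1.1.0.

```r
library(dplyr)

# Hypothetical toy data with one missing value in x
toy <- tibble(
  group = c("a", "a", "b", "b"),
  x = c(1, 3, 10, NA),
  y = c(2, 4, 6, 8)
)

# Apply the same summary to x and y; "{.col}_mean" appends a suffix
# to each original column name
toy_means <- toy |>
  group_by(group) |>
  summarize(across(c(x, y),
                   \(v) mean(v, na.rm = TRUE),
                   .names = "{.col}_mean"))

toy_means
```

Without .names, the summary columns keep the original names (x and y), which is why an across() result can differ in column names from a version that spells out each summary by hand.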

Create a data frame that consists of the mean pop for each continent in 2007 and add a variable that is the sample size for each mean:

# Keep only the 2007 rows before grouping; n() counts the rows behind each mean
gapminder |>
  filter(year == 2007) |>
  group_by(continent) |>
  summarize(pop_mean = mean(pop, na.rm = TRUE),
            n = n())
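As a small check of the filter-then-summarize pattern with a sample size column, the sketch below uses a made-up data frame (toy, with hypothetical columns group, year, and value). n() is the dplyr helper that returns the number of rows in the current group.

```r
library(dplyr)

# Hypothetical toy data: only the 2007 rows should contribute to the summary
toy <- tibble(
  group = c("a", "a", "a", "b", "b"),
  year  = c(2002, 2007, 2007, 2007, 2002),
  value = c(5, 10, 20, 7, 100)
)

toy_2007 <- toy |>
  filter(year == 2007) |>          # drop every row from other years
  group_by(group) |>
  summarize(value_mean = mean(value, na.rm = TRUE),
            n = n())               # rows per group after filtering

toy_2007
```

Here group a averages 10 and 20 over n = 2 rows, and group b has the single value 7 with n = 1. Note that n() counts rows, so it equals the number of values averaged only when the column has no missing values.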

Create a data frame that consists of the maximum lifeExp for each country:

gapminder |> 
  group_by(country) |>
  summarize(across(lifeExp, \(x) max(x, na.rm = TRUE)))

# A tibble: 142 × 2
   country     lifeExp
   <fct>         <dbl>
 1 Afghanistan    43.8
 2 Albania        76.4
 3 Algeria        72.3
 4 Angola         42.7
 5 Argentina      75.3
 6 Australia      81.2
 7 Austria        79.8
 8 Bahrain        75.6
 9 Bangladesh     64.1
10 Belgium        79.4
# ℹ 132 more rows

It’s now time to learn a new dplyr function. The arrange() function sorts the rows of a data frame by the variable (or variables) named inside it. For example, this code sorts the data by pop, smallest to largest:

gapminder |>
  arrange(pop)

# A tibble: 1,704 × 6
   country               continent  year lifeExp   pop gdpPercap
   <fct>                 <fct>     <int>   <dbl> <int>     <dbl>
 1 Sao Tome and Principe Africa     1952    46.5 60011      880.
 2 Sao Tome and Principe Africa     1957    48.9 61325      861.
 3 Djibouti              Africa     1952    34.8 63149     2670.
 4 Sao Tome and Principe Africa     1962    51.9 65345     1072.
 5 Sao Tome and Principe Africa     1967    54.4 70787     1385.
 6 Djibouti              Africa     1957    37.3 71851     2865.
 7 Sao Tome and Principe Africa     1972    56.5 76595     1533.
 8 Sao Tome and Principe Africa     1977    58.6 86796     1738.
 9 Djibouti              Africa     1962    39.7 89898     3021.
10 Sao Tome and Principe Africa     1982    60.4 98593     1890.
# ℹ 1,694 more rows

If you want to sort from largest to smallest, use the desc() function inside arrange():

gapminder |>
  arrange(desc(pop))

# A tibble: 1,704 × 6
   country continent  year lifeExp        pop gdpPercap
   <fct>   <fct>     <int>   <dbl>      <int>     <dbl>
 1 China   Asia       2007    73.0 1318683096     4959.
 2 China   Asia       2002    72.0 1280400000     3119.
 3 China   Asia       1997    70.4 1230075000     2289.
 4 China   Asia       1992    68.7 1164970000     1656.
 5 India   Asia       2007    64.7 1110396331     2452.
 6 China   Asia       1987    67.3 1084035000     1379.
 7 India   Asia       2002    62.9 1034172547     1747.
 8 China   Asia       1982    65.5 1000281000      962.
 9 India   Asia       1997    61.8  959000000     1459.
10 China   Asia       1977    64.0  943455000      741.
# ℹ 1,694 more rows
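arrange() also accepts more than one sorting variable, with later variables breaking ties in earlier ones. The sketch below illustrates this on a made-up data frame (toy, group, and value are hypothetical names), mixing an ascending key with a descending one.

```r
library(dplyr)

# Hypothetical toy data in no particular order
toy <- tibble(
  group = c("b", "a", "b", "a"),
  value = c(1, 5, 9, 3)
)

# Sort by group (ascending), then by value (descending) within each group
toy_sorted <- toy |>
  arrange(group, desc(value))

toy_sorted
```

The rows come out as a/5, a/3, b/9, b/1: groups in alphabetical order, and the largest value first within each group.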

Now, create a data frame that consists of the maximum lifeExp for each country and then sort the countries from largest maximum lifeExp to the smallest:

gapminder |>
  group_by(country) |>
  summarize(lifeExp_max = max(lifeExp, na.rm = TRUE)) |>
  arrange(desc(lifeExp_max))

# A tibble: 142 × 2
   country          lifeExp_max
   <fct>                  <dbl>
 1 Japan                   82.6
 2 Hong Kong, China        82.2
 3 Iceland                 81.8
 4 Switzerland             81.7
 5 Australia               81.2
 6 Spain                   80.9
 7 Sweden                  80.9
 8 Israel                  80.7
 9 France                  80.7
10 Canada                  80.7
# ℹ 132 more rows

Lastly, create a data frame that consists of the mean lifeExp for each continent by year and make a line plot of the mean lifeExp with different colors for each continent:

gapminder |>
  group_by(continent, year) |>
  summarize(lifeExp_mean = mean(lifeExp, na.rm = TRUE)) |>
  ggplot(mapping = aes(x = year, y = lifeExp_mean, color = continent)) +
  geom_line() +
  labs(title = "Mean life expectancy by continent over time",
       x = "Year",
       y = "Mean life expectancy (years)",
       color = "Continent")

`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by continent and year.
ℹ Output is grouped by continent.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(continent, year))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.
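The message above appears because summarize() drops only the last grouping variable by default, so the result is still grouped by the first one. The sketch below compares the default with the two alternatives the message mentions, using a made-up data frame (toy, g, h, and x are hypothetical names); group_vars() and is_grouped_df() are dplyr helpers for inspecting the grouping.

```r
library(dplyr)

# Hypothetical toy data with two grouping variables
toy <- tibble(
  g = c("a", "a", "b", "b"),
  h = c(1, 2, 1, 2),
  x = c(1, 2, 3, 4)
)

# Default: the last grouping variable (h) is dropped, g is kept
default_out <- toy |>
  group_by(g, h) |>
  summarize(x_mean = mean(x))

# .groups = "drop" returns an ungrouped data frame and silences the message
dropped_out <- toy |>
  group_by(g, h) |>
  summarize(x_mean = mean(x), .groups = "drop")

# .by groups for this one operation only (available since dplyr 1.1.0)
by_out <- toy |>
  summarize(x_mean = mean(x), .by = c(g, h))

group_vars(default_out)
group_vars(dropped_out)
is_grouped_df(by_out)
```

Leftover grouping rarely matters for a plot, but it can change the behavior of later steps such as mutate() or arrange(), so dropping it explicitly is a reasonable habit.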