HDS 5.6

Author

Sandeep Thapa Chhetri

library(tidyverse)
library(gapminder)

Begin by loading the tidyverse and gapminder packages in the code chunk above and adding your name as the author.

The dplyr functions group_by() and summarize() (or summarise()) let you aggregate data with respect to a categorical variable, producing summary statistics for each level of that variable. We will again use the gapminder data to illustrate. Each code chunk below should start with the original gapminder data frame, followed by a sequence of functions that create the new data frame, connected by the pipe operator, |>.
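
Before turning to gapminder, the idea can be sketched on a tiny made-up data set. The data frame toy and its columns group and value are hypothetical and exist only for this illustration; the point is that summarize() returns one row per level of the grouping variable.

```r
library(dplyr)

# Hypothetical toy data: three measurements in each of two groups
toy <- tibble(
  group = c("a", "a", "a", "b", "b", "b"),
  value = c(1, 2, 3, 10, 20, 30)
)

# One output row per group, with the group mean and the group size
toy_summary <- toy |>
  group_by(group) |>
  summarize(mean_value = mean(value),
            n = n())

toy_summary
```

Here group "a" has mean 2 and group "b" has mean 20, each based on 3 rows.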

Aggregating Data

Let’s start by creating a data frame consisting of the mean and standard deviation of gdpPercap for each country. Modify this code by filling in the ______ to do so:

gapminder |>
  group_by(country) |>
  summarize(mean_gdpPercap = mean(gdpPercap, na.rm = TRUE),
            sd_gdpPercap   = sd(gdpPercap, na.rm = TRUE))

# A tibble: 142 × 3
   country     mean_gdpPercap sd_gdpPercap
   <fct>                <dbl>        <dbl>
 1 Afghanistan           803.         108.
 2 Albania              3255.        1192.
 3 Algeria              4426.        1310.
 4 Angola               3607.        1166.
 5 Argentina            8956.        1863.
 6 Australia           19981.        7815.
 7 Austria             20412.        9655.
 8 Bahrain             18078.        5415.
 9 Bangladesh            818.         235.
10 Belgium             19901.        8391.
# ℹ 132 more rows

Now create a data frame that consists of the mean lifeExp, pop, and gdpPercap for each continent in each year:

gapminder |>
  group_by(continent, year) |>
  summarize(mean_lifeExp   = mean(lifeExp, na.rm = TRUE),
            mean_pop       = mean(pop, na.rm = TRUE),
            mean_gdpPercap = mean(gdpPercap, na.rm = TRUE))

`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by continent and year.
ℹ Output is grouped by continent.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(continent, year))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.

# A tibble: 60 × 5
# Groups:   continent [5]
   continent  year mean_lifeExp  mean_pop mean_gdpPercap
   <fct>     <int>        <dbl>     <dbl>          <dbl>
 1 Africa     1952         39.1  4570010.          1253.
 2 Africa     1957         41.3  5093033.          1385.
 3 Africa     1962         43.3  5702247.          1598.
 4 Africa     1967         45.3  6447875.          2050.
 5 Africa     1972         47.5  7305376.          2340.
 6 Africa     1977         49.6  8328097.          2586.
 7 Africa     1982         51.6  9602857.          2482.
 8 Africa     1987         53.3 11054502.          2283.
 9 Africa     1992         53.6 12674645.          2282.
10 Africa     1997         53.6 14304480.          2379.
# ℹ 50 more rows
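
The message printed with this result says that the output is still grouped by continent. If a fully ungrouped data frame is wanted, the two options named in that message can be sketched as follows. This is an illustration rather than part of the exercise, and the object names by_group and by_arg are made up here; the .by argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)
library(gapminder)

# Option 1: summarize by group, then drop all grouping explicitly
by_group <- gapminder |>
  group_by(continent, year) |>
  summarize(mean_lifeExp = mean(lifeExp, na.rm = TRUE),
            .groups = "drop")

# Option 2: per-operation grouping with .by (assumes dplyr >= 1.1.0);
# the result is never grouped, so no message is printed
by_arg <- gapminder |>
  summarize(mean_lifeExp = mean(lifeExp, na.rm = TRUE),
            .by = c(continent, year))

# Both checks report that no grouping remains
is_grouped_df(by_group)
is_grouped_df(by_arg)
```

Both versions contain the same 60 continent-year means. The row order can differ, because .by keeps groups in order of first appearance instead of sorting them.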

Now create the exact same data frame, but using the across() function:

gapminder |>
  group_by(continent, year) |>
  summarize(across(c(lifeExp, pop, gdpPercap),
                   ~ mean(.x, na.rm = TRUE),
                   .names = "mean_{.col}"))

`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by continent and year.
ℹ Output is grouped by continent.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(continent, year))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.

# A tibble: 60 × 5
# Groups:   continent [5]
   continent  year mean_lifeExp  mean_pop mean_gdpPercap
   <fct>     <int>        <dbl>     <dbl>          <dbl>
 1 Africa     1952         39.1  4570010.          1253.
 2 Africa     1957         41.3  5093033.          1385.
 3 Africa     1962         43.3  5702247.          1598.
 4 Africa     1967         45.3  6447875.          2050.
 5 Africa     1972         47.5  7305376.          2340.
 6 Africa     1977         49.6  8328097.          2586.
 7 Africa     1982         51.6  9602857.          2482.
 8 Africa     1987         53.3 11054502.          2283.
 9 Africa     1992         53.6 12674645.          2282.
10 Africa     1997         53.6 14304480.          2379.
# ℹ 50 more rows
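
across() becomes more useful when several summary functions are applied at once. As a sketch that goes beyond what the exercise asks for, a named list of functions can be passed, and the {.fn} placeholder in .names picks up each list name. The object name multi_stats is made up for this example.

```r
library(dplyr)
library(gapminder)

# Mean and standard deviation of two variables for each continent
multi_stats <- gapminder |>
  group_by(continent) |>
  summarize(across(c(lifeExp, gdpPercap),
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        sd   = ~ sd(.x, na.rm = TRUE)),
                   .names = "{.fn}_{.col}"))

# Column names combine the function name and the original column name
names(multi_stats)
```

The result has one row per continent and the columns continent, mean_lifeExp, sd_lifeExp, mean_gdpPercap, and sd_gdpPercap.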

Create a data frame that consists of the mean pop for each continent in 2007 and add a variable that is the sample size for each mean:

gapminder |>
  filter(year == 2007) |>
  group_by(continent) |>
  summarize(mean_pop = mean(pop, na.rm = TRUE),
            n = n())

# A tibble: 5 × 3
  continent   mean_pop     n
  <fct>          <dbl> <int>
1 Africa     17875763.    52
2 Americas   35954847.    25
3 Asia      115513752.    33
4 Europe     19536618.    30
5 Oceania    12274974.     2
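
When only the group sizes are needed, dplyr's count() is a shorthand for group_by() followed by summarize(n = n()). The sketch below is an aside rather than part of the exercise, and the object name n_2007 is made up for it.

```r
library(dplyr)
library(gapminder)

# Number of countries per continent in 2007
n_2007 <- gapminder |>
  filter(year == 2007) |>
  count(continent)

n_2007
```

The n column should match the sample sizes in the table above.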

Create a data frame that consists of the maximum lifeExp for each country:

gapminder |>
  group_by(country) |>
  summarize(max_lifeExp = max(lifeExp, na.rm = TRUE))

# A tibble: 142 × 2
   country     max_lifeExp
   <fct>             <dbl>
 1 Afghanistan        43.8
 2 Albania            76.4
 3 Algeria            72.3
 4 Angola             42.7
 5 Argentina          75.3
 6 Australia          81.2
 7 Austria            79.8
 8 Bahrain            75.6
 9 Bangladesh         64.1
10 Belgium            79.4
# ℹ 132 more rows

It’s now time to learn a new dplyr function. The arrange() function sorts the rows of a data frame by the variable (or variables) passed to it, in ascending order by default. For example, this code sorts the data by pop, smallest to largest:

gapminder |>
  arrange(pop)

# A tibble: 1,704 × 6
   country               continent  year lifeExp   pop gdpPercap
   <fct>                 <fct>     <int>   <dbl> <int>     <dbl>
 1 Sao Tome and Principe Africa     1952    46.5 60011      880.
 2 Sao Tome and Principe Africa     1957    48.9 61325      861.
 3 Djibouti              Africa     1952    34.8 63149     2670.
 4 Sao Tome and Principe Africa     1962    51.9 65345     1072.
 5 Sao Tome and Principe Africa     1967    54.4 70787     1385.
 6 Djibouti              Africa     1957    37.3 71851     2865.
 7 Sao Tome and Principe Africa     1972    56.5 76595     1533.
 8 Sao Tome and Principe Africa     1977    58.6 86796     1738.
 9 Djibouti              Africa     1962    39.7 89898     3021.
10 Sao Tome and Principe Africa     1982    60.4 98593     1890.
# ℹ 1,694 more rows

If you want to sort from largest to smallest, use the desc() function inside arrange():

gapminder |>
  arrange(desc(pop))

# A tibble: 1,704 × 6
   country continent  year lifeExp        pop gdpPercap
   <fct>   <fct>     <int>   <dbl>      <int>     <dbl>
 1 China   Asia       2007    73.0 1318683096     4959.
 2 China   Asia       2002    72.0 1280400000     3119.
 3 China   Asia       1997    70.4 1230075000     2289.
 4 China   Asia       1992    68.7 1164970000     1656.
 5 India   Asia       2007    64.7 1110396331     2452.
 6 China   Asia       1987    67.3 1084035000     1379.
 7 India   Asia       2002    62.9 1034172547     1747.
 8 China   Asia       1982    65.5 1000281000      962.
 9 India   Asia       1997    61.8  959000000     1459.
10 China   Asia       1977    64.0  943455000      741.
# ℹ 1,694 more rows
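
arrange() also accepts more than one variable, with later variables breaking ties in earlier ones. As a small sketch (the object name sorted_2007 is made up for this example), the code below sorts the 2007 data by continent and then, within each continent, by pop from largest to smallest.

```r
library(dplyr)
library(gapminder)

# Sort by continent first, then by descending population within continent
sorted_2007 <- gapminder |>
  filter(year == 2007) |>
  arrange(continent, desc(pop))

sorted_2007
```

Because continent is a factor, its rows are ordered by factor level, so the African countries come first.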

Now, create a data frame that consists of the maximum lifeExp for each country and then sort the countries from largest maximum lifeExp to the smallest:

gapminder |>
  group_by(country) |>
  summarize(max_lifeExp = max(lifeExp, na.rm = TRUE)) |>
  arrange(desc(max_lifeExp))

# A tibble: 142 × 2
   country          max_lifeExp
   <fct>                  <dbl>
 1 Japan                   82.6
 2 Hong Kong, China        82.2
 3 Iceland                 81.8
 4 Switzerland             81.7
 5 Australia               81.2
 6 Spain                   80.9
 7 Sweden                  80.9
 8 Israel                  80.7
 9 France                  80.7
10 Canada                  80.7
# ℹ 132 more rows
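
If only the top few countries are of interest, one possible alternative to sorting the whole table is slice_max(), available in dplyr 1.0.0 and later. This is a sketch of that option, not the approach the exercise asks for, and the object name top3 is made up here.

```r
library(dplyr)
library(gapminder)

# Keep only the three countries with the highest maximum lifeExp
top3 <- gapminder |>
  group_by(country) |>
  summarize(max_lifeExp = max(lifeExp, na.rm = TRUE)) |>
  slice_max(max_lifeExp, n = 3)

top3
```

The three rows returned are the first three rows of the sorted table above. Note that slice_max() keeps ties by default, so it can return more than n rows when values are tied at the cutoff.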

Lastly, create a data frame that consists of the mean lifeExp for each continent by year and make a line plot of the mean lifeExp with different colors for each continent:

gapminder |>
  group_by(continent, year) |>
  summarize(mean_lifeExp = mean(lifeExp, na.rm = TRUE)) |>
  ggplot(aes(x = year, y = mean_lifeExp, color = continent)) +
  geom_line() +
  labs(title = "Mean Life Expectancy by Continent",
       x = "Year",
       y = "Mean Life Expectancy")

`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by continent and year.
ℹ Output is grouped by continent.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(continent, year))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.