HDS 5.6

Author

Gabriel Issa

library(tidyverse)
library(gapminder)

Begin by loading the tidyverse and gapminder packages in the code chunk above and adding your name as the author.

The dplyr functions group_by() and summarize() (or summarise()) let you aggregate your data with respect to a categorical variable, producing summary statistics for each level of that variable. We will again use the gapminder data to illustrate. Each code chunk below should start with the original gapminder data frame, followed by a sequence of functions that create the new data frame, connected by the pipe operator, |>.
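
As a minimal sketch of this pattern, the following uses a small made-up data frame so the per-group results can be checked by hand. The names `scores`, `team`, and `points` are hypothetical and not part of gapminder.

```r
library(dplyr)

# Hypothetical toy data: two teams with three scores each
scores <- tibble(
  team   = c("a", "a", "a", "b", "b", "b"),
  points = c(1, 2, 3, 10, 20, 30)
)

# group_by() defines the groups; summarize() returns one row per group
team_summary <- scores |>
  group_by(team) |>
  summarize(mean_points = mean(points),
            sd_points   = sd(points))

team_summary
```

The result has one row per level of `team`, with one column for each summary statistic requested.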

Aggregating Data

Let’s start by creating a data frame consisting of the mean and standard deviation of gdpPercap for each country. Modify this code by filling in the ______ to do so:

gapminder |>
  group_by(country) |>
  summarize(mean = mean(gdpPercap, na.rm = TRUE),
            sd = sd(gdpPercap, na.rm = TRUE))

# A tibble: 142 × 3
   country       mean    sd
   <fct>        <dbl> <dbl>
 1 Afghanistan   803.  108.
 2 Albania      3255. 1192.
 3 Algeria      4426. 1310.
 4 Angola       3607. 1166.
 5 Argentina    8956. 1863.
 6 Australia   19981. 7815.
 7 Austria     20412. 9655.
 8 Bahrain     18078. 5415.
 9 Bangladesh    818.  235.
10 Belgium     19901. 8391.
# ℹ 132 more rows

Now create a data frame that consists of the mean lifeExp, pop, and gdpPercap for each continent in each year:

gapminder |>
  group_by(continent, year) |>
  summarize(mean_life = mean(lifeExp, na.rm = TRUE),
            mean_pop = mean(pop, na.rm = TRUE),
            mean_gdp = mean(gdpPercap, na.rm = TRUE))

`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by continent and year.
ℹ Output is grouped by continent.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(continent, year))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.

# A tibble: 60 × 5
# Groups:   continent [5]
   continent  year mean_life  mean_pop mean_gdp
   <fct>     <int>     <dbl>     <dbl>    <dbl>
 1 Africa     1952      39.1  4570010.    1253.
 2 Africa     1957      41.3  5093033.    1385.
 3 Africa     1962      43.3  5702247.    1598.
 4 Africa     1967      45.3  6447875.    2050.
 5 Africa     1972      47.5  7305376.    2340.
 6 Africa     1977      49.6  8328097.    2586.
 7 Africa     1982      51.6  9602857.    2482.
 8 Africa     1987      53.3 11054502.    2283.
 9 Africa     1992      53.6 12674645.    2282.
10 Africa     1997      53.6 14304480.    2379.
# ℹ 50 more rows
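
The message printed with this output reports that the result is still grouped by continent. Below is a sketch of two ways to get an ungrouped result instead, assuming dplyr 1.1.0 or later for the `.by` argument. The object names `means_dropped` and `means_by` are illustrative only.

```r
library(dplyr)
library(gapminder)

# Option 1: group as before, then drop all grouping after summarizing
means_dropped <- gapminder |>
  group_by(continent, year) |>
  summarize(mean_life = mean(lifeExp, na.rm = TRUE),
            .groups = "drop")

# Option 2: per-operation grouping, with no group_by() step at all
means_by <- gapminder |>
  summarize(mean_life = mean(lifeExp, na.rm = TRUE),
            .by = c(continent, year))

# Neither result carries grouping forward
is_grouped_df(means_dropped)
is_grouped_df(means_by)
```

One difference to keep in mind: group_by() sorts the output by the grouping variables, while `.by` keeps groups in the order they first appear in the data.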

Now create the exact same data frame, but using the across() function:

gapminder |>
  group_by(continent, year) |>
  summarize(
    across(
      c(lifeExp, pop, gdpPercap),
      # Pass extra arguments through an anonymous function;
      # supplying them via `...` is deprecated as of dplyr 1.1.0
      \(x) mean(x, na.rm = TRUE)))

`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by continent and year.
ℹ Output is grouped by continent.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(continent, year))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.

# A tibble: 60 × 5
# Groups:   continent [5]
   continent  year lifeExp       pop gdpPercap
   <fct>     <int>   <dbl>     <dbl>     <dbl>
 1 Africa     1952    39.1  4570010.     1253.
 2 Africa     1957    41.3  5093033.     1385.
 3 Africa     1962    43.3  5702247.     1598.
 4 Africa     1967    45.3  6447875.     2050.
 5 Africa     1972    47.5  7305376.     2340.
 6 Africa     1977    49.6  8328097.     2586.
 7 Africa     1982    51.6  9602857.     2482.
 8 Africa     1987    53.3 11054502.     2283.
 9 Africa     1992    53.6 12674645.     2282.
10 Africa     1997    53.6 14304480.     2379.
# ℹ 50 more rows
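
The values here match the earlier result, but the columns keep their original names (lifeExp, pop, gdpPercap) rather than descriptive ones. A sketch of one way to rename them inside across(), using its `.names` argument; the object name `continent_year_means` is illustrative only.

```r
library(dplyr)
library(gapminder)

continent_year_means <- gapminder |>
  group_by(continent, year) |>
  summarize(
    across(
      c(lifeExp, pop, gdpPercap),
      \(x) mean(x, na.rm = TRUE),
      # "{.col}" is replaced by each input column's name
      .names = "mean_{.col}"
    ),
    .groups = "drop_last"
  )

names(continent_year_means)
```

This yields mean_lifeExp, mean_pop, and mean_gdpPercap, which is close to the hand-written names but not identical; a rename() step would be needed to match mean_life and mean_gdp exactly.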

Create a data frame that consists of the mean pop for each continent in 2007 and add a variable that gives the sample size for each mean:

gapminder |>
  filter(year == 2007) |>
  group_by(continent) |>
  summarize(mean = mean(pop, na.rm = TRUE),
            n = n())

# A tibble: 5 × 3
  continent       mean     n
  <fct>          <dbl> <int>
1 Africa     17875763.    52
2 Americas   35954847.    25
3 Asia      115513752.    33
4 Europe     19536618.    30
5 Oceania    12274974.     2
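
When only the group sizes are needed, dplyr's count() is a shorthand for the group_by() plus summarize(n = n()) combination. A brief sketch comparing the two; the object names `sizes_long` and `sizes_short` are illustrative only.

```r
library(dplyr)
library(gapminder)

# Long form: group, then count rows in each group
sizes_long <- gapminder |>
  filter(year == 2007) |>
  group_by(continent) |>
  summarize(n = n())

# Short form: count() groups and tallies in one step
sizes_short <- gapminder |>
  filter(year == 2007) |>
  count(continent)

# Compare the per-continent counts from both approaches
identical(sizes_long$n, sizes_short$n)
```

The long form remains the better fit when the count sits alongside other summaries, as with the mean above.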

Create a data frame that consists of the maximum lifeExp for each country:

gapminder |>
  group_by(country) |>
  summarize(max = max(lifeExp, na.rm = TRUE))

# A tibble: 142 × 2
   country       max
   <fct>       <dbl>
 1 Afghanistan  43.8
 2 Albania      76.4
 3 Algeria      72.3
 4 Angola       42.7
 5 Argentina    75.3
 6 Australia    81.2
 7 Austria      79.8
 8 Bahrain      75.6
 9 Bangladesh   64.1
10 Belgium      79.4
# ℹ 132 more rows

It’s now time to learn a new dplyr function. The arrange() function sorts the rows of a data frame by the variable (or variables) passed to it. For example, this code sorts the data by pop, smallest to largest:

gapminder |>
  arrange(pop)

# A tibble: 1,704 × 6
   country               continent  year lifeExp   pop gdpPercap
   <fct>                 <fct>     <int>   <dbl> <int>     <dbl>
 1 Sao Tome and Principe Africa     1952    46.5 60011      880.
 2 Sao Tome and Principe Africa     1957    48.9 61325      861.
 3 Djibouti              Africa     1952    34.8 63149     2670.
 4 Sao Tome and Principe Africa     1962    51.9 65345     1072.
 5 Sao Tome and Principe Africa     1967    54.4 70787     1385.
 6 Djibouti              Africa     1957    37.3 71851     2865.
 7 Sao Tome and Principe Africa     1972    56.5 76595     1533.
 8 Sao Tome and Principe Africa     1977    58.6 86796     1738.
 9 Djibouti              Africa     1962    39.7 89898     3021.
10 Sao Tome and Principe Africa     1982    60.4 98593     1890.
# ℹ 1,694 more rows

If you want to sort from largest to smallest, use the desc() function inside arrange():

gapminder |>
  arrange(desc(pop))

# A tibble: 1,704 × 6
   country continent  year lifeExp        pop gdpPercap
   <fct>   <fct>     <int>   <dbl>      <int>     <dbl>
 1 China   Asia       2007    73.0 1318683096     4959.
 2 China   Asia       2002    72.0 1280400000     3119.
 3 China   Asia       1997    70.4 1230075000     2289.
 4 China   Asia       1992    68.7 1164970000     1656.
 5 India   Asia       2007    64.7 1110396331     2452.
 6 China   Asia       1987    67.3 1084035000     1379.
 7 India   Asia       2002    62.9 1034172547     1747.
 8 China   Asia       1982    65.5 1000281000      962.
 9 India   Asia       1997    61.8  959000000     1459.
10 China   Asia       1977    64.0  943455000      741.
# ℹ 1,694 more rows
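
arrange() also accepts several sorting variables, with later ones breaking ties in earlier ones, and desc() can be applied to any of them individually. A minimal sketch on made-up data; the names `cities`, `region`, and `pop` below are hypothetical and unrelated to gapminder.

```r
library(dplyr)

# Hypothetical toy data
cities <- tibble(
  region = c("north", "south", "north", "south"),
  pop    = c(50, 20, 80, 90)
)

# Sort by region (A to Z), then by pop from largest to smallest within region
sorted <- cities |>
  arrange(region, desc(pop))

sorted$pop
# 80 50 90 20
```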

Now, create a data frame that consists of the maximum lifeExp for each country and then sort the countries from largest maximum lifeExp to the smallest:

gapminder |>
  group_by(country) |>
  summarize(max = max(lifeExp, na.rm = TRUE)) |>
  arrange(desc(max))

# A tibble: 142 × 2
   country            max
   <fct>            <dbl>
 1 Japan             82.6
 2 Hong Kong, China  82.2
 3 Iceland           81.8
 4 Switzerland       81.7
 5 Australia         81.2
 6 Spain             80.9
 7 Sweden            80.9
 8 Israel            80.7
 9 France            80.7
10 Canada            80.7
# ℹ 132 more rows
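
If only the top few rows are of interest, dplyr's slice_max() can stand in for the arrange(desc()) step and keep just those rows. A sketch under that assumption; the column name `max_life` and object name `top3` are illustrative only.

```r
library(dplyr)
library(gapminder)

top3 <- gapminder |>
  group_by(country) |>
  summarize(max_life = max(lifeExp, na.rm = TRUE)) |>
  # Keep the 3 rows with the largest max_life; by default, tied values
  # are all kept, so more than 3 rows can be returned when ties occur
  slice_max(max_life, n = 3)

top3
```

Setting `with_ties = FALSE` would cap the result at exactly `n` rows.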

Lastly, create a data frame that consists of the mean lifeExp for each continent by year and make a line plot of the mean lifeExp with different colors for each continent:

gapminder |>
  group_by(continent, year) |>
  summarize(mean_life = mean(lifeExp, na.rm = TRUE)) |>
  ggplot(aes(x = year, y = mean_life, color = continent)) +
  # `linewidth` replaces the `size` aesthetic for lines as of ggplot2 3.4.0
  geom_line(linewidth = 6.7) +
  labs(title = "Mean life expectancy for each continent with no particular size",
       x = "Year",
       y = "Life expectancy")

`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by continent and year.
ℹ Output is grouped by continent.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(continent, year))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.