The plan

A short workshop covering reproducibility and code/data management; data filtering, reshaping, and joining; and summarizing.

Reproducible research and data management (~15 minutes)
Filtering and cleaning data (~30 minutes; the gapminder dataset)
Reshaping and joining data (~30 minutes)
Summarizing and manipulating data (~60 minutes; the babynames dataset)

Slides: https://rpubs.com/bpbond/874167

Requirements

This workshop assumes a basic to intermediate knowledge of R.

If you want to do the hands-on exercises (encouraged!), make sure up-to-date versions of the following packages are installed:

dplyr - fast, flexible tool for working with data frames
tidyr - reshaping and cleaning data
ggplot2 - popular package for visualizing data
gapminder - life expectancy, GDP per capita, and population for 142 countries
babynames - names provided to the SSA 1880-2017

A note about ‘tidyverse’

What we’re working with today–the “tidyverse”–constitute a particular and popular dialect of R, but the principles we’ll go over are broadly applicable.

Like anything else, dplyr has advantages and disadvantages. In particular, I do not recommend using it if you don’t already have experience with base R.

I will point out base R equivalents as we go.

More information: https://github.com/matloff/TidyverseSkeptic

The data.table package is also worth checking out for its speed, stability, and zero dependencies.

Reproducibility and data management

Reproducible research…

We are in the era of collaborative ‘big data’, but even if you work by yourself with ‘little data’ you have to have some skills to deal with those data.

Most fundamentally, your results have to be reproducible.

Your most important collaborator is your future self. It’s important to make a workflow that you can use time and time again, and even pass on to others in such a way that you don’t have to be there to walk them through it. Source

https://xkcd.com/1421/

…is the future

Prepare yourself for the future. Funders, journals, governments, colleagues are all pushing for more reproducibility and openness: open-access journals, open code, and data deposition at the peer review stage and beyond.

It’s a slow but steady ratchet.

Reproducibility generally means scripts tied to open source software with effective data management and archiving.

By ZabMilenko at English Wikipedia, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=49717967

You can’t reproduce

…what you’ve lost. What if you need access to a file as it existed 1, 10, or 100, or 1000 days ago?

Incremental backups (minimum)
Don’t depend on yourself! Has to be automatic.
Version control. A repository holds files and tracks changes: what, by whom, why

Version control

Git (and website GitHub) are the most popular version control tools for use with R, and many other languages. They offer:

Version control
Sharing work with collaborators in a repository
Issue tracking
Public or private code

https://xkcd.com/1597/

Scripts for data analysis

Version control and scripts address two of the biggest problems with managing code and data:

Tracking changes over time
Understanding and reproducing the analytical steps.

Ideally, every step in your analysis is programmatic. This means it is performed by a script that can be read and understood in the future.

Manual steps are hard to reproduce, laborious, and error-prone.

Reproducibility is a process

Don’t let the perfect be the enemy of the good.

Upgrade and improve your workflow and skills over time:

Organizing analyses so that they are reproducible is not easy. It requires diligence and a considerable investment of time: to learn new computational tools, and to organize and document analyses as you go.

But partially reproducible is better than not at all reproducible. Just try to make your next paper or project better organized than the last.

A great and practical guide: http://kbroman.org/steps2rr/

Reproducible research example

A typical project/paper directory for me, slightly idealized:

1-download.R
2-prep_data.R
3-analyze_data.R
4-manuscript_report.Rmd
logs/
output/
rawdata/

This directory contains scripts that are backed up both locally and remotely. It is under version control, so it’s easy to track changes over time.

There’s also targets and make, i.e. fully automated analysis workflows…but that’s a topic for another day.

Filtering and cleaning data

gapminder

In honor of the late Hans Rosling, we’ll use the gapminder dataset.

library(gapminder)
gapminder

## # A tibble: 1,704 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ℹ 1,694 more rows

Read the help page about gapminder.

Pipes and pipelines

The magrittr package (used by both dplyr and tidyr) provides the %>% operator, which allows us to pipe an object forward into a function or call expression.

Note that x %>% f is usually equivalent to f(x).

print(gapminder)

library(dplyr)
gapminder %>% print()
gagminder %>% head()
gapminder %>% head(n=20)
gapminder %>%
  print() %>%
  summary()    # what is non-piped equivalent?
summary(print(gapminder))

RStudio has a keyboard shortcut for this.

R now (as of v4.1) has its own built-in pipe operator: |>.

dplyr

The dplyr package uses verbs (functions) to operate on tibbles (data frames). When using pipes, this looks like:

some_data_frame %>%

  do_something() %>%

  do_something_else() %>%

  getting_dizzy_from_so_much_doing()

As noted above, pipelines have tradeoffs like any other tool. But they do provide easy-to-read algorithmic steps in many cases.

Let’s go over some of those possible do_something steps.

Filter

Very commonly used, as it lets you filter a dataset by one or more conditions.

gapminder %>% filter(country == "Egypt")
gapminder %>% filter(country == "Egypt", year > 2000) # AND
gapminder %>% filter(country == "Egypt" | year > 2000) # OR
gapminder %>% 
    filter(country %in% c("Egypt", "New Zealand", "Chad"))

This is equivalent to base R’s subset function.

subset(gapminder, country == "Egypt")

Filter

library(ggplot2)
gapminder %>%
  filter(year == 1997) %>%
  ggplot(aes(gdpPercap, lifeExp, size = pop, color = continent)) +
  geom_point() +
  scale_x_log10()

Select

Also extremely useful. Note the different notations for selecting columns:

select(gapminder, pop, year)
gapminder %>% select(pop, year)
# Base R equivalent: gapminder[c("pop", "year")]
gapminder %>% select(-lifeExp, -gdpPercap)
gapminder %>% select(-1)

There are lots of other cool ways to select columns–see ?select.

gapminder %>% select(starts_with("c"))
gapminder %>% select(where(is.numeric))

In base R:

gapminder[c("pop", "year"),]

Exercise: filtering and selecting

Let’s focus on a single country’s data for a bit. Write a pipeline that picks out Egypt data only, removes the continent and country columns, and assigns the result to a variable Egypt. How many rows does the resulting dataset have?

Exercise: filtering and selecting

gapminder %>%
  filter(country == "Egypt") %>%
  select(-continent, -country) ->
  Egypt

## # A tibble: 12 × 4
##     year lifeExp      pop gdpPercap
##    <int>   <dbl>    <int>     <dbl>
##  1  1952    41.9 22223309     1419.
##  2  1957    44.4 25009741     1459.
##  3  1962    47.0 28173309     1693.
##  4  1967    49.3 31681188     1815.
##  5  1972    51.1 34807417     2024.
##  6  1977    53.3 38783863     2785.
##  7  1982    56.0 45681811     3504.
##  8  1987    59.8 52799062     3885.
##  9  1992    63.7 59402198     3795.
## 10  1997    67.2 66134291     4173.
## 11  2002    69.8 73312559     4755.
## 12  2007    71.3 80264543     5581.

Uniting and separating

These functions can be very useful.

library(tidyr)
gapminder %>% unite(coco, country, continent)

## # A tibble: 1,704 × 5
##    coco              year lifeExp      pop gdpPercap
##    <chr>            <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan_Asia  1952    28.8  8425333      779.
##  2 Afghanistan_Asia  1957    30.3  9240934      821.
##  3 Afghanistan_Asia  1962    32.0 10267083      853.
##  4 Afghanistan_Asia  1967    34.0 11537966      836.
##  5 Afghanistan_Asia  1972    36.1 13079460      740.
##  6 Afghanistan_Asia  1977    38.4 14880372      786.
##  7 Afghanistan_Asia  1982    39.9 12881816      978.
##  8 Afghanistan_Asia  1987    40.8 13867957      852.
##  9 Afghanistan_Asia  1992    41.7 16317921      649.
## 10 Afghanistan_Asia  1997    41.8 22227415      635.
## # ℹ 1,694 more rows

gapminder %>%
  unite(coco, country, continent) %>%
  separate(coco,
           into = c("country", "continent"),
           sep = "_")

## # A tibble: 1,704 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <chr>       <chr>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ℹ 1,694 more rows

Renaming and mutating

The mutate function in particular is used a lot in dplyr pipelines.

Egypt %>%
    rename(population = pop) %>% 
    mutate(logpop = log(population))

## # A tibble: 12 × 5
##     year lifeExp population gdpPercap logpop
##    <int>   <dbl>      <int>     <dbl>  <dbl>
##  1  1952    41.9   22223309     1419.   16.9
##  2  1957    44.4   25009741     1459.   17.0
##  3  1962    47.0   28173309     1693.   17.2
##  4  1967    49.3   31681188     1815.   17.3
##  5  1972    51.1   34807417     2024.   17.4
##  6  1977    53.3   38783863     2785.   17.5
##  7  1982    56.0   45681811     3504.   17.6
##  8  1987    59.8   52799062     3885.   17.8
##  9  1992    63.7   59402198     3795.   17.9
## 10  1997    67.2   66134291     4173.   18.0
## 11  2002    69.8   73312559     4755.   18.1
## 12  2007    71.3   80264543     5581.   18.2

# base R:
# gapminder$logpop <- log(pop)
# colnames(gapminder)[colnames(gapminder) == "pop"] <- "population"

Mutating

Several important notes:

You can have multiple assignments within a single mutate; this is more efficient than separate mutate calls
New variables overwrite existing variables of the same name

# This will not work correctly
gapminder %>% 
    mutate(pop = mean(pop),
           pop_sd = sd(pop))

Variables can be removed by setting their value to NULL
There’s also transmute, which adds new variables and drops existing ones

Mutating

Several important notes:

Row-wise operations require special handling:

df <- tibble(x = 1:2, y = 3:4, z = 5:6)
df %>% mutate(m = mean(c(x, y, z)))

## # A tibble: 2 × 4
##       x     y     z     m
##   <int> <int> <int> <dbl>
## 1     1     3     5   3.5
## 2     2     4     6   3.5

df %>% rowwise() %>% mutate(m = mean(c(x, y, z)))

## # A tibble: 2 × 4
## # Rowwise: 
##       x     y     z     m
##   <int> <int> <int> <dbl>
## 1     1     3     5     3
## 2     2     4     6     4

Note that for a large dataset, base R’s rowMeans function will be much faster than this, however. Use the best tool for the job!

Data cleaning

Two handy tidyr functions allow you to fill data based on adjacent values, and fill in NA values.

people <- tibble(who = c("Alice", "Bob", "Carol"), age = c(25, NA, 45))
people %>% fill(age, .direction = "down")

## # A tibble: 3 × 2
##   who     age
##   <chr> <dbl>
## 1 Alice    25
## 2 Bob      25
## 3 Carol    45

people %>% replace_na(list(age = 99))

## # A tibble: 3 × 2
##   who     age
##   <chr> <dbl>
## 1 Alice    25
## 2 Bob      99
## 3 Carol    45

Reshaping and joining data

Reshaping (“pivoting”) data

Let’s start with the simple people data frame, but adding another column:

people$height <- c(160, 170, 180)
people

## # A tibble: 3 × 3
##   who     age height
##   <chr> <dbl>  <dbl>
## 1 Alice    25    160
## 2 Bob      NA    170
## 3 Carol    45    180

Note that there is one column of metadata (name) and two of data (age and height).

Make data longer

Let’s say we want to put people into long (or tidy) format–where every row is a different observation. In other words, we need to collapse multiple columns of data into a single one.

For this we use tidyr::pivot_longer, which at a minimum needs to know: what’s the data source, and what column(s) do we want to pivot?

people %>% pivot_longer(c("age", "height"))

## # A tibble: 6 × 3
##   who   name   value
##   <chr> <chr>  <dbl>
## 1 Alice age       25
## 2 Alice height   160
## 3 Bob   age       NA
## 4 Bob   height   170
## 5 Carol age       45
## 6 Carol height   180

(Note base R has a reshape function, but I find it difficult to use.)

Make data longer

We could also say “pivot everything except for the ‘who’ column”, or provide custom column names in the resulting data frame:

people %>% pivot_longer(-who)

## # A tibble: 6 × 3
##   who   name   value
##   <chr> <chr>  <dbl>
## 1 Alice age       25
## 2 Alice height   160
## 3 Bob   age       NA
## 4 Bob   height   170
## 5 Carol age       45
## 6 Carol height   180

people %>% pivot_longer(-who, names_to = "variable", values_to = "datum")

## # A tibble: 6 × 3
##   who   variable datum
##   <chr> <chr>    <dbl>
## 1 Alice age         25
## 2 Alice height     160
## 3 Bob   age         NA
## 4 Bob   height     170
## 5 Carol age         45
## 6 Carol height     180

Make data wider

To transform data into a wider form, pivot_wider needs to know at least two things: which column holds the new column names, and which the data.

people %>% 
    pivot_longer(-who) %>%
    pivot_wider(names_from = "name")

## # A tibble: 3 × 3
##   who     age height
##   <chr> <dbl>  <dbl>
## 1 Alice    25    160
## 2 Bob      NA    170
## 3 Carol    45    180

Here tidyr has guessed that if the names of the new columns are coming from “name”, then the values are probably coming from the numeric column “value”.

And so we get our original people data frame back.

What happens if we take column names from the “who” column?

Make data wider

Pivoting data wider may result in NA values if not every row-column combination is present in the dataset:

people %>% 
    pivot_wider(names_from = "who", values_from = "age")

## # A tibble: 3 × 4
##   height Alice   Bob Carol
##    <dbl> <dbl> <dbl> <dbl>
## 1    160    25    NA    NA
## 2    170    NA    NA    NA
## 3    180    NA    NA    45

people %>% 
    pivot_wider(names_from = "who", values_from = "age", values_fill = -1)

## # A tibble: 3 × 4
##   height Alice   Bob Carol
##    <dbl> <dbl> <dbl> <dbl>
## 1    160    25    -1    -1
## 2    170    -1    NA    -1
## 3    180    -1    -1    45

Reshaping gapminder data

Time for something a bit more complex. Back to our gapminder-derived Egypt dataset!

Egypt %>% 
    pivot_longer(c(lifeExp, pop, gdpPercap))

## # A tibble: 36 × 3
##     year name           value
##    <int> <chr>          <dbl>
##  1  1952 lifeExp         41.9
##  2  1952 pop       22223309  
##  3  1952 gdpPercap     1419. 
##  4  1957 lifeExp         44.4
##  5  1957 pop       25009741  
##  6  1957 gdpPercap     1459. 
##  7  1962 lifeExp         47.0
##  8  1962 pop       28173309  
##  9  1962 gdpPercap     1693. 
## 10  1967 lifeExp         49.3
## # ℹ 26 more rows

Reshaping gapminder data

Egypt %>%
  pivot_longer(-year) %>%
  ggplot(aes(year, value)) + geom_line() +
   facet_wrap(~name, scales = "free")

Experiment. Why do these do what they do?

Egypt %>% pivot_longer(lifeExp)
Egypt %>% pivot_longer(-lifeExp)

Exercise: reshaping

Our Egypt dataset has one row per year, and three data columns. Write code to transform it to have one row per datum, and each year in a different column. Do this for post-2000 data only.

Hint: start by filtering and then pivoting it into long form.

Exercise: reshaping

Our Egypt dataset has one row per year, and three data columns. Write code to transform it to have one row per datum, and each year in a different column. Do this for post-2000 data only.

Hint: start by filtering and then pivoting it into long form.

Egypt %>% 
    filter(year > 2000) %>% 
    pivot_longer(-year) %>% 
    pivot_wider(names_from = "year")

## # A tibble: 3 × 3
##   name          `2002`     `2007`
##   <chr>          <dbl>      <dbl>
## 1 lifeExp         69.8       71.3
## 2 pop       73312559   80264543  
## 3 gdpPercap     4755.      5581.

Joins

Frequently we have more than one data source:

people

## # A tibble: 3 × 3
##   who     age height
##   <chr> <dbl>  <dbl>
## 1 Alice    25    160
## 2 Bob      NA    170
## 3 Carol    45    180

hometowns <- tibble(who = c("Alice", "Bob", "Dave"),
                    where = c("Abilene", "Billsville", "Darien"))
hometowns

## # A tibble: 3 × 2
##   who   where     
##   <chr> <chr>     
## 1 Alice Abilene   
## 2 Bob   Billsville
## 3 Dave  Darien

Here, we’d like to merge (the base R function) or join these together.

Left joins

Probably the most commonly used join is left_join. This works by including all rows in the first data frame, and any matches it finds from the second:

people %>% left_join(hometowns)

## Joining with `by = join_by(who)`

## # A tibble: 3 × 4
##   who     age height where     
##   <chr> <dbl>  <dbl> <chr>     
## 1 Alice    25    160 Abilene   
## 2 Bob      NA    170 Billsville
## 3 Carol    45    180 <NA>

In other words, it’s a lookup.

Note the message. It’s good practice to specify the common columns that joins should use:

people %>% left_join(hometowns, by = "who")

Left joins

The order matters. Why do these produce different results?

people %>% left_join(hometowns, by = "who")

## # A tibble: 3 × 4
##   who     age height where     
##   <chr> <dbl>  <dbl> <chr>     
## 1 Alice    25    160 Abilene   
## 2 Bob      NA    170 Billsville
## 3 Carol    45    180 <NA>

hometowns %>% left_join(people, by = "who")

## # A tibble: 3 × 4
##   who   where        age height
##   <chr> <chr>      <dbl>  <dbl>
## 1 Alice Abilene       25    160
## 2 Bob   Billsville    NA    170
## 3 Dave  Darien        NA     NA

Other joins

There are many different join operations available.

people %>% right_join(hometowns, by = "who")

## # A tibble: 3 × 4
##   who     age height where     
##   <chr> <dbl>  <dbl> <chr>     
## 1 Alice    25    160 Abilene   
## 2 Bob      NA    170 Billsville
## 3 Dave     NA     NA Darien

people %>% inner_join(hometowns, by = "who")

## # A tibble: 2 × 4
##   who     age height where     
##   <chr> <dbl>  <dbl> <chr>     
## 1 Alice    25    160 Abilene   
## 2 Bob      NA    170 Billsville

people %>% full_join(hometowns, by = "who")

## # A tibble: 4 × 4
##   who     age height where     
##   <chr> <dbl>  <dbl> <chr>     
## 1 Alice    25    160 Abilene   
## 2 Bob      NA    170 Billsville
## 3 Carol    45    180 <NA>      
## 4 Dave     NA     NA Darien

# anti_join returns all rows from x without a match in y;
# semi_join returns all rows from x with a match
people %>% anti_join(hometowns, by = "who")

## # A tibble: 1 × 3
##   who     age height
##   <chr> <dbl>  <dbl>
## 1 Carol    45    180

Summarizing and manipulating data

Thinking back to the typical data pipeline, we often want to summarize data by groups as an intermediate or final step. For example, for each subgroup we might want to:

Compute mean, max, min, etc. (n → 1)
Compute rolling mean and other window functions (n → n)
Fit models and extract their parameters, goodness of fit, etc.

Specific examples:

gapminder: what’s the year of maximum GDP for each country?
babynames: what’s the most common name over time?

Split-apply-combine

These are generally known as split-apply-combine problems.

https://github.com/ramnathv/rblocks/issues/8

Why we’re focusing on dplyr

There are many ways to tackle split-apply-combine in R.

The dplyr package specializes in data frames, but also allows you to work with remote, out-of-memory databases, using exactly the same tools, because it abstracts away how your data is stored.

dplyr is very fast for most, though not all, operations on data frames (tabular data). But again, like any tool, it has some disadvantages too.

https://github.com/tidyverse/dplyr

Grouping

dplyr verbs (functions) become particularly powerful when used in conjunction with groups we define in the dataset. Grouping doesn’t change the data, but instead groups it in preparation for the next operation we perform.

gapminder %>%
  group_by(country)

## # A tibble: 1,704 × 6
## # Groups:   country [142]
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ℹ 1,694 more rows

In base R:

split(gapminder, ~ pop + year) # but this creates a new object

Summarising

Most frequently, that “next operation” is a

summarise (n → 1), or
mutate (n → n):

gapminder %>%
  group_by(country) %>%
  summarise(maxpop = max(pop))

## # A tibble: 142 × 2
##    country        maxpop
##    <fct>           <int>
##  1 Afghanistan  31889923
##  2 Albania       3600523
##  3 Algeria      33333216
##  4 Angola       12420476
##  5 Argentina    40301927
##  6 Australia    20434176
##  7 Austria       8199783
##  8 Bahrain        708573
##  9 Bangladesh  150448339
## 10 Belgium      10392226
## # ℹ 132 more rows

In base R:

aggregate(. ~ country, data = gapminder, max)

Summarising and columns

gapminder %>%
  group_by(continent) %>%
  summarise(meanLifeExp = mean(lifeExp),
            maxpop = max(pop))

## # A tibble: 5 × 3
##   continent meanLifeExp     maxpop
##   <fct>           <dbl>      <int>
## 1 Africa           48.9  135031164
## 2 Americas         64.7  301139947
## 3 Asia             60.1 1318683096
## 4 Europe           71.9   82400996
## 5 Oceania          74.3   20434176

Notice that the only columns left, after summarising, are:

The grouping variable(s), and
The summarised ones.

Summarising and groups

By default, every summary operation removes the last grouping variable, while leaving the others intact:

gapminder %>%
  group_by(continent, country) %>%
  summarise(max_LifeExp = max(lifeExp))

## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.

## # A tibble: 142 × 3
## # Groups:   continent [5]
##    continent country                  max_LifeExp
##    <fct>     <fct>                          <dbl>
##  1 Africa    Algeria                         72.3
##  2 Africa    Angola                          42.7
##  3 Africa    Benin                           56.7
##  4 Africa    Botswana                        63.6
##  5 Africa    Burkina Faso                    52.3
##  6 Africa    Burundi                         49.6
##  7 Africa    Cameroon                        55.0
##  8 Africa    Central African Republic        50.5
##  9 Africa    Chad                            51.7
## 10 Africa    Comoros                         65.2
## # ℹ 132 more rows

This makes it easy to do certain kinds of multi-step operations; for example, if we want to know what the mean maximum country life expectancy is by continent:

gapminder %>%
  group_by(continent, country) %>%
  summarise(max_LifeExp = max(lifeExp), .groups = "drop_last") %>% 
  summarise(mean(max_LifeExp))

## # A tibble: 5 × 2
##   continent `mean(max_LifeExp)`
##   <fct>                   <dbl>
## 1 Africa                   57.4
## 2 Americas                 73.6
## 3 Asia                     71.0
## 4 Europe                   77.7
## 5 Oceania                  80.7

You can also re-group as needed–See the .groups argument–or ungroup.

Operating on multiple columns and/or functions

We can apply a function to multiple columns…

gapminder %>%
  group_by(country) %>%
  summarise(across(lifeExp:gdpPercap, max))

## # A tibble: 142 × 4
##    country     lifeExp       pop gdpPercap
##    <fct>         <dbl>     <int>     <dbl>
##  1 Afghanistan    43.8  31889923      978.
##  2 Albania        76.4   3600523     5937.
##  3 Algeria        72.3  33333216     6223.
##  4 Angola         42.7  12420476     5523.
##  5 Argentina      75.3  40301927    12779.
##  6 Australia      81.2  20434176    34435.
##  7 Austria        79.8   8199783    36126.
##  8 Bahrain        75.6    708573    29796.
##  9 Bangladesh     64.1 150448339     1391.
## 10 Belgium        79.4  10392226    33693.
## # ℹ 132 more rows

…or multiple functions to one or more columns:

gapminder %>%
  group_by(country) %>%
  summarise(across(pop, c(min, max)))

## # A tibble: 142 × 3
##    country        pop_1     pop_2
##    <fct>          <int>     <int>
##  1 Afghanistan  8425333  31889923
##  2 Albania      1282697   3600523
##  3 Algeria      9279525  33333216
##  4 Angola       4232095  12420476
##  5 Argentina   17876956  40301927
##  6 Australia    8691212  20434176
##  7 Austria      6927772   8199783
##  8 Bahrain       120447    708573
##  9 Bangladesh  46886859 150448339
## 10 Belgium      8730405  10392226
## # ℹ 132 more rows

More complex summaries

We now have the tools to build up a long pipeline to, e.g., compute the min and max for all numeric variables and produce a table with continent as columns headers, and variable (gdpPercap, lifeExp, pop) and statistic as rows.

# We define this function to end up with nice column names
min_max <- list(
  min = ~min(.x, na.rm = TRUE), 
  max = ~max(.x, na.rm = TRUE)
)

gapminder %>% 
    filter(continent %in% c("Africa", "Americas", "Asia")) %>% 
    group_by(continent) %>% 
    # compute min and max for the numeric columns
    summarise(across(where(is.numeric), min_max)) %>% 
    pivot_longer(-continent) %>% 
    separate(name, into = c("variable", "stat")) %>%
    pivot_wider(names_from = "continent", values_from = "value")

## # A tibble: 8 × 5
##   variable  stat       Africa    Americas         Asia
##   <chr>     <chr>       <dbl>       <dbl>        <dbl>
## 1 year      min        1952        1952         1952  
## 2 year      max        2007        2007         2007  
## 3 lifeExp   min          23.6        37.6         28.8
## 4 lifeExp   max          76.4        80.7         82.6
## 5 pop       min       60011      662850       120447  
## 6 pop       max   135031164   301139947   1318683096  
## 7 gdpPercap min         241.       1202.         331  
## 8 gdpPercap max       21951.      42952.      113523.

Introducing `babynames`

Explore babynames a bit. How many rows, columns does it have? How many unique names?

library(babynames)
babynames

## # A tibble: 1,924,665 × 5
##     year sex   name          n   prop
##    <dbl> <chr> <chr>     <int>  <dbl>
##  1  1880 F     Mary       7065 0.0724
##  2  1880 F     Anna       2604 0.0267
##  3  1880 F     Emma       2003 0.0205
##  4  1880 F     Elizabeth  1939 0.0199
##  5  1880 F     Minnie     1746 0.0179
##  6  1880 F     Margaret   1578 0.0162
##  7  1880 F     Ida        1472 0.0151
##  8  1880 F     Alice      1414 0.0145
##  9  1880 F     Bertha     1320 0.0135
## 10  1880 F     Sarah      1288 0.0132
## # ℹ 1,924,655 more rows

Summarizing babynames

What does this calculate?

babynames %>%
    group_by(year, sex) %>%
    slice_max(prop)

## # A tibble: 276 × 5
## # Groups:   year, sex [276]
##     year sex   name      n   prop
##    <dbl> <chr> <chr> <int>  <dbl>
##  1  1880 F     Mary   7065 0.0724
##  2  1880 M     John   9655 0.0815
##  3  1881 F     Mary   6919 0.0700
##  4  1881 M     John   8769 0.0810
##  5  1882 F     Mary   8148 0.0704
##  6  1882 M     John   9557 0.0783
##  7  1883 F     Mary   8012 0.0667
##  8  1883 M     John   8894 0.0791
##  9  1884 F     Mary   9217 0.0670
## 10  1884 M     John   9388 0.0765
## # ℹ 266 more rows

The various slice_ functions are quite handy.

Summarizing babynames

https://en.wikipedia.org/wiki/Linda_(1946_song)

Exercise: the `babynames` dataset

Load the dataset using library(babynames).

Read its help page. Look at its structure (rows, columns, summary).

Use dplyr to calculate the total number of names in the SSA database for each year and sex, with each sex’s counts in a separate column. Hint: n().

Make a graph or table showing how popular YOUR name has been over time (either its proportion, or rank).

Exercise: the `babynames` dataset

Load the dataset using library(babynames).

Read its help page. Look at its structure (rows, columns, summary).

Use dplyr to calculate the total number of names in the SSA database for each year and sex, with each sex’s counts in a separate column. Hint: n().

Make a graph or table showing how popular YOUR name has been over time (either its proportion, or rank).

babynames %>% 
    group_by(year, sex) %>% 
    summarise(n = n()) %>% 
    pivot_wider(names_from = "sex", values_from = "n")

babynames %>%
    filter(name == "Benjamin") %>%
    ggplot(aes(year, n, color = sex)) +
    geom_point() +
    ggtitle("Benjamin")

Another example: ranking

babynames %>%
    group_by(year, sex) %>%
    mutate(rank = row_number(desc(n))) %>%
    filter(name %in% c("Benjamin", "Rachel")) %>%
    ggplot(aes(year, rank, color = name, shape = sex)) +
    geom_point() + scale_y_log10()

    ggtitle("Benjamin and Rachel")

## $title
## [1] "Benjamin and Rachel"
## 
## attr(,"class")
## [1] "labels"

Model-fitting

The diamonds dataset is included with ggplot2.

ggplot(diamonds, aes(carat, price, color = cut)) + 
    geom_point() + 
    geom_smooth(method = "lm") +
    scale_x_log10() + scale_y_log10()

Model-fitting

This old code now generates a warning. Leaving it here, in the slides, to make sure to mention dplyr’s penchant for changing (and breaking) behavior.

library(broom) # to sweep up -- tidy -- model outputs
diamonds %>% 
    group_by(cut) %>% 
    do(mod = lm(log(price) ~ log(carat), data = .)) %>% 
    # need to change this to `reframe` (no other change needed, thankfully)
    summarise(cut = cut,
              tidy(mod))

## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## # A tibble: 10 × 6
##    cut       term        estimate std.error statistic p.value
##    <ord>     <chr>          <dbl>     <dbl>     <dbl>   <dbl>
##  1 Fair      (Intercept)     8.19   0.00805    1017.        0
##  2 Fair      log(carat)      1.49   0.0171       87.2       0
##  3 Good      (Intercept)     8.38   0.00398    2102.        0
##  4 Good      log(carat)      1.74   0.00635     274.        0
##  5 Very Good (Intercept)     8.45   0.00266    3179.        0
##  6 Very Good log(carat)      1.73   0.00382     452.        0
##  7 Premium   (Intercept)     8.43   0.00237    3553.        0
##  8 Premium   log(carat)      1.66   0.00355     467.        0
##  9 Ideal     (Intercept)     8.52   0.00237    3596.        0
## 10 Ideal     log(carat)      1.71   0.00310     550.        0

Important things we didn’t talk about

Getting your data into R
Working with non-text/tabular data
Lists
Handling dates and timestamps
Plotting (though we had many examples)
Efficient to high performance computing
Reproducible documents
Building your own package

Check out slide decks at https://rpubs.com/bpbond

But the tools we’ve covered here can do a lot!

Thank you!

Feedback welcome.

The slides for this presentation are available here: https://rpubs.com/bpbond/874167

The repository with the R code that generated the slides is here: https://github.com/bpbond/R-workshops/

Source?

Data manipulation and analysis using tidyverse R

Introduction

The plan

Requirements

A note about ‘tidyverse’

Reproducibility and data management

Reproducible research…

…is the future

You can’t reproduce

Version control

Scripts for data analysis

Reproducibility is a process

Reproducible research example

Filtering and cleaning data

gapminder

Pipes and pipelines

dplyr

Filter

Filter

Select

Exercise: filtering and selecting

Exercise: filtering and selecting

Uniting and separating

Renaming and mutating

Mutating

Mutating

Data cleaning

Reshaping and joining data

Reshaping (“pivoting”) data

Make data longer

Make data longer

Make data wider

Make data wider

Reshaping gapminder data

Reshaping gapminder data

Exercise: reshaping

Exercise: reshaping

Joins

Left joins

Left joins

Other joins

Summarizing data

Summarizing and manipulating data

Split-apply-combine

Why we’re focusing on dplyr

Grouping

Summarising

Summarising and columns

Summarising and groups

Operating on multiple columns and/or functions

More complex summaries

Introducing babynames

Summarizing babynames

Summarizing babynames

Exercise: the babynames dataset

Exercise: the babynames dataset

Another example: ranking

Model-fitting

Model-fitting

Important things we didn’t talk about

The end

Thank you!

Introducing `babynames`

Exercise: the `babynames` dataset

Exercise: the `babynames` dataset