1 Overview

This document will show how to use R package dplyr to do data wrangling fundamentals for one and two data frames. (If you’re used to SQL, a data frame is just a table.) Throughout this, I’ll show code for accomplishing a task with dplyr, but I’ll also sometimes show how to accomplish the task with base R (in case you’re like me and learned base R first). Before beginning, install the package dplyr by using the GUI or by running the install.packages code that I’ve commented out below.

# install.packages("dplyr")
library(dplyr)

## Warning: package 'dplyr' was built under R version 4.1.1

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

First, we will focus on data wrangling using a single data frame. After that, we will join two data frames.

2 One Data Frame: Wrangling

For the first section, we’ll use the penguins data from the palmerpenguins package. Read the data documentation and install the package before beginning.

library(palmerpenguins)

## Warning: package 'palmerpenguins' was built under R version 4.1.1

glimpse(penguins)

## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse~
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
## $ sex               <fct> male, female, female, NA, female, male, female, male~
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~

We’ll do the basics first, and then we’ll create more complex sequences of commands. When we have a single data frame, we can think of dplyr mostly operating on rows and columns. A big exception is when we start grouping by a factor (a categorical variable).

2.1 Row Operations

Row operations include slicing, filtering, and arranging (like ORDER BY in SQL).

2.1.1 Slice

When you slice the data, you snag a few rows. Here I grab rows 3 through 6.

penguins %>% slice(3:6)

## # A tibble: 4 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge~           40.3          18                195        3250 fema~
## 2 Adelie  Torge~           NA            NA                 NA          NA <NA> 
## 3 Adelie  Torge~           36.7          19.3              193        3450 fema~
## 4 Adelie  Torge~           39.3          20.6              190        3650 male 
## # ... with 1 more variable: year <int>

If you’re used to base R, then slicing is an alternative to indexing rows using square brackets, like this:

penguins[3:6,]

## # A tibble: 4 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge~           40.3          18                195        3250 fema~
## 2 Adelie  Torge~           NA            NA                 NA          NA <NA> 
## 3 Adelie  Torge~           36.7          19.3              193        3450 fema~
## 4 Adelie  Torge~           39.3          20.6              190        3650 male 
## # ... with 1 more variable: year <int>

Quick comment: we all love cake, but don’t think of cake slices because you slice a cake vertically, and these slices are cutting the data horizontally.

You can use two special commands to see the top or bottom of the data. These are the base R analogs of the head and tail commands, respectively. Just specify how many rows from the top (or bottom) you want.

penguins %>% slice_head(n=4)

## # A tibble: 4 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge~           39.1          18.7              181        3750 male 
## 2 Adelie  Torge~           39.5          17.4              186        3800 fema~
## 3 Adelie  Torge~           40.3          18                195        3250 fema~
## 4 Adelie  Torge~           NA            NA                 NA          NA <NA> 
## # ... with 1 more variable: year <int>

penguins %>% slice_tail(n=5)

## # A tibble: 5 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Chinst~ Dream            55.8          19.8              207        4000 male 
## 2 Chinst~ Dream            43.5          18.1              202        3400 fema~
## 3 Chinst~ Dream            49.6          18.2              193        3775 male 
## 4 Chinst~ Dream            50.8          19                210        4100 male 
## 5 Chinst~ Dream            50.2          18.7              198        3775 fema~
## # ... with 1 more variable: year <int>

Finally, our last slice command takes randomly selected rows. By default, this will sample without replacement. Here we are 4 randomly selected rows.

penguins %>% slice_sample(n=4)

## # A tibble: 4 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Chinst~ Dream            46.8          16.5              189        3650 fema~
## 2 Adelie  Dream            44.1          19.7              196        4400 male 
## 3 Gentoo  Biscoe           44            13.6              208        4350 fema~
## 4 Chinst~ Dream            52            18.1              201        4050 male 
## # ... with 1 more variable: year <int>

If you want to sample with replacement, add the replace = TRUE argument to the command.

penguins %>% slice_sample(n = 3, replace = TRUE)

## # A tibble: 3 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Gentoo  Biscoe           49.8          15.9              229        5950 male 
## 2 Gentoo  Biscoe           43.6          13.9              217        4900 fema~
## 3 Adelie  Biscoe           40.6          18.8              193        3800 male 
## # ... with 1 more variable: year <int>

Advanced note: if you want to create a bootstrap sample, use slice_sample with the original sample size, and save your slice. With tidyverse, I like the forwards arrow for assignments because it follows the usual flow of data (tidyverse code flows to the right).

penguins %>% summarize(n()) -> npenguins
as.numeric(npenguins)

## [1] 344

penguins %>% slice_sample(n = as.numeric(npenguins), replace = TRUE) -> bootstrapsample
glimpse(bootstrapsample)

## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Chinstrap, Gentoo, Gentoo, Adelie, A~
## $ island            <fct> Biscoe, Dream, Dream, Biscoe, Biscoe, Dream, Dream, ~
## $ bill_length_mm    <dbl> 39.0, 39.5, 52.0, 48.4, 43.6, 39.7, 39.5, 38.9, 42.5~
## $ bill_depth_mm     <dbl> 17.5, 17.8, 18.1, 16.3, 13.9, 17.9, 16.7, 18.8, 17.3~
## $ flipper_length_mm <int> 186, 188, 201, 220, 217, 193, 178, 190, 187, 215, 21~
## $ body_mass_g       <int> 3550, 3300, 4050, 5400, 4900, 4250, 3250, 3600, 3350~
## $ sex               <fct> female, female, male, male, female, male, female, fe~
## $ year              <int> 2008, 2007, 2007, 2008, 2008, 2009, 2007, 2008, 2009~

2.1.2 Filter

When you filter your rows, you choose rows based on some column values. For example, you might want to see only the rows for penguins from Dream island. This will output a lot of rows, so I’m going to just get a glimpse of the data.

penguins %>% filter(island == "Dream") %>% glimpse()

## Rows: 124
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
## $ island            <fct> Dream, Dream, Dream, Dream, Dream, Dream, Dream, Dre~
## $ bill_length_mm    <dbl> 39.5, 37.2, 39.5, 40.9, 36.4, 39.2, 38.8, 42.2, 37.6~
## $ bill_depth_mm     <dbl> 16.7, 18.1, 17.8, 18.9, 17.0, 21.1, 20.0, 18.5, 19.3~
## $ flipper_length_mm <int> 178, 178, 188, 184, 195, 196, 190, 180, 181, 184, 18~
## $ body_mass_g       <int> 3250, 3900, 3300, 3900, 3325, 4150, 3950, 3550, 3300~
## $ sex               <fct> female, male, female, male, female, male, male, fema~
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~

Here’s a glimpse of the penguins with bill lengths exceeding 30 mm.

penguins %>% filter(bill_length_mm < 35) %>% glimpse()

## Rows: 9
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
## $ island            <fct> Torgersen, Torgersen, Torgersen, Biscoe, Torgersen, ~
## $ bill_length_mm    <dbl> 34.1, 34.6, 34.4, 34.5, 33.5, 34.6, 34.0, 33.1, 32.1
## $ bill_depth_mm     <dbl> 18.1, 21.1, 18.4, 18.1, 19.0, 17.2, 17.1, 16.1, 15.5
## $ flipper_length_mm <int> 193, 198, 184, 187, 190, 189, 185, 178, 188
## $ body_mass_g       <int> 3475, 4400, 3325, 2900, 3600, 3200, 3400, 2900, 3050
## $ sex               <fct> NA, male, female, female, female, female, female, fe~
## $ year              <int> 2007, 2007, 2007, 2008, 2008, 2008, 2008, 2008, 2009

This is just like square bracket indexing in base R or subsetting with the subset command.

head(penguins[penguins$island == "Dream",])

## # A tibble: 6 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Dream            39.5          16.7              178        3250 fema~
## 2 Adelie  Dream            37.2          18.1              178        3900 male 
## 3 Adelie  Dream            39.5          17.8              188        3300 fema~
## 4 Adelie  Dream            40.9          18.9              184        3900 male 
## 5 Adelie  Dream            36.4          17                195        3325 fema~
## 6 Adelie  Dream            39.2          21.1              196        4150 male 
## # ... with 1 more variable: year <int>

head(subset(penguins, island == "Dream"))

## # A tibble: 6 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Dream            39.5          16.7              178        3250 fema~
## 2 Adelie  Dream            37.2          18.1              178        3900 male 
## 3 Adelie  Dream            39.5          17.8              188        3300 fema~
## 4 Adelie  Dream            40.9          18.9              184        3900 male 
## 5 Adelie  Dream            36.4          17                195        3325 fema~
## 6 Adelie  Dream            39.2          21.1              196        4150 male 
## # ... with 1 more variable: year <int>

You can filter based on a couple criteria by using another pipe and another filter.

penguins %>% filter(island == "Dream") %>% filter(bill_length_mm < 35) %>% glimpse()

## Rows: 3
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie
## $ island            <fct> Dream, Dream, Dream
## $ bill_length_mm    <dbl> 34.0, 33.1, 32.1
## $ bill_depth_mm     <dbl> 17.1, 16.1, 15.5
## $ flipper_length_mm <int> 185, 178, 188
## $ body_mass_g       <int> 3400, 2900, 3050
## $ sex               <fct> female, female, female
## $ year              <int> 2008, 2008, 2009

2.1.3 Arrange

Use arrange to order the rows of the data. This is like ORDER BY in SQL. Let’s order the penguins from lightest to heaviest. Then I’ll slice the head. This will give us the five lightest penguins.

penguins %>% arrange(body_mass_g) %>% slice_head(n=5)

## # A tibble: 5 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Chinst~ Dream            46.9          16.6              192        2700 fema~
## 2 Adelie  Biscoe           36.5          16.6              181        2850 fema~
## 3 Adelie  Biscoe           36.4          17.1              184        2850 fema~
## 4 Adelie  Biscoe           34.5          18.1              187        2900 fema~
## 5 Adelie  Dream            33.1          16.1              178        2900 fema~
## # ... with 1 more variable: year <int>

Let’s order the penguins from heaviest to lightest, and use this to find the five heaviest penguins

penguins %>% arrange(desc(body_mass_g)) %>% slice_head(n=5)

## # A tibble: 5 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Gentoo  Biscoe           49.2          15.2              221        6300 male 
## 2 Gentoo  Biscoe           59.6          17                230        6050 male 
## 3 Gentoo  Biscoe           51.1          16.3              220        6000 male 
## 4 Gentoo  Biscoe           48.8          16.2              222        6000 male 
## 5 Gentoo  Biscoe           45.2          16.4              223        5950 male 
## # ... with 1 more variable: year <int>

We can add a second variable to help in the event of ties.

penguins %>% arrange(body_mass_g, flipper_length_mm) %>% slice_head(n=5)

## # A tibble: 5 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Chinst~ Dream            46.9          16.6              192        2700 fema~
## 2 Adelie  Biscoe           36.5          16.6              181        2850 fema~
## 3 Adelie  Biscoe           36.4          17.1              184        2850 fema~
## 4 Adelie  Dream            33.1          16.1              178        2900 fema~
## 5 Adelie  Biscoe           34.5          18.1              187        2900 fema~
## # ... with 1 more variable: year <int>

2.2 Column Operations

Column operations include selecting variables, renaming variables, mutating variables (creating new variables), and relocating variables within the data-frame.

2.2.1 Select

Selecting variables is a key action in data wrangling. You can use select to create a new data-frame with a subset of the variables, or you can use select to drop variables from a data frame.

First, let’s make a new data frame with just a couple of the variables. Simply list the variables you want to select, with commas separating the variables.

penguins %>% select(bill_length_mm, bill_depth_mm) -> newdata
newdata %>% glimpse()

## Rows: 344
## Columns: 2
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.~
## $ bill_depth_mm  <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.~

Now let’s create a data-frame without a couple variables. In other words, let’s drop a couple variables. To do this, list the variables you want to drop with a minus sign in front of each.

penguins %>% select(-bill_length_mm, -bill_depth_mm) -> newdata

Now we can see we have dropped two variables.

dim(penguins)

## [1] 344   8

dim(newdata)

## [1] 344   6

The last thing we’ll cover is selecting several variables in a row with the colon. Pretty cool!

penguins %>% select(bill_length_mm:body_mass_g, species) %>% dim()

## [1] 344   5

2.2.2 Rename

Let’s rename the bill length and bill depth variables. Put the new variable name on the left side of the = and the old variable name on the right side.

penguins %>% rename(b_length = bill_length_mm, b_depth = bill_depth_mm, f_length = flipper_length_mm) -> penguins
names(penguins)

## [1] "species"     "island"      "b_length"    "b_depth"     "f_length"   
## [6] "body_mass_g" "sex"         "year"

2.2.3 Mutate and Transmute

We use mutate to create new variables. For example, say we want the measurements in inches instead of mm. We need to divide the mm measurements by 25.4 (roughly) to have measurements in inches.

penguins %>% mutate(b_length_in = b_length/25.4, b_depth_in = b_depth/25.4, f_length_in = f_length/25.4) -> penguins2
penguins2 %>% select(b_length:f_length, b_length_in:f_length_in) %>% slice_sample()

## # A tibble: 1 x 6
##   b_length b_depth f_length b_length_in b_depth_in f_length_in
##      <dbl>   <dbl>    <int>       <dbl>      <dbl>       <dbl>
## 1       50    19.5      196        1.97      0.768        7.72

When we used mutate, we created 3 new variables and kept the 8 old variables, giving us a data frame with 11 total variables.

dim(penguins)

## [1] 344   8

dim(penguins2)

## [1] 344  11

If we want the data frame to contain the new variables only, we can use transmute.

penguins %>% transmute(b_length_in = b_length/25.4, b_depth_in = b_depth/25.4, f_length_in = f_length/25.4) %>% names()

## [1] "b_length_in" "b_depth_in"  "f_length_in"

2.2.4 Relocate

You can move variables around in the data-frame using relocate. You can move a variable (or multiple variables) to the front really easily, as this is the default. Just list the variable(s) you want at the front. First, I’ll just move island to the front.

penguins %>% relocate(island) %>% names()

## [1] "island"      "species"     "b_length"    "b_depth"     "f_length"   
## [6] "body_mass_g" "sex"         "year"

Now let’s move a few to the front.

penguins %>% relocate(b_length:f_length) %>% names()

## [1] "b_length"    "b_depth"     "f_length"    "species"     "island"     
## [6] "body_mass_g" "sex"         "year"

You can move a variable ahead of another variable.

penguins %>% relocate(island, .before = year) %>% names()

## [1] "species"     "b_length"    "b_depth"     "f_length"    "body_mass_g"
## [6] "sex"         "island"      "year"

You can move a variable behind another variable.

penguins %>% relocate(island, .after = b_depth) %>% names()

## [1] "species"     "b_length"    "b_depth"     "island"      "f_length"   
## [6] "body_mass_g" "sex"         "year"

You can also move several variables.

penguins %>% relocate(b_length:f_length, .after = year) %>% names()

## [1] "species"     "island"      "body_mass_g" "sex"         "year"       
## [6] "b_length"    "b_depth"     "f_length"

2.3 Group By

If you want to find statistics for each group (such as the mean flipper length for penguins on each island), you will need to group the data before calculating the statistic. Let’s calculate the mean flipper length for penguins on each island to demonstrate grouping. Since we have missing data, I’ve added the na.rm = TRUE argument to the mean command.

penguins %>% group_by(island) %>% summarize(mean(f_length, na.rm = TRUE))

## # A tibble: 3 x 2
##   island    `mean(f_length, na.rm = TRUE)`
##   <fct>                              <dbl>
## 1 Biscoe                              210.
## 2 Dream                               193.
## 3 Torgersen                           191.

I think the column name is a bit clunky, so let’s make it more reader friendly by giving it a name.

penguins %>% group_by(island) %>% summarize(flipper_means = mean(f_length, na.rm = TRUE))

## # A tibble: 3 x 2
##   island    flipper_means
##   <fct>             <dbl>
## 1 Biscoe             210.
## 2 Dream              193.
## 3 Torgersen          191.

This is like using tapply in base R (I kind of hate tapply because the syntax is so clunky) or using lm to find group-wise means.

tapply(penguins$f_length, FUN = mean, INDEX = penguins$island, na.rm = TRUE)

##    Biscoe     Dream Torgersen 
##  209.7066  193.0726  191.1961

lm(f_length ~ 0+ island, data = penguins)

## 
## Call:
## lm(formula = f_length ~ 0 + island, data = penguins)
## 
## Coefficients:
##    islandBiscoe      islandDream  islandTorgersen  
##           209.7            193.1            191.2

If we are curious about the sample size for each island, we can also use grouping.

penguins %>% group_by(island) %>% summarize(n = n())

## # A tibble: 3 x 2
##   island        n
##   <fct>     <int>
## 1 Biscoe      168
## 2 Dream       124
## 3 Torgersen    52

If we want both these in the same table, then include both in the summarize command. I’ve labeled each column.

penguins %>% group_by(island) %>% summarize(n = n(), flipper_means = mean(f_length, na.rm = TRUE))

## # A tibble: 3 x 3
##   island        n flipper_means
##   <fct>     <int>         <dbl>
## 1 Biscoe      168          210.
## 2 Dream       124          193.
## 3 Torgersen    52          191.

We can also group by two variables (such as island AND species). Since island is listed first, it’ll put all the species on Biscoe island first, then it’ll list the species on Dream island, and then it’ll list the species on Torgersen island. In other words, the groups are listed alphabetically by island first, and then by species.

penguins %>% group_by(island, species)  %>% summarize(n = n())

## `summarise()` has grouped output by 'island'. You can override using the `.groups` argument.

## # A tibble: 5 x 3
## # Groups:   island [3]
##   island    species       n
##   <fct>     <fct>     <int>
## 1 Biscoe    Adelie       44
## 2 Biscoe    Gentoo      124
## 3 Dream     Adelie       56
## 4 Dream     Chinstrap    68
## 5 Torgersen Adelie       52

The easiest way to list the groups by species first is to simply change the order of the factors in the group_by command. Now we have the three Adelie rows first.

penguins %>% group_by(species, island)  %>% summarize(n = n())

## `summarise()` has grouped output by 'species'. You can override using the `.groups` argument.

## # A tibble: 5 x 3
## # Groups:   species [3]
##   species   island        n
##   <fct>     <fct>     <int>
## 1 Adelie    Biscoe       44
## 2 Adelie    Dream        56
## 3 Adelie    Torgersen    52
## 4 Chinstrap Dream        68
## 5 Gentoo    Biscoe      124

3 Two Data Frames: Joining

Now we are ready to talk about multiple data frames. You often want to join a couple data frames to create a new data frame.

3.1 Data Prep

For this section, we will use data on beer from a few breweries in Minnesota.

beer <- read.csv("https://cknudson.com/data/MNbeer.csv")
names(beer)

## [1] "Brewery"     "Beer"        "Description" "Style"       "ABV"        
## [6] "IBU"         "Rating"      "Good"

tibble(beer)

## # A tibble: 44 x 8
##    Brewery     Beer         Description           Style   ABV   IBU Rating  Good
##    <chr>       <chr>        <chr>                 <chr> <dbl> <int>  <int> <int>
##  1 Bauhaus     Wonderstuff  New Bohemian Pilsner  Lager   5.4    48     88     0
##  2 Bauhaus     Stargazer    German Style Schwarz~ Lager   5      28     87     0
##  3 Bauhaus     Wagon Party  West Cost Style Lager Lager   5.4    55     86     0
##  4 Bauhaus     Sky-Five!    Midwest Coast IPA     IPA     6.7    70     86     0
##  5 Bent Paddle Kanu         Session Pale Ale      Ale     4.8    48     85     0
##  6 Bent Paddle Venture Pils Pilsner Lager         Lager   5      38     87     0
##  7 Bent Paddle Bent Hop     Golden IPA            IPA     6.2    68     89     0
##  8 Bent Paddle 14* ESB      Extra Special Amber ~ Ale     5.6    32     88     0
##  9 Bent Paddle Black Ale    Black Ale             Ale     6      34     89     0
## 10 Indeed      Day Tripper  Pale Ale              Ale     5.4    45     90     1
## # ... with 34 more rows

Before we can begin, we peek at the data and notice that Brewery and Style are characters and we want them to be factors. What techniques have we learned that would be useful? We can MUTATE the variables.

beer %>% mutate(Brewery = as.factor(Brewery), Style = as.factor(Style)) -> beer
tibble(beer)

## # A tibble: 44 x 8
##    Brewery     Beer         Description           Style   ABV   IBU Rating  Good
##    <fct>       <chr>        <chr>                 <fct> <dbl> <int>  <int> <int>
##  1 Bauhaus     Wonderstuff  New Bohemian Pilsner  Lager   5.4    48     88     0
##  2 Bauhaus     Stargazer    German Style Schwarz~ Lager   5      28     87     0
##  3 Bauhaus     Wagon Party  West Cost Style Lager Lager   5.4    55     86     0
##  4 Bauhaus     Sky-Five!    Midwest Coast IPA     IPA     6.7    70     86     0
##  5 Bent Paddle Kanu         Session Pale Ale      Ale     4.8    48     85     0
##  6 Bent Paddle Venture Pils Pilsner Lager         Lager   5      38     87     0
##  7 Bent Paddle Bent Hop     Golden IPA            IPA     6.2    68     89     0
##  8 Bent Paddle 14* ESB      Extra Special Amber ~ Ale     5.6    32     88     0
##  9 Bent Paddle Black Ale    Black Ale             Ale     6      34     89     0
## 10 Indeed      Day Tripper  Pale Ale              Ale     5.4    45     90     1
## # ... with 34 more rows

Now we can see the levels of the factors.

beer %>% summarize(breweries = levels(Brewery))

##       breweries
## 1       Bauhaus
## 2   Bent Paddle
## 3        Fulton
## 4        Indeed
## 5     Steel Toe
## 6        Summit
## 7         Surly
## 8 Urban Growler

beer %>% summarize(styles = levels(Style))

##   styles
## 1    Ale
## 2    IPA
## 3  Lager

Now our first data frame is ready. This section is about joining two data frames, so we need a second data frame. Let’s create a new data frame. This new data frame has some information on the calories for four styles of beer. (I made this data up so don’t read too much into it.)

Style <- c("Lager", "IPA", "Ale", "Stout")
calories <- c(180, 200, 250, 170 )
cals <- data.frame(Style, calories)
glimpse(cals)

## Rows: 4
## Columns: 2
## $ Style    <chr> "Lager", "IPA", "Ale", "Stout"
## $ calories <dbl> 180, 200, 250, 170

3.2 Inner Join

Let’s make a table. Each row will have one style of beer, and we only want to include the styles that are in BOTH tables. (Because we only want styles listed in BOTH tables, this is an inner join.) The variables we will include are the calories and the mean ABV. We know the calorie information is in the cals data frame and the ABV information is in the beer data frame, so we need to calculate the mean ABV for each style and then join this info with the cals table.

We use what we learned about grouping to create the table containing the mean ABV for each style.

beer %>% group_by(Style) %>% summarize(meanABV = mean(ABV)) -> ABVtable
tibble(ABVtable)

## # A tibble: 3 x 2
##   Style meanABV
##   <fct>   <dbl>
## 1 Ale      5.49
## 2 IPA      6.56
## 3 Lager    5.06

Then we can join the two tables.

ABVtable %>% inner_join(cals)

## Joining, by = "Style"

## # A tibble: 3 x 3
##   Style meanABV calories
##   <chr>   <dbl>    <dbl>
## 1 Ale      5.49      250
## 2 IPA      6.56      200
## 3 Lager    5.06      180

3.3 Left Join

Notice how the table created from inner join has three styles? That’s because BOTH of the original tables have those three styles. If we want all four of the styles in the cals table, we can use a right join.

ABVtable %>% right_join(cals)

## Joining, by = "Style"

## # A tibble: 4 x 3
##   Style meanABV calories
##   <chr>   <dbl>    <dbl>
## 1 Ale      5.49      250
## 2 IPA      6.56      200
## 3 Lager    5.06      180
## 4 Stout   NA         170

We could also do a left join and put cals on the left.

cals %>% left_join(ABVtable)

## Joining, by = "Style"

##   Style calories  meanABV
## 1 Lager      180 5.055556
## 2   IPA      200 6.564706
## 3   Ale      250 5.494444
## 4 Stout      170       NA

Let’s do one more example. Let’s start with our original beer table and add calorie information on. We will do a left join, because we want all of the rows from our original beer table.

beer %>% left_join(cals) %>% slice_sample(n=4)

## Joining, by = "Style"

##         Brewery                Beer               Description Style ABV IBU
## 1        Indeed           Peach Bum India Pale Ale with Peach   IPA 6.2  70
## 2        Fulton                Pils       Continental Pilsner Lager 5.3  30
## 3        Fulton Sweet Child of Vine            India Pale Ale   IPA 6.4  69
## 4 Urban Growler   Cowbell Cream Ale                 Cream Ale   Ale 5.2  20
##   Rating Good calories
## 1     85    0      200
## 2     84    0      180
## 3     81    0      200
## 4     82    0      250

dplyr essentials

Christina Knudson

2022-08-02