This document will show how to use R package dplyr to do data wrangling fundamentals for one and two data frames. (If you’re used to SQL, a data frame is just a table.) Throughout this, I’ll show code for accomplishing a task with dplyr, but I’ll also sometimes show how to accomplish the task with base R (in case you’re like me and learned base R first). Before beginning, install the package dplyr by using the GUI or by running the install.packages code that I’ve commented out below.
# install.packages("dplyr")
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
First, we will focus on data wrangling using a single data frame. After that, we will join two data frames.
For the first section, we’ll use the penguins data from the palmerpenguins package. Read the data documentation and install the package before beginning.
library(palmerpenguins)
## Warning: package 'palmerpenguins' was built under R version 4.1.1
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse~
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
## $ sex <fct> male, female, female, NA, female, male, female, male~
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~
We’ll do the basics first, and then we’ll create more complex sequences of commands. When we have a single data frame, we can think of dplyr mostly operating on rows and columns. A big exception is when we start grouping by a factor (a categorical variable).
Row operations include slicing, filtering, and arranging (like ORDER BY in SQL).
When you slice the data, you snag a few rows. Here I grab rows 3 through 6.
penguins %>% slice(3:6)
## # A tibble: 4 x 8
## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge~ 40.3 18 195 3250 fema~
## 2 Adelie Torge~ NA NA NA NA <NA>
## 3 Adelie Torge~ 36.7 19.3 193 3450 fema~
## 4 Adelie Torge~ 39.3 20.6 190 3650 male
## # ... with 1 more variable: year <int>
If you’re used to base R, then slicing is an alternative to indexing rows using square brackets, like this:
penguins[3:6,]
## # A tibble: 4 x 8
## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge~ 40.3 18 195 3250 fema~
## 2 Adelie Torge~ NA NA NA NA <NA>
## 3 Adelie Torge~ 36.7 19.3 193 3450 fema~
## 4 Adelie Torge~ 39.3 20.6 190 3650 male
## # ... with 1 more variable: year <int>
Quick comment: we all love cake, but don’t think of cake slices because you slice a cake vertically, and these slices are cutting the data horizontally.
You can use two special commands to see the top or bottom of the data. These are the base R analogs of the head and tail commands, respectively. Just specify how many rows from the top (or bottom) you want.
penguins %>% slice_head(n=4)
## # A tibble: 4 x 8
## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge~ 39.1 18.7 181 3750 male
## 2 Adelie Torge~ 39.5 17.4 186 3800 fema~
## 3 Adelie Torge~ 40.3 18 195 3250 fema~
## 4 Adelie Torge~ NA NA NA NA <NA>
## # ... with 1 more variable: year <int>
penguins %>% slice_tail(n=5)
## # A tibble: 5 x 8
## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Chinst~ Dream 55.8 19.8 207 4000 male
## 2 Chinst~ Dream 43.5 18.1 202 3400 fema~
## 3 Chinst~ Dream 49.6 18.2 193 3775 male
## 4 Chinst~ Dream 50.8 19 210 4100 male
## 5 Chinst~ Dream 50.2 18.7 198 3775 fema~
## # ... with 1 more variable: year <int>
Finally, our last slice command takes randomly selected rows. By default, this will sample without replacement. Here we are 4 randomly selected rows.
penguins %>% slice_sample(n=4)
## # A tibble: 4 x 8
## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Chinst~ Dream 46.8 16.5 189 3650 fema~
## 2 Adelie Dream 44.1 19.7 196 4400 male
## 3 Gentoo Biscoe 44 13.6 208 4350 fema~
## 4 Chinst~ Dream 52 18.1 201 4050 male
## # ... with 1 more variable: year <int>
If you want to sample with replacement, add the replace = TRUE argument to the command.
penguins %>% slice_sample(n = 3, replace = TRUE)
## # A tibble: 3 x 8
## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Gentoo Biscoe 49.8 15.9 229 5950 male
## 2 Gentoo Biscoe 43.6 13.9 217 4900 fema~
## 3 Adelie Biscoe 40.6 18.8 193 3800 male
## # ... with 1 more variable: year <int>
Advanced note: if you want to create a bootstrap sample, use slice_sample with the original sample size, and save your slice. With tidyverse, I like the forwards arrow for assignments because it follows the usual flow of data (tidyverse code flows to the right).
penguins %>% summarize(n()) -> npenguins
as.numeric(npenguins)
## [1] 344
penguins %>% slice_sample(n = as.numeric(npenguins), replace = TRUE) -> bootstrapsample
glimpse(bootstrapsample)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Chinstrap, Gentoo, Gentoo, Adelie, A~
## $ island <fct> Biscoe, Dream, Dream, Biscoe, Biscoe, Dream, Dream, ~
## $ bill_length_mm <dbl> 39.0, 39.5, 52.0, 48.4, 43.6, 39.7, 39.5, 38.9, 42.5~
## $ bill_depth_mm <dbl> 17.5, 17.8, 18.1, 16.3, 13.9, 17.9, 16.7, 18.8, 17.3~
## $ flipper_length_mm <int> 186, 188, 201, 220, 217, 193, 178, 190, 187, 215, 21~
## $ body_mass_g <int> 3550, 3300, 4050, 5400, 4900, 4250, 3250, 3600, 3350~
## $ sex <fct> female, female, male, male, female, male, female, fe~
## $ year <int> 2008, 2007, 2007, 2008, 2008, 2009, 2007, 2008, 2009~
When you filter your rows, you choose rows based on some column values. For example, you might want to see only the rows for penguins from Dream island. This will output a lot of rows, so I’m going to just get a glimpse of the data.
penguins %>% filter(island == "Dream") %>% glimpse()
## Rows: 124
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
## $ island <fct> Dream, Dream, Dream, Dream, Dream, Dream, Dream, Dre~
## $ bill_length_mm <dbl> 39.5, 37.2, 39.5, 40.9, 36.4, 39.2, 38.8, 42.2, 37.6~
## $ bill_depth_mm <dbl> 16.7, 18.1, 17.8, 18.9, 17.0, 21.1, 20.0, 18.5, 19.3~
## $ flipper_length_mm <int> 178, 178, 188, 184, 195, 196, 190, 180, 181, 184, 18~
## $ body_mass_g <int> 3250, 3900, 3300, 3900, 3325, 4150, 3950, 3550, 3300~
## $ sex <fct> female, male, female, male, female, male, male, fema~
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~
Here’s a glimpse of the penguins with bill lengths exceeding 30 mm.
penguins %>% filter(bill_length_mm < 35) %>% glimpse()
## Rows: 9
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
## $ island <fct> Torgersen, Torgersen, Torgersen, Biscoe, Torgersen, ~
## $ bill_length_mm <dbl> 34.1, 34.6, 34.4, 34.5, 33.5, 34.6, 34.0, 33.1, 32.1
## $ bill_depth_mm <dbl> 18.1, 21.1, 18.4, 18.1, 19.0, 17.2, 17.1, 16.1, 15.5
## $ flipper_length_mm <int> 193, 198, 184, 187, 190, 189, 185, 178, 188
## $ body_mass_g <int> 3475, 4400, 3325, 2900, 3600, 3200, 3400, 2900, 3050
## $ sex <fct> NA, male, female, female, female, female, female, fe~
## $ year <int> 2007, 2007, 2007, 2008, 2008, 2008, 2008, 2008, 2009
This is just like square bracket indexing in base R or subsetting with the subset command.
head(penguins[penguins$island == "Dream",])
## # A tibble: 6 x 8
## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Dream 39.5 16.7 178 3250 fema~
## 2 Adelie Dream 37.2 18.1 178 3900 male
## 3 Adelie Dream 39.5 17.8 188 3300 fema~
## 4 Adelie Dream 40.9 18.9 184 3900 male
## 5 Adelie Dream 36.4 17 195 3325 fema~
## 6 Adelie Dream 39.2 21.1 196 4150 male
## # ... with 1 more variable: year <int>
head(subset(penguins, island == "Dream"))
## # A tibble: 6 x 8
## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Dream 39.5 16.7 178 3250 fema~
## 2 Adelie Dream 37.2 18.1 178 3900 male
## 3 Adelie Dream 39.5 17.8 188 3300 fema~
## 4 Adelie Dream 40.9 18.9 184 3900 male
## 5 Adelie Dream 36.4 17 195 3325 fema~
## 6 Adelie Dream 39.2 21.1 196 4150 male
## # ... with 1 more variable: year <int>
You can filter based on a couple criteria by using another pipe and another filter.
penguins %>% filter(island == "Dream") %>% filter(bill_length_mm < 35) %>% glimpse()
## Rows: 3
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie
## $ island <fct> Dream, Dream, Dream
## $ bill_length_mm <dbl> 34.0, 33.1, 32.1
## $ bill_depth_mm <dbl> 17.1, 16.1, 15.5
## $ flipper_length_mm <int> 185, 178, 188
## $ body_mass_g <int> 3400, 2900, 3050
## $ sex <fct> female, female, female
## $ year <int> 2008, 2008, 2009
Use arrange to order the rows of the data. This is like ORDER BY in SQL. Let’s order the penguins from lightest to heaviest. Then I’ll slice the head. This will give us the five lightest penguins.
penguins %>% arrange(body_mass_g) %>% slice_head(n=5)
## # A tibble: 5 x 8
## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Chinst~ Dream 46.9 16.6 192 2700 fema~
## 2 Adelie Biscoe 36.5 16.6 181 2850 fema~
## 3 Adelie Biscoe 36.4 17.1 184 2850 fema~
## 4 Adelie Biscoe 34.5 18.1 187 2900 fema~
## 5 Adelie Dream 33.1 16.1 178 2900 fema~
## # ... with 1 more variable: year <int>
Let’s order the penguins from heaviest to lightest, and use this to find the five heaviest penguins
penguins %>% arrange(desc(body_mass_g)) %>% slice_head(n=5)
## # A tibble: 5 x 8
## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Gentoo Biscoe 49.2 15.2 221 6300 male
## 2 Gentoo Biscoe 59.6 17 230 6050 male
## 3 Gentoo Biscoe 51.1 16.3 220 6000 male
## 4 Gentoo Biscoe 48.8 16.2 222 6000 male
## 5 Gentoo Biscoe 45.2 16.4 223 5950 male
## # ... with 1 more variable: year <int>
We can add a second variable to help in the event of ties.
penguins %>% arrange(body_mass_g, flipper_length_mm) %>% slice_head(n=5)
## # A tibble: 5 x 8
## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Chinst~ Dream 46.9 16.6 192 2700 fema~
## 2 Adelie Biscoe 36.5 16.6 181 2850 fema~
## 3 Adelie Biscoe 36.4 17.1 184 2850 fema~
## 4 Adelie Dream 33.1 16.1 178 2900 fema~
## 5 Adelie Biscoe 34.5 18.1 187 2900 fema~
## # ... with 1 more variable: year <int>
Column operations include selecting variables, renaming variables, mutating variables (creating new variables), and relocating variables within the data-frame.
Selecting variables is a key action in data wrangling. You can use select to create a new data-frame with a subset of the variables, or you can use select to drop variables from a data frame.
First, let’s make a new data frame with just a couple of the variables. Simply list the variables you want to select, with commas separating the variables.
penguins %>% select(bill_length_mm, bill_depth_mm) -> newdata
newdata %>% glimpse()
## Rows: 344
## Columns: 2
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.~
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.~
Now let’s create a data-frame without a couple variables. In other words, let’s drop a couple variables. To do this, list the variables you want to drop with a minus sign in front of each.
penguins %>% select(-bill_length_mm, -bill_depth_mm) -> newdata
Now we can see we have dropped two variables.
dim(penguins)
## [1] 344 8
dim(newdata)
## [1] 344 6
The last thing we’ll cover is selecting several variables in a row with the colon. Pretty cool!
penguins %>% select(bill_length_mm:body_mass_g, species) %>% dim()
## [1] 344 5
Let’s rename the bill length and bill depth variables. Put the new variable name on the left side of the = and the old variable name on the right side.
penguins %>% rename(b_length = bill_length_mm, b_depth = bill_depth_mm, f_length = flipper_length_mm) -> penguins
names(penguins)
## [1] "species" "island" "b_length" "b_depth" "f_length"
## [6] "body_mass_g" "sex" "year"
We use mutate to create new variables. For example, say we want the measurements in inches instead of mm. We need to divide the mm measurements by 25.4 (roughly) to have measurements in inches.
penguins %>% mutate(b_length_in = b_length/25.4, b_depth_in = b_depth/25.4, f_length_in = f_length/25.4) -> penguins2
penguins2 %>% select(b_length:f_length, b_length_in:f_length_in) %>% slice_sample()
## # A tibble: 1 x 6
## b_length b_depth f_length b_length_in b_depth_in f_length_in
## <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 50 19.5 196 1.97 0.768 7.72
When we used mutate, we created 3 new variables and kept the 8 old variables, giving us a data frame with 11 total variables.
dim(penguins)
## [1] 344 8
dim(penguins2)
## [1] 344 11
If we want the data frame to contain the new variables only, we can use transmute.
penguins %>% transmute(b_length_in = b_length/25.4, b_depth_in = b_depth/25.4, f_length_in = f_length/25.4) %>% names()
## [1] "b_length_in" "b_depth_in" "f_length_in"
You can move variables around in the data-frame using relocate. You can move a variable (or multiple variables) to the front really easily, as this is the default. Just list the variable(s) you want at the front. First, I’ll just move island to the front.
penguins %>% relocate(island) %>% names()
## [1] "island" "species" "b_length" "b_depth" "f_length"
## [6] "body_mass_g" "sex" "year"
Now let’s move a few to the front.
penguins %>% relocate(b_length:f_length) %>% names()
## [1] "b_length" "b_depth" "f_length" "species" "island"
## [6] "body_mass_g" "sex" "year"
You can move a variable ahead of another variable.
penguins %>% relocate(island, .before = year) %>% names()
## [1] "species" "b_length" "b_depth" "f_length" "body_mass_g"
## [6] "sex" "island" "year"
You can move a variable behind another variable.
penguins %>% relocate(island, .after = b_depth) %>% names()
## [1] "species" "b_length" "b_depth" "island" "f_length"
## [6] "body_mass_g" "sex" "year"
You can also move several variables.
penguins %>% relocate(b_length:f_length, .after = year) %>% names()
## [1] "species" "island" "body_mass_g" "sex" "year"
## [6] "b_length" "b_depth" "f_length"
If you want to find statistics for each group (such as the mean flipper length for penguins on each island), you will need to group the data before calculating the statistic. Let’s calculate the mean flipper length for penguins on each island to demonstrate grouping. Since we have missing data, I’ve added the na.rm = TRUE argument to the mean command.
penguins %>% group_by(island) %>% summarize(mean(f_length, na.rm = TRUE))
## # A tibble: 3 x 2
## island `mean(f_length, na.rm = TRUE)`
## <fct> <dbl>
## 1 Biscoe 210.
## 2 Dream 193.
## 3 Torgersen 191.
I think the column name is a bit clunky, so let’s make it more reader friendly by giving it a name.
penguins %>% group_by(island) %>% summarize(flipper_means = mean(f_length, na.rm = TRUE))
## # A tibble: 3 x 2
## island flipper_means
## <fct> <dbl>
## 1 Biscoe 210.
## 2 Dream 193.
## 3 Torgersen 191.
This is like using tapply in base R (I kind of hate tapply because the syntax is so clunky) or using lm to find group-wise means.
tapply(penguins$f_length, FUN = mean, INDEX = penguins$island, na.rm = TRUE)
## Biscoe Dream Torgersen
## 209.7066 193.0726 191.1961
lm(f_length ~ 0+ island, data = penguins)
##
## Call:
## lm(formula = f_length ~ 0 + island, data = penguins)
##
## Coefficients:
## islandBiscoe islandDream islandTorgersen
## 209.7 193.1 191.2
If we are curious about the sample size for each island, we can also use grouping.
penguins %>% group_by(island) %>% summarize(n = n())
## # A tibble: 3 x 2
## island n
## <fct> <int>
## 1 Biscoe 168
## 2 Dream 124
## 3 Torgersen 52
If we want both these in the same table, then include both in the summarize command. I’ve labeled each column.
penguins %>% group_by(island) %>% summarize(n = n(), flipper_means = mean(f_length, na.rm = TRUE))
## # A tibble: 3 x 3
## island n flipper_means
## <fct> <int> <dbl>
## 1 Biscoe 168 210.
## 2 Dream 124 193.
## 3 Torgersen 52 191.
We can also group by two variables (such as island AND species). Since island is listed first, it’ll put all the species on Biscoe island first, then it’ll list the species on Dream island, and then it’ll list the species on Torgersen island. In other words, the groups are listed alphabetically by island first, and then by species.
penguins %>% group_by(island, species) %>% summarize(n = n())
## `summarise()` has grouped output by 'island'. You can override using the `.groups` argument.
## # A tibble: 5 x 3
## # Groups: island [3]
## island species n
## <fct> <fct> <int>
## 1 Biscoe Adelie 44
## 2 Biscoe Gentoo 124
## 3 Dream Adelie 56
## 4 Dream Chinstrap 68
## 5 Torgersen Adelie 52
The easiest way to list the groups by species first is to simply change the order of the factors in the group_by command. Now we have the three Adelie rows first.
penguins %>% group_by(species, island) %>% summarize(n = n())
## `summarise()` has grouped output by 'species'. You can override using the `.groups` argument.
## # A tibble: 5 x 3
## # Groups: species [3]
## species island n
## <fct> <fct> <int>
## 1 Adelie Biscoe 44
## 2 Adelie Dream 56
## 3 Adelie Torgersen 52
## 4 Chinstrap Dream 68
## 5 Gentoo Biscoe 124
Now we are ready to talk about multiple data frames. You often want to join a couple data frames to create a new data frame.
For this section, we will use data on beer from a few breweries in Minnesota.
beer <- read.csv("https://cknudson.com/data/MNbeer.csv")
names(beer)
## [1] "Brewery" "Beer" "Description" "Style" "ABV"
## [6] "IBU" "Rating" "Good"
tibble(beer)
## # A tibble: 44 x 8
## Brewery Beer Description Style ABV IBU Rating Good
## <chr> <chr> <chr> <chr> <dbl> <int> <int> <int>
## 1 Bauhaus Wonderstuff New Bohemian Pilsner Lager 5.4 48 88 0
## 2 Bauhaus Stargazer German Style Schwarz~ Lager 5 28 87 0
## 3 Bauhaus Wagon Party West Cost Style Lager Lager 5.4 55 86 0
## 4 Bauhaus Sky-Five! Midwest Coast IPA IPA 6.7 70 86 0
## 5 Bent Paddle Kanu Session Pale Ale Ale 4.8 48 85 0
## 6 Bent Paddle Venture Pils Pilsner Lager Lager 5 38 87 0
## 7 Bent Paddle Bent Hop Golden IPA IPA 6.2 68 89 0
## 8 Bent Paddle 14* ESB Extra Special Amber ~ Ale 5.6 32 88 0
## 9 Bent Paddle Black Ale Black Ale Ale 6 34 89 0
## 10 Indeed Day Tripper Pale Ale Ale 5.4 45 90 1
## # ... with 34 more rows
Before we can begin, we peek at the data and notice that Brewery and Style are characters and we want them to be factors. What techniques have we learned that would be useful? We can MUTATE the variables.
beer %>% mutate(Brewery = as.factor(Brewery), Style = as.factor(Style)) -> beer
tibble(beer)
## # A tibble: 44 x 8
## Brewery Beer Description Style ABV IBU Rating Good
## <fct> <chr> <chr> <fct> <dbl> <int> <int> <int>
## 1 Bauhaus Wonderstuff New Bohemian Pilsner Lager 5.4 48 88 0
## 2 Bauhaus Stargazer German Style Schwarz~ Lager 5 28 87 0
## 3 Bauhaus Wagon Party West Cost Style Lager Lager 5.4 55 86 0
## 4 Bauhaus Sky-Five! Midwest Coast IPA IPA 6.7 70 86 0
## 5 Bent Paddle Kanu Session Pale Ale Ale 4.8 48 85 0
## 6 Bent Paddle Venture Pils Pilsner Lager Lager 5 38 87 0
## 7 Bent Paddle Bent Hop Golden IPA IPA 6.2 68 89 0
## 8 Bent Paddle 14* ESB Extra Special Amber ~ Ale 5.6 32 88 0
## 9 Bent Paddle Black Ale Black Ale Ale 6 34 89 0
## 10 Indeed Day Tripper Pale Ale Ale 5.4 45 90 1
## # ... with 34 more rows
Now we can see the levels of the factors.
beer %>% summarize(breweries = levels(Brewery))
## breweries
## 1 Bauhaus
## 2 Bent Paddle
## 3 Fulton
## 4 Indeed
## 5 Steel Toe
## 6 Summit
## 7 Surly
## 8 Urban Growler
beer %>% summarize(styles = levels(Style))
## styles
## 1 Ale
## 2 IPA
## 3 Lager
Now our first data frame is ready. This section is about joining two data frames, so we need a second data frame. Let’s create a new data frame. This new data frame has some information on the calories for four styles of beer. (I made this data up so don’t read too much into it.)
Style <- c("Lager", "IPA", "Ale", "Stout")
calories <- c(180, 200, 250, 170 )
cals <- data.frame(Style, calories)
glimpse(cals)
## Rows: 4
## Columns: 2
## $ Style <chr> "Lager", "IPA", "Ale", "Stout"
## $ calories <dbl> 180, 200, 250, 170
Let’s make a table. Each row will have one style of beer, and we only want to include the styles that are in BOTH tables. (Because we only want styles listed in BOTH tables, this is an inner join.) The variables we will include are the calories and the mean ABV. We know the calorie information is in the cals data frame and the ABV information is in the beer data frame, so we need to calculate the mean ABV for each style and then join this info with the cals table.
We use what we learned about grouping to create the table containing the mean ABV for each style.
beer %>% group_by(Style) %>% summarize(meanABV = mean(ABV)) -> ABVtable
tibble(ABVtable)
## # A tibble: 3 x 2
## Style meanABV
## <fct> <dbl>
## 1 Ale 5.49
## 2 IPA 6.56
## 3 Lager 5.06
Then we can join the two tables.
ABVtable %>% inner_join(cals)
## Joining, by = "Style"
## # A tibble: 3 x 3
## Style meanABV calories
## <chr> <dbl> <dbl>
## 1 Ale 5.49 250
## 2 IPA 6.56 200
## 3 Lager 5.06 180
Notice how the table created from inner join has three styles? That’s because BOTH of the original tables have those three styles. If we want all four of the styles in the cals table, we can use a right join.
ABVtable %>% right_join(cals)
## Joining, by = "Style"
## # A tibble: 4 x 3
## Style meanABV calories
## <chr> <dbl> <dbl>
## 1 Ale 5.49 250
## 2 IPA 6.56 200
## 3 Lager 5.06 180
## 4 Stout NA 170
We could also do a left join and put cals on the left.
cals %>% left_join(ABVtable)
## Joining, by = "Style"
## Style calories meanABV
## 1 Lager 180 5.055556
## 2 IPA 200 6.564706
## 3 Ale 250 5.494444
## 4 Stout 170 NA
Let’s do one more example. Let’s start with our original beer table and add calorie information on. We will do a left join, because we want all of the rows from our original beer table.
beer %>% left_join(cals) %>% slice_sample(n=4)
## Joining, by = "Style"
## Brewery Beer Description Style ABV IBU
## 1 Indeed Peach Bum India Pale Ale with Peach IPA 6.2 70
## 2 Fulton Pils Continental Pilsner Lager 5.3 30
## 3 Fulton Sweet Child of Vine India Pale Ale IPA 6.4 69
## 4 Urban Growler Cowbell Cream Ale Cream Ale Ale 5.2 20
## Rating Good calories
## 1 85 0 200
## 2 84 0 180
## 3 81 0 200
## 4 82 0 250