library(tidyverse)
#> ── Attaching core tidyverse packages ───────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr 1.1.4 ✔ readr 2.1.5
#> ✔ forcats 1.0.0 ✔ stringr 1.5.1
#> ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
#> ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
#> ✔ purrr 1.0.2
#> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errorsR Tutorial 2: Data Transformation
MKT 410: Marketing Analytics
Learning Objectives
In this tutorial, we will cover data transformation using the dplyr package, which is one of the most important parts of understanding your data. We will assume, for now, that the data being used has already been imported and tidied. We will cover each of these steps in the next two tutorials.
The key elements of data transformation are:
- Operating on rows and columns
- The “pipe” tool to combine “verbs” when transforming data
- Working with groups within the data
Prerequisites
In this tutorial, we will focus on the dplyr package from tidyverse.
To help us learn the basics of data transformation, we will use the flights dataset within nycflights13 (note, you can call specific functions or datasets from libraries that have not been directly loaded by using two colons ::, e.g. nycflights13::flights). This dataset contains all 336,776 flights that departed from New York City in 2013.
# install.packages("nycflights13") ## install first if needed
library(nycflights13)
flights
#> # A tibble: 336,776 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # ℹ 336,770 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …flights is a “tibble”, a special type of data frame used by the tidyverse to avoid some common gotchas. The most important difference between tibbles and data frames is the way tibbles print; they are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen.
Ways to view more of the dataset include:
View(flights)print(flights, width = Inf)glimpse(flights)glimpse(flights) #> Rows: 336,776 #> Columns: 19 #> $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013… #> $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… #> $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… #> $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 55… #> $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 60… #> $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2,… #> $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 8… #> $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 8… #> $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7,… #> $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6"… #> $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301… #> $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N… #> $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LG… #> $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IA… #> $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149… #> $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 73… #> $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6… #> $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59… #> $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-0…
dplyr Basics
There are some basic “rules” for when using dplyr functions (or verbs):
- The first argument is always a data frame.
- The subsequent arguments typically describe which columns to operate on using the variable names (without quotes).
- The output is always a new data frame.
Each verb does one specific transformation to the data frame, so solving complex problems will usually require combining multiple verbs. To do this, we will use a pipe %>%, which takes the thing before it and passes it along as the first argument in the function to the right of it.
x %>% f(y)is equivalent tof(x, y)x %>% f(y) %>% g(z)is equivalent tog(f(x, y), z)
In the context of working with data, we might have something that looks like this:
flights %>%
filter(dest == "IAH") %>%
group_by(year, month, day) %>%
summarize(
arr_delay = mean(arr_delay, na.rm = TRUE)
)dplyr verbs are organized into four groups based on what they operate on: rows, columns, groups, or tables. We’ll go through each of these groups in the following sections.
Rows
We’ll cover four row operations:
filter(): changes which rows are present without changing their orderarrange(): changes the order of the rows without changing which are presentdistinct(): finds rows with unique valuescount(): counts the number of rows within groups of rows
filter
filter() allows you to keep rows based on the values of the columns. The arguments are (in order)
- The data frame
- (and subsequent) The conditions that must be true to keep the row
Let’s find all flights that departed more than 120 minutes late:
flights %>%
filter(dep_delay > 120)
#> # A tibble: 9,723 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 848 1835 853 1001 1950
#> 2 2013 1 1 957 733 144 1056 853
#> 3 2013 1 1 1114 900 134 1447 1222
#> 4 2013 1 1 1540 1338 122 2020 1825
#> 5 2013 1 1 1815 1325 290 2120 1542
#> 6 2013 1 1 1842 1422 260 1958 1535
#> # ℹ 9,717 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …The conditions you can use include:
>: greater than>=: greater than or equal to<: less than<=: less than or equal to==: equal to!=: not equal to%in%: matching specific values
You can also combine conditions using & or , to indicate “and” (i.e. check both conditions) and | to indicate “or” (i.e. check for either condition):
# Flights that departed on January 1
flights %>%
filter(month == 1 & day == 1)
#> # A tibble: 842 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # ℹ 836 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …# Flights that departed in January or February from LGA or JFK
flights %>%
filter(
month %in% c(1, 2),
origin %in% c("LGA", "JFK")
)
#> # A tibble: 32,955 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 533 529 4 850 830
#> 2 2013 1 1 542 540 2 923 850
#> 3 2013 1 1 544 545 -1 1004 1022
#> 4 2013 1 1 554 600 -6 812 837
#> 5 2013 1 1 557 600 -3 709 723
#> 6 2013 1 1 557 600 -3 838 846
#> # ℹ 32,949 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …Note that the filter() function simply executes the operation and prints out the result, but it does not modify the existing flights dataset. To save the result, we need to assign it to a new object:
jan1 <- flights %>%
filter(month == 1, day == 1)Common Mistakes
When specifying equality in the filter() function, make sure to use double equals signs == and not just one =. The single equal sign = is always used to refer to the value of an argument in a function.
flights %>%
filter(month = 1)
#> Error in `filter()`:
#> ! We detected a named input.
#> ℹ This usually means that you've used `=` instead of `==`.
#> ℹ Did you mean `month == 1`?Another common mistake is using the “or” | operator as follows:
flights %>%
filter(month == 1 | 2)
#> # A tibble: 336,776 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # ℹ 336,770 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …While this doesn’t give an error message, what’s actually happening is that the function checks two conditions, the first being month == 1 and the second being 2. The second condition does not mean anything in this context; the result is a filtered dataset with all observations that are in January only.
arrange()
arrange() changes the order of the rows based on the values of the columns. The arguments are (in order)
- The data frame
- (and subsequent) The columns (or more complicated expressions) to order by
When more than one column name is provided, each additional column will be used to break ties in the values of the preceding columns.
For example, let’s try sorting flights by departure time.
flights %>%
arrange(year, month, day, dep_time)
#> # A tibble: 336,776 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # ℹ 336,770 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …What if we specified the ordered columns in reverse order? Try it out.
flights %>%
arrange(dep_time, day, month, year)
## does the output make sense?You can use desc() on a column inside of arrange() to re-order the data frame based on that column in descending order. The following code orders flights from most to least delayed:
flights %>%
arrange(desc(dep_delay))
#> # A tibble: 336,776 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 9 641 900 1301 1242 1530
#> 2 2013 6 15 1432 1935 1137 1607 2120
#> 3 2013 1 10 1121 1635 1126 1239 1810
#> 4 2013 9 20 1139 1845 1014 1457 2210
#> 5 2013 7 22 845 1600 1005 1044 1815
#> 6 2013 4 10 1100 1900 960 1342 2211
#> # ℹ 336,770 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …distinct()
distinct() finds all the unique rows in a dataset. The arguments are:
- A data frame.
- (Optional) Column name(s) for which you want the distinct combinations of. If not provided, the function will return all the unique rows in the data frame.
# Remove duplicate rows, if any
flights %>%
distinct()
#> # A tibble: 336,776 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # ℹ 336,770 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …There are no unique rows in the dataset.
# Find all unique origin and destination pairs
flights %>%
distinct(origin, dest)
#> # A tibble: 224 × 2
#> origin dest
#> <chr> <chr>
#> 1 EWR IAH
#> 2 LGA IAH
#> 3 JFK MIA
#> 4 JFK BQN
#> 5 LGA ATL
#> 6 EWR ORD
#> # ℹ 218 more rowsIf you want to keep other columns when filtering for unique rows, you can use the .keep_all = TRUE option.
flights %>%
distinct(origin, dest, .keep_all = TRUE)
#> # A tibble: 224 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # ℹ 218 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …Note that distinct() will find the first occurrence of a unique row in the dataset and discard the rest. This is why we see all the distinct flights above departed on January 1.
count()
count() finds the number of occurrences within each unique set of rows. For example, if we wanted to find the number of flights that occurred between any two airports, you can write:
flights %>%
count(origin, dest, sort = TRUE)
#> # A tibble: 224 × 3
#> origin dest n
#> <chr> <chr> <int>
#> 1 JFK LAX 11262
#> 2 LGA ATL 10263
#> 3 LGA ORD 8857
#> 4 JFK SFO 8204
#> 5 LGA CLT 6168
#> 6 EWR ORD 6100
#> # ℹ 218 more rowsThe sort = TRUE argument arranges the number of occurrences in descending order.
Exercises
In a single pipeline for each condition, find all flights that meet the conditions:
- Had an arrival delay of two or more hours
- Flew to Houston (
IAHorHOU) - Were operated by United, American, or Delta
- Departed in summer (July, August, and September)
- Arrived more than two hours late but didn’t leave late
- Were delayed by at least an hour, but made up over 30 minutes in flight
Sort
flightsto find the flights with the longest departure delays. Find the flights that left earliest in the morning.Sort
flightsto find the fastest flights. (Hint: Try including a math calculation inside of your function.)Was there a flight on every day of 2013?
Which flights traveled the farthest distance? Which traveled the least distance?
Does it matter what order you used
filter()and `arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.
Columns
There are four important column operations that we will cover:
mutate(): creates new columns that are derived from the existing columnsselect(): changes which columns are presentrename(): changes the names of the columnsrelocate(): changes the positions of the columns
mutate()
mutate() adds new columns that are calculated from the existing columns.
Let us compute the gain, or how much time a delayed flight made up in the air, and the speed in miles per hour:
flights %>%
mutate(
gain = dep_delay - arr_delay,
speed = distance / air_time * 60
)
#> # A tibble: 336,776 × 21
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # ℹ 336,770 more rows
#> # ℹ 13 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …By default, mutate() adds new columns on the right-hand side of your dataset, making it difficult to see what’s happening here. The .before argument can be used to add the variables to the left-hand side of the data frame.
flights %>%
mutate(
gain = dep_delay - arr_delay,
speed = distance / air_time * 60,
.before = 1 # before the first column
)
#> # A tibble: 336,776 × 21
#> gain speed year month day dep_time sched_dep_time dep_delay arr_time
#> <dbl> <dbl> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 -9 370. 2013 1 1 517 515 2 830
#> 2 -16 374. 2013 1 1 533 529 4 850
#> 3 -31 408. 2013 1 1 542 540 2 923
#> 4 17 517. 2013 1 1 544 545 -1 1004
#> 5 19 394. 2013 1 1 554 600 -6 812
#> 6 -16 288. 2013 1 1 554 558 -4 740
#> # ℹ 336,770 more rows
#> # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, …We can alternatively specify .after a specific column, e.g.:
flights %>%
mutate(
gain = dep_delay - arr_delay,
speed = distance / air_time * 60,
.after = day
)Another useful argument in mutate() is the .keep argument, which specifies which columns to keep among four different options (“all” which is the default; “used” which only retains those used to create the new columns; “unused” which retains only columns that were not used; and “none” which only keeps the newly created column and any grouping variables.)
flights |>
mutate(
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours,
.keep = "used"
)
#> # A tibble: 336,776 × 6
#> dep_delay arr_delay air_time gain hours gain_per_hour
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2 11 227 -9 3.78 -2.38
#> 2 4 20 227 -16 3.78 -4.23
#> 3 2 33 160 -31 2.67 -11.6
#> 4 -1 -18 183 17 3.05 5.57
#> 5 -6 -25 116 19 1.93 9.83
#> 6 -4 12 150 -16 2.5 -6.4
#> # ℹ 336,770 more rowsselect()
select() allows you to focus only on specific variables that you’re interested in. There are a few ways to specify the subset of columns.
Select columns by name:
flights %>% select(year, month, day)Select all columns between year and day (inclusive):
flights %>% select(year:day)Select all columns except those from year to day (inclusive):
flights %>% select(!year:day)
Select all columns that are characters:
flights %>% select(where(is.character))
Other helper functions that you can use within select() are:
start_with("abc"): matches names that begin with “abc”ends_with("xyz"): matches names that end with “xyz”contains("ijk"): matches names that contain “ijk”num_range("x", 1:3): matchesx1,x2, andx3
See ?select for more details.
You can also rename variables as you select() then by using =. The new name is on the left-hand side of the =, and the old variable name is on the right-hand side.
flights %>%
select(tail_num = tailnum)
#> # A tibble: 336,776 × 1
#> tail_num
#> <chr>
#> 1 N14228
#> 2 N24211
#> 3 N619AA
#> 4 N804JB
#> 5 N668DN
#> 6 N39463
#> # ℹ 336,770 more rowsrename()
rename() is used to keep all existing variables whiel renaming a few of them.
flights %>%
rename(tail_num = tailnum)
#> # A tibble: 336,776 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # ℹ 336,770 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …relocate()
relocate() moves variables around. By default, relocate() moves variables to the front:
flights %>%
relocate(time_hour, air_time)
#> # A tibble: 336,776 × 19
#> time_hour air_time year month day dep_time sched_dep_time
#> <dttm> <dbl> <int> <int> <int> <int> <int>
#> 1 2013-01-01 05:00:00 227 2013 1 1 517 515
#> 2 2013-01-01 05:00:00 227 2013 1 1 533 529
#> 3 2013-01-01 05:00:00 160 2013 1 1 542 540
#> 4 2013-01-01 05:00:00 183 2013 1 1 544 545
#> 5 2013-01-01 06:00:00 116 2013 1 1 554 600
#> 6 2013-01-01 05:00:00 150 2013 1 1 554 558
#> # ℹ 336,770 more rows
#> # ℹ 12 more variables: dep_delay <dbl>, arr_time <int>, …You can also specify where to put them using the .before and .after arguments, just like in mutate():
flights %>%
relocate(year:dep_time, .after = time_hour)
flights %>%
relocate(starts_with("arr"), .before = dep_time)Exercises
Compare
dep_time,sched_dep_time, anddep_delay. How would you expect those three numbers to be related?Brainstorm as many ways as possible to select
dep_time,dep_delay,arr_time, andarr_delayfromflights.What happens if you specify the name of the same variable multiple times in a
select()call?What does the
any_of()function do? Why might it be helpful in conjunction with this vector?variables <- c("year", "month", "day", "dep_delay", "arr_delay")Does the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?
flights %>% select(contains("TIME"))Rename
air_timetoair_time_minto indicate units of measurement and move it to the beginning of the data frame.Why doesn’t the following work, and what does the error mean?
flights %>% select(tailnum) %>% arrange(arr_delay) #> Error in `arrange()`: #> ℹ In argument: `..1 = arr_delay`. #> Caused by error: #> ! object 'arr_delay' not found
The Pipe %>%
We have learned a few different ways to transform our dataset flights, and we saw how to do so using the pipe %>%. But its real power arises when start to combine multiple verbs.
For example, let’s say you wanted to find the fastest flights to Houston’s IAH airport. How would we do that?
- Filter the destination airport to IAH using
filter() - Create a new variable for the speed of each flight using
mutate() - Order by the speed using
arrange() - View relevant columns using
select()
Let’s see it in action:
flights %>%
filter(dest == "IAH") %>%
mutate(speed = distance / air_time * 60) %>%
arrange(desc(speed)) %>%
select(year:day, dep_time, carrier, flight, speed)
#> # A tibble: 7,198 × 7
#> year month day dep_time carrier flight speed
#> <int> <int> <int> <int> <chr> <int> <dbl>
#> 1 2013 7 9 707 UA 226 522.
#> 2 2013 8 27 1850 UA 1128 521.
#> 3 2013 8 28 902 UA 1711 519.
#> 4 2013 8 28 2122 UA 1022 519.
#> 5 2013 6 11 1628 UA 1178 515.
#> 6 2013 8 27 1017 UA 333 515.
#> # ℹ 7,192 more rowsCompare the above code using the pipeline vs. nested function calls:
select(
arrange(
mutate(
filter(
flights,
dest == "IAH"
),
speed = distance / air_time * 60
),
desc(speed)
),
year:day, dep_time, carrier, flight, speed
)Or using intermediate objects:
flights1 <- filter(flights, dest == "IAH")
flights2 <- mutate(flights1, speed = distance / air_time * 60)
flights3 <- select(flights2, year:day, dep_time, carrier, flight, speed)
arrange(flights3, desc(speed))In contrast, the pipeline generally produces data analysis code that is easier to write and read.
A shortcut for adding a pipe to your code is simply Ctrl/Cmd + Shift + M.
Groups
One of the most powerful aspects of dplyr is its ability to work with groups. We’ll focus on three core functions:
group_by(): divides your dataset into groupssummarize(): calculates a single summary statistic (for each group)- The
slice_functions: extracts specific rows within each group
group_by()
group_by() divides your dataset into groups meaningful for your analysis:
flights %>%
group_by(month)
#> # A tibble: 336,776 × 19
#> # Groups: month [12]
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # ℹ 336,770 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …As you’ll notice, the data doesn’t change. However, the output indicates that it is “grouped by” month (Groups: month [12]). This means subsequent operations will now work “by month”. group_by() adds this grouped feature to the data frame, which changes the behavior of the subsequent verbs applied to the data.
summarize()
summarize() calculates summary statistics for each group, reducing the data frame to have a single row for each group.
Let’s try to compute the average departure delay by month.
flights %>%
group_by(month) %>%
summarize(
avg_delay = mean(dep_delay)
)
#> # A tibble: 12 × 2
#> month avg_delay
#> <int> <dbl>
#> 1 1 NA
#> 2 2 NA
#> 3 3 NA
#> 4 4 NA
#> 5 5 NA
#> 6 6 NA
#> # ℹ 6 more rowsWhy are there all the NAs? This happened due to some observed flights having missing data in the delay column. The mean() function cannot calculate a mean when there is missing data. In order to deal with this, we need to tell mean() to ignore missing values using the na.rm argument:
flights %>%
group_by(month) %>%
summarize(
avg_delay = mean(dep_delay, na.rm = TRUE)
)
#> # A tibble: 12 × 2
#> month avg_delay
#> <int> <dbl>
#> 1 1 10.0
#> 2 2 10.8
#> 3 3 13.2
#> 4 4 13.9
#> 5 5 13.0
#> 6 6 20.8
#> # ℹ 6 more rowsYou can create any number of summaries in a single call of summarize(). There are a lot of different summary statistics that we can specify. One very useful one is n(), which returns the number of rows in each group:
flights %>%
group_by(month) %>%
summarize(
avg_delay = mean(dep_delay, na.rm = TRUE),
n = n()
)
#> # A tibble: 12 × 3
#> month avg_delay n
#> <int> <dbl> <int>
#> 1 1 10.0 27004
#> 2 2 10.8 24951
#> 3 3 13.2 28834
#> 4 4 13.9 28330
#> 5 5 13.0 28796
#> 6 6 20.8 28243
#> # ℹ 6 more rowsThe slice_ functions
There are five useful functions that allow you to extract specific rows within each group:
df |> slice_head(n = 1)takes the first row from each groupdf |> slice_tail(n = 1)takes the last row in each groupdf |> slice_min(x, n = 1)takes the row with the smallest value of column xdf |> slice_max(x, n = 1)takes the row with the largest value of column xdf |> slice_sample(n = 1)takes one random row
You can vary n to select more than one row, or use prop instead of n to select a proportion of rows (e.g. prop = 0.1 to select 10% of the rows in each group).
The following code finds the flights that are most delayed upon arrival at each destination:
flights %>%
group_by(dest) %>%
slice_max(arr_delay, n = 1) %>%
relocate(dest)
#> # A tibble: 108 × 19
#> # Groups: dest [105]
#> dest year month day dep_time sched_dep_time dep_delay arr_time
#> <chr> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 ABQ 2013 7 22 2145 2007 98 132
#> 2 ACK 2013 7 23 1139 800 219 1250
#> 3 ALB 2013 1 25 123 2000 323 229
#> 4 ANC 2013 8 17 1740 1625 75 2042
#> 5 ATL 2013 7 22 2257 759 898 121
#> 6 AUS 2013 7 10 2056 1505 351 2347
#> # ℹ 102 more rows
#> # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, …Note that there are 105 destinations but we get 108 rows here. This is because slice_min() and slice_max() keep tied values. If you want exactly one row per group, set with_ties = FALSE.
While we could compute the max delay with summarize(), using the above code allows you to get the whole corresponding row(s) instead of the single summary statistic.
Grouping by multiple variables
You can create groups using more than one variable. For example, we could create a group for each date (year, month, day):
daily <- flights %>%
group_by(year, month, day)
daily
#> # A tibble: 336,776 × 19
#> # Groups: year, month, day [365]
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # ℹ 336,770 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …Note that when you summarize a tibble grouped by more than one variable, each summary “peels off” the last group. For example, if we use summarize(n = n()) on daily, we will be left with a data frame that has year & month groups, but will no longer have each date as a specific subgroup.
daily_flights <- daily %>%
summarize(n = n())
#> `summarise()` has grouped output by 'year', 'month'. You can override using
#> the `.groups` argument.The warning message tells you how to change this default behavior. You can explicitly request the same behavior (.groups = "drop_last") in order to suppress the message.
daily_flights <- daily %>%
summarize(
n = n(),
.groups = "drop_last"
)Alternatively, .groups = "drop" drops all groupings while .groups = "keep" preserves the same groups as the original data frame.
daily_flights <- daily %>%
summarize(
n = n(),
.groups = "drop"
)
daily_flights
#> # A tibble: 365 × 4
#> year month day n
#> <int> <int> <int> <int>
#> 1 2013 1 1 842
#> 2 2013 1 2 943
#> 3 2013 1 3 914
#> 4 2013 1 4 915
#> 5 2013 1 5 720
#> 6 2013 1 6 832
#> # ℹ 359 more rowsUngrouping
You can also directly ungroup a data frame using ungroup(), instead of using summarize().
daily %>%
ungroup()
#> # A tibble: 336,776 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # ℹ 336,770 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …Let’s see what happens when we summarize an ungrouped data frame.
daily %>%
ungroup() %>%
summarize(
avg_delay = mean(dep_delay, na.rm = TRUE),
flights = n()
)
#> # A tibble: 1 × 2
#> avg_delay flights
#> <dbl> <int>
#> 1 12.6 336776.by
Instead of using group_by() as a separate function in the pipeline, dplyr allows you to specify the groups you want to summarize data for within the summarize() function itself using the .by argument.
flights %>%
summarize(
delay = mean(dep_delay, na.rm = TRUE),
n = n(),
.by = month
)
#> # A tibble: 12 × 3
#> month delay n
#> <int> <dbl> <int>
#> 1 1 10.0 27004
#> 2 10 6.24 28889
#> 3 11 5.44 27268
#> 4 12 16.6 28135
#> 5 2 10.8 24951
#> 6 3 13.2 28834
#> # ℹ 6 more rowsOr if you want to group by multiple variables:
flights %>%
summarize(
delay = mean(dep_delay, na.rm = TRUE),
n = n(),
.by = c(origin, dest)
)
#> # A tibble: 224 × 4
#> origin dest delay n
#> <chr> <chr> <dbl> <int>
#> 1 EWR IAH 11.8 3973
#> 2 LGA IAH 9.06 2951
#> 3 JFK MIA 9.34 3314
#> 4 JFK BQN 6.67 599
#> 5 LGA ATL 11.4 10263
#> 6 EWR ORD 14.6 6100
#> # ℹ 218 more rowsNote that you can use the .by argument with all the verbs in the dplyr toolkit. It can be very useful and may be more convenient to use than group_by().
Exercises
- Which carrier has the worst average delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about
flights %>% group_by(carrier, dest) %>% summarize(n())) - Find the flights that are most delayed upon departure from each destination.
- How do delays vary over the course of the day? Illustrate your answer with a plot.
- What happens if you supply a negative
ntoslice_min()and friends? - Explain what
count()does in terms of thedplyrverbs you just learned. What does thesortargument tocount()do?
Summary
In this tutorial, we learned the tools that dplyr provides for working with data frames. There are roughly three categories of tools:
- Those that manipulate rows (e.g.
filter()andarrange()) - Those that manipulate columns (e.g.
select()andmutate()) - Those that manipulate groups (e.g.
group_by()andsummarize(), orsummarize()with the.byargument)
In the next two tutorials, we’ll talk about the steps that lead up to the data transformation step: data importing and data tidying.