Here are several ways to select dep_time,
dep_delay, arr_time, and
arr_delay:
# Method 1: List them directly
flights |> select(dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # ℹ 336,766 more rows
# Method 2: Use contains() with "delay" and "time" (also matches sched_*, air_time, and time_hour)
flights |> select(contains("delay"), contains("time"))
## # A tibble: 336,776 × 8
## dep_delay arr_delay dep_time sched_dep_time arr_time sched_arr_time air_time
## <dbl> <dbl> <int> <int> <int> <int> <dbl>
## 1 2 11 517 515 830 819 227
## 2 4 20 533 529 850 830 227
## 3 2 33 542 540 923 850 160
## 4 -1 -18 544 545 1004 1022 183
## 5 -6 -25 554 600 812 837 116
## 6 -4 12 554 558 740 728 150
## 7 -5 19 555 600 913 854 158
## 8 -3 -14 557 600 709 723 53
## 9 -3 -8 557 600 838 846 140
## 10 -2 8 558 600 753 745 138
## # ℹ 336,766 more rows
## # ℹ 1 more variable: time_hour <dttm>
# Method 3: Use starts_with() for "dep" and "arr"
flights |> select(starts_with("dep"), starts_with("arr"))
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # ℹ 336,766 more rows
# Method 4: Use ends_with() for "time" and "delay" (also matches sched_dep_time, sched_arr_time, and air_time)
flights |> select(ends_with("time"), ends_with("delay"))
## # A tibble: 336,776 × 7
## dep_time sched_dep_time arr_time sched_arr_time air_time dep_delay arr_delay
## <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 517 515 830 819 227 2 11
## 2 533 529 850 830 227 4 20
## 3 542 540 923 850 160 2 33
## 4 544 545 1004 1022 183 -1 -18
## 5 554 600 812 837 116 -6 -25
## 6 554 558 740 728 150 -4 12
## 7 555 600 913 854 158 -5 19
## 8 557 600 709 723 53 -3 -14
## 9 557 600 838 846 140 -3 -8
## 10 558 600 753 745 138 -2 8
## # ℹ 336,766 more rows
# Method 5: Use column ranges with `:` (also includes sched_dep_time and sched_arr_time)
flights |> select(dep_time:dep_delay, arr_time:arr_delay)
## # A tibble: 336,776 × 6
## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
## <int> <int> <dbl> <int> <int> <dbl>
## 1 517 515 2 830 819 11
## 2 533 529 4 850 830 20
## 3 542 540 2 923 850 33
## 4 544 545 -1 1004 1022 -18
## 5 554 600 -6 812 837 -25
## 6 554 558 -4 740 728 12
## 7 555 600 -5 913 854 19
## 8 557 600 -3 709 723 -14
## 9 557 600 -3 838 846 -8
## 10 558 600 -2 753 745 8
## # ℹ 336,766 more rows
# Method 6: Direct column selection is the most straightforward but requires typing every variable name
flights |> select(dep_time, dep_delay, arr_time, arr_delay) |> head()
## # A tibble: 6 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
# Method 7: Using a vector with all_of() works well when you have a predefined list of variables
vars <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
flights |> select(all_of(vars)) |> head()
## # A tibble: 6 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
# Method 8: Using matches() with a regular expression is powerful but requires understanding regular expressions
flights |> select(matches("^(dep|arr)_(time|delay)$")) |> head()
## # A tibble: 6 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
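Not every call above returns exactly the four target columns: some helpers also match sched_dep_time, sched_arr_time, air_time, or time_hour. The following is a minimal sketch of one way to check this, assuming the dplyr and nycflights13 packages are installed. The `target` vector and the `selects_target()` helper are hypothetical names introduced here for illustration only.

```r
library(dplyr)
library(nycflights13)

# The four columns the exercise asks for (illustrative name)
target <- c("dep_time", "dep_delay", "arr_time", "arr_delay")

# Hypothetical helper: TRUE when a data frame has exactly the target columns
selects_target <- function(df) setequal(names(df), target)

# starts_with() matches only the four target columns
exact <- selects_target(flights |> select(starts_with("dep"), starts_with("arr")))

# contains() also matches sched_dep_time, sched_arr_time, air_time, time_hour
extra <- selects_target(flights |> select(contains("delay"), contains("time")))

c(exact = exact, extra = extra)
```

A check like this makes it easy to see which helpers are broad enough to need a follow-up selection.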
select() call?
# Selecting the same variable multiple times
flights |> select(dep_time, dep_time, dep_delay) |> head()
## # A tibble: 6 × 2
## dep_time dep_delay
## <int> <dbl>
## 1 517 2
## 2 533 4
## 3 542 2
## 4 544 -1
## 5 554 -6
## 6 554 -4
If you specify the same variable multiple times in
select(), only the first occurrence is kept and the others are
ignored. This is helpful because it means you can combine multiple
selection methods without worrying about duplicated columns in the result.
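The same rule applies when two selection helpers overlap. As a small sketch, assuming dplyr and nycflights13 are installed (the name `overlap_names` is only illustrative), dep_delay matches both helpers below but appears once:

```r
library(dplyr)
library(nycflights13)

# dep_delay is matched by both helpers but is kept only once,
# at the position of its first match
overlap_names <- flights |>
  select(starts_with("dep"), ends_with("delay")) |>
  names()

overlap_names
```

The columns come back in the order in which they were first matched.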
any_of() function do? Why might it be
helpful in conjunction with this vector?
The any_of() function in dplyr selects only those columns named in a
character vector that actually exist in the data frame. If
one or more names in the vector are not found,
any_of() silently ignores them rather than throwing an
error.
This is very useful when:
- working with data that may or may not contain all the columns in your list, since it won’t error if some variables don’t exist;
- writing reusable or flexible code for multiple datasets, which is more concise than listing variables individually;
- allowing users to define their own variable lists safely.
For example, with the vector:
variables <- c("year", "month", "day", "dep_delay", "arr_delay")
flights |> select(any_of(variables))
## # A tibble: 336,776 × 5
## year month day dep_delay arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 2 11
## 2 2013 1 1 4 20
## 3 2013 1 1 2 33
## 4 2013 1 1 -1 -18
## 5 2013 1 1 -6 -25
## 6 2013 1 1 -4 12
## 7 2013 1 1 -5 19
## 8 2013 1 1 -3 -14
## 9 2013 1 1 -3 -8
## 10 2013 1 1 -2 8
## # ℹ 336,766 more rows
Using any_of(variables) ensures that if one of those columns (e.g., arr_delay) is missing in the dataset, your code will still run without errors.
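To see the difference from all_of(), here is a minimal sketch, assuming dplyr and nycflights13 are installed. The name "not_a_column" is made up for illustration and is deliberately absent from flights; `vars_maybe`, `found`, and `strict` are likewise illustrative names.

```r
library(dplyr)
library(nycflights13)

# "not_a_column" is a made-up name that does not exist in flights
vars_maybe <- c("year", "month", "not_a_column")

# any_of() keeps the names that exist and skips the rest
found <- flights |> select(any_of(vars_maybe)) |> names()

# all_of() is strict: a missing name raises an error, captured here
strict <- tryCatch(
  flights |> select(all_of(vars_maybe)) |> names(),
  error = function(e) "error"
)

found
strict
```

Choosing between the two is a question of intent: all_of() is the safer option when a missing column should stop the pipeline, and any_of() when missing columns are expected.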
Yes, the result might be surprising if you expect an all-uppercase pattern to
match nothing: the result includes columns like
dep_time, arr_time, and air_time.
The reason is:
- The contains() helper is case-insensitive by default (ignore.case = TRUE).
- "TIME" (all uppercase) therefore still matches column names like
dep_time or arr_time, even though those are lowercase.
To make it case-sensitive, you can use the
ignore.case = FALSE argument. The call below spells out the default,
ignore.case = TRUE, and returns the same six columns:
flights |> select(contains("TIME", ignore.case = TRUE))
## # A tibble: 336,776 × 6
## dep_time sched_dep_time arr_time sched_arr_time air_time time_hour
## <int> <int> <int> <int> <dbl> <dttm>
## 1 517 515 830 819 227 2013-01-01 05:00:00
## 2 533 529 850 830 227 2013-01-01 05:00:00
## 3 542 540 923 850 160 2013-01-01 05:00:00
## 4 544 545 1004 1022 183 2013-01-01 05:00:00
## 5 554 600 812 837 116 2013-01-01 06:00:00
## 6 554 558 740 728 150 2013-01-01 05:00:00
## 7 555 600 913 854 158 2013-01-01 06:00:00
## 8 557 600 709 723 53 2013-01-01 06:00:00
## 9 557 600 838 846 140 2013-01-01 06:00:00
## 10 558 600 753 745 138 2013-01-01 06:00:00
## # ℹ 336,766 more rows
Even though "TIME" is written in uppercase, contains("TIME") returns every column whose name contains "time": dep_time, sched_dep_time, arr_time, sched_arr_time, air_time, and time_hour.
# Check which columns match "TIME" (all uppercase)
flights |> select(contains("TIME")) |> names()
## [1] "dep_time" "sched_dep_time" "arr_time" "sched_arr_time"
## [5] "air_time" "time_hour"
# Same result with the default written out explicitly
flights |> select(contains("TIME", ignore.case = TRUE)) |> names()
## [1] "dep_time" "sched_dep_time" "arr_time" "sched_arr_time"
## [5] "air_time" "time_hour"
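For contrast, the sketch below turns case-insensitive matching off with ignore.case = FALSE, assuming dplyr and nycflights13 are installed (`upper` and `lower` are illustrative names). With case-sensitive matching, the uppercase pattern should match nothing, while the lowercase pattern still matches.

```r
library(dplyr)
library(nycflights13)

# Case-sensitive: uppercase "TIME" matches no lowercase column name
upper <- flights |> select(contains("TIME", ignore.case = FALSE)) |> names()

# Case-sensitive: lowercase "time" still matches the time columns
lower <- flights |> select(contains("time", ignore.case = FALSE)) |> names()

length(upper)
length(lower)
```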
To rename air_time to air_time_min and move
it to the beginning:
# Rename air_time and move it to the first column
flights |>
rename(air_time_min = air_time) |>
relocate(air_time_min) |>
select(1:6) |>
head()
## # A tibble: 6 × 6
## air_time_min year month day dep_time sched_dep_time
## <dbl> <int> <int> <int> <int> <int>
## 1 227 2013 1 1 517 515
## 2 227 2013 1 1 533 529
## 3 160 2013 1 1 542 540
## 4 183 2013 1 1 544 545
## 5 116 2013 1 1 554 600
## 6 150 2013 1 1 554 558
We use rename() to rename the column air_time to air_time_min to clarify that the unit is minutes.
Then, we use relocate() to move this column to the front of the dataset.
Explanation:
rename(air_time_min = air_time) renames the column.
relocate(air_time_min) places it before all other columns (by default).
We use select(1:6) and head() just to show a sample of the result clearly.
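relocate() can also place a column somewhere other than the front by using its .before or .after arguments. A minimal sketch, assuming dplyr and nycflights13 are installed (the name `moved` is illustrative), that puts the renamed column right after day:

```r
library(dplyr)
library(nycflights13)

# Rename air_time, then place it immediately after the day column
moved <- flights |>
  rename(air_time_min = air_time) |>
  relocate(air_time_min, .after = day) |>
  names()

# Show the first five column names
head(moved, 5)
```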
flights |>
select(tailnum) |>
arrange(arr_delay)
#> Error in arrange():
#> ℹ In argument: ..1 = arr_delay.
#> Caused by error:
#> ! object ‘arr_delay’ not found
Explanation of the error: arr_delay is no longer in the data frame after select(tailnum), so arrange() cannot find it. Either of the following two fixes avoids the problem:
# Fix: include arr_delay before arranging
flights |>
select(tailnum, arr_delay) |>
arrange(arr_delay) |>
head()
## # A tibble: 6 × 2
## tailnum arr_delay
## <chr> <dbl>
## 1 N843VA -86
## 2 N840VA -79
## 3 N851UA -75
## 4 N3KCAA -75
## 5 N551AS -74
## 6 N24212 -73
# Fix: Arrange before selecting
flights |>
arrange(arr_delay) |>
select(tailnum)
## # A tibble: 336,776 × 1
## tailnum
## <chr>
## 1 N843VA
## 2 N840VA
## 3 N851UA
## 4 N3KCAA
## 5 N551AS
## 6 N24212
## 7 N3760C
## 8 N806UA
## 9 N805JB
## 10 N855VA
## # ℹ 336,766 more rows
In the failing pipeline, select(tailnum) removes all columns except tailnum. When arrange(arr_delay) then tries to sort by arr_delay, it cannot find the column because it is no longer in the current data frame.
Error message explained: object ‘arr_delay’ not found means that the column we want to sort by does not exist in the current pipeline result.
To fix this, we need to either:
1. Include arr_delay in our select statement, or
2. Do the arrange() before the select().