1. Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?

Looking at these three variables:

sched_dep_time is the scheduled departure time, the time the flight was planned to take off.
dep_time is the actual departure time, when the flight really left the gate.
dep_delay is the departure delay in minutes, indicating how early or late the flight departed compared to the schedule.

The relationship should be:

dep_delay = dep_time - sched_dep_time (in minutes)

However, both dep_time and sched_dep_time are recorded in HHMM format (e.g., 515 means 5:15 AM), so subtracting them directly does not accurately give the delay in minutes. For this reason, dep_delay is more reliable, as it is already calculated in minutes.

For example, for the first flight in the data:

sched_dep_time is 515 (5:15 AM)
dep_time is 517 (5:17 AM)
dep_delay is 2

The flight departed 2 minutes later than scheduled, which matches dep_delay. Here direct subtraction happens to work (517 - 515 = 2), but it fails whenever the two times fall in different hours: a flight scheduled for 600 that leaves at 554 has a dep_delay of -6, while 554 - 600 gives -46.
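The HHMM issue can be illustrated with a small base R sketch. The helper name hhmm_to_minutes is hypothetical (it is not part of dplyr or nycflights13), and the sketch assumes both times fall on the same day, so it does not handle flights that slip past midnight.

```r
# Hypothetical helper: convert an HHMM-style number (e.g. 517) to minutes since midnight
hhmm_to_minutes <- function(hhmm) (hhmm %/% 100) * 60 + hhmm %% 100

# Two flights taken from the output shown in this document
sched  <- c(515, 600)
actual <- c(517, 554)

# Naive subtraction treats HHMM as an ordinary number
naive_delay <- actual - sched

# Converting to minutes first gives the true difference
true_delay <- hhmm_to_minutes(actual) - hhmm_to_minutes(sched)

naive_delay  # 2 -46
true_delay   # 2  -6
```

The second flight shows the problem: the naive result is -46, while the converted result of -6 agrees with the recorded dep_delay.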

2. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

Here are several ways to select dep_time, dep_delay, arr_time, and arr_delay:

# Method 1: List them directly
flights |> select(dep_time, dep_delay, arr_time, arr_delay)

## # A tibble: 336,776 × 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # ℹ 336,766 more rows

# Method 2: Use contains() with "delay" and "time" (note: this also selects sched_dep_time, sched_arr_time, air_time, and time_hour)
flights |> select(contains("delay"), contains("time"))

## # A tibble: 336,776 × 8
##    dep_delay arr_delay dep_time sched_dep_time arr_time sched_arr_time air_time
##        <dbl>     <dbl>    <int>          <int>    <int>          <int>    <dbl>
##  1         2        11      517            515      830            819      227
##  2         4        20      533            529      850            830      227
##  3         2        33      542            540      923            850      160
##  4        -1       -18      544            545     1004           1022      183
##  5        -6       -25      554            600      812            837      116
##  6        -4        12      554            558      740            728      150
##  7        -5        19      555            600      913            854      158
##  8        -3       -14      557            600      709            723       53
##  9        -3        -8      557            600      838            846      140
## 10        -2         8      558            600      753            745      138
## # ℹ 336,766 more rows
## # ℹ 1 more variable: time_hour <dttm>

# Method 3: Use starts_with() for "dep" and "arr"
flights |> select(starts_with("dep"), starts_with("arr"))

## # A tibble: 336,776 × 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # ℹ 336,766 more rows

# Method 4: Use ends_with() for "time" and "delay" (note: this also selects sched_dep_time, sched_arr_time, and air_time)
flights |> select(ends_with("time"), ends_with("delay"))

## # A tibble: 336,776 × 7
##    dep_time sched_dep_time arr_time sched_arr_time air_time dep_delay arr_delay
##       <int>          <int>    <int>          <int>    <dbl>     <dbl>     <dbl>
##  1      517            515      830            819      227         2        11
##  2      533            529      850            830      227         4        20
##  3      542            540      923            850      160         2        33
##  4      544            545     1004           1022      183        -1       -18
##  5      554            600      812            837      116        -6       -25
##  6      554            558      740            728      150        -4        12
##  7      555            600      913            854      158        -5        19
##  8      557            600      709            723       53        -3       -14
##  9      557            600      838            846      140        -3        -8
## 10      558            600      753            745      138        -2         8
## # ℹ 336,766 more rows

# Method 5: Use column ranges with : (note: this also selects sched_dep_time and sched_arr_time)
flights |> select(dep_time:dep_delay, arr_time:arr_delay)

## # A tibble: 336,776 × 6
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
##       <int>          <int>     <dbl>    <int>          <int>     <dbl>
##  1      517            515         2      830            819        11
##  2      533            529         4      850            830        20
##  3      542            540         2      923            850        33
##  4      544            545        -1     1004           1022       -18
##  5      554            600        -6      812            837       -25
##  6      554            558        -4      740            728        12
##  7      555            600        -5      913            854        19
##  8      557            600        -3      709            723       -14
##  9      557            600        -3      838            846        -8
## 10      558            600        -2      753            745         8
## # ℹ 336,766 more rows

# Method 6: Direct column selection is the most straightforward but requires typing every variable name
flights |> select(dep_time, dep_delay, arr_time, arr_delay) |> head()

## # A tibble: 6 × 4
##   dep_time dep_delay arr_time arr_delay
##      <int>     <dbl>    <int>     <dbl>
## 1      517         2      830        11
## 2      533         4      850        20
## 3      542         2      923        33
## 4      544        -1     1004       -18
## 5      554        -6      812       -25
## 6      554        -4      740        12

# Method 7: A character vector with all_of() is handy when you have a predefined list of variables
vars <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
flights |> select(all_of(vars)) |> head()

## # A tibble: 6 × 4
##   dep_time dep_delay arr_time arr_delay
##      <int>     <dbl>    <int>     <dbl>
## 1      517         2      830        11
## 2      533         4      850        20
## 3      542         2      923        33
## 4      544        -1     1004       -18
## 5      554        -6      812       -25
## 6      554        -4      740        12

# Method 8: matches() with a regular expression is powerful but requires understanding regular expressions
flights |> select(matches("^(dep|arr)_(time|delay)$")) |> head()

## # A tibble: 6 × 4
##   dep_time dep_delay arr_time arr_delay
##      <int>     <dbl>    <int>     <dbl>
## 1      517         2      830        11
## 2      533         4      850        20
## 3      542         2      923        33
## 4      544        -1     1004       -18
## 5      554        -6      812       -25
## 6      554        -4      740        12
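Not every method above returns exactly the four requested columns. The sketch below is one way to check a selection against the target set. It uses a hypothetical one-row tibble named toy that carries only the time and delay columns of flights, so it runs without nycflights13. The last selection, which combines helpers with & and |, is an additional option rather than one of the methods listed above.

```r
library(dplyr)

# Hypothetical stand-in for flights: one row, only the time/delay columns
toy <- tibble(
  dep_time = 517L, sched_dep_time = 515L, dep_delay = 2,
  arr_time = 830L, sched_arr_time = 819L, arr_delay = 11,
  air_time = 227
)
target <- c("dep_time", "dep_delay", "arr_time", "arr_delay")

# starts_with() picks exactly the four target columns
exact <- toy |> select(starts_with("dep"), starts_with("arr")) |> names()

# ends_with() also picks up the scheduled times and air_time
broad <- toy |> select(ends_with("time"), ends_with("delay")) |> names()

# Combining helpers with & and | narrows the broad match back down
narrowed <- toy |>
  select((starts_with("dep") | starts_with("arr")) &
           (ends_with("time") | ends_with("delay"))) |>
  names()

setequal(exact, target)     # TRUE
setdiff(broad, target)      # the extra columns selected by ends_with()
setequal(narrowed, target)  # TRUE
```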

3. What happens if you specify the name of the same variable multiple times in a `select()` call?

# Selecting the same variable multiple times
flights |> select(dep_time, dep_time, dep_delay) |> head()

## # A tibble: 6 × 2
##   dep_time dep_delay
##      <int>     <dbl>
## 1      517         2
## 2      533         4
## 3      542         2
## 4      544        -1
## 5      554        -6
## 6      554        -4

If you specify the same variable multiple times in select(), it appears only once in the result, at the position of its first occurrence; the repeats are ignored. This is helpful because you can combine several selection helpers without worrying about duplicated columns.
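The same rule applies when a column is matched by more than one helper. The following is a minimal sketch on a hypothetical tibble named toy (not the full flights data), in which dep_time is matched three times.

```r
library(dplyr)

# Hypothetical four-column stand-in for flights
toy <- tibble(dep_time = 517L, dep_delay = 2, arr_time = 830L, arr_delay = 11)

# dep_time is matched by name, by starts_with("dep"), and by ends_with("time")
result <- toy |> select(dep_time, starts_with("dep"), ends_with("time"))

names(result)  # "dep_time" "dep_delay" "arr_time"
```

Each column appears once, in the order in which it was first matched.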

4. What does the `any_of()` function do? Why might it be helpful in conjunction with this vector?

The any_of() function in dplyr selects the columns named in a character vector that actually exist in the data frame. If one or more names in the vector are not found, any_of() silently ignores them rather than throwing an error.

This is useful when:

You are working with data that may or may not contain all the columns in your list.
You want to write reusable code that works across multiple datasets.
You let users supply their own variable lists safely.

For example, with the vector:

variables <- c("year", "month", "day", "dep_delay", "arr_delay")
flights |> select(any_of(variables))

## # A tibble: 336,776 × 5
##     year month   day dep_delay arr_delay
##    <int> <int> <int>     <dbl>     <dbl>
##  1  2013     1     1         2        11
##  2  2013     1     1         4        20
##  3  2013     1     1         2        33
##  4  2013     1     1        -1       -18
##  5  2013     1     1        -6       -25
##  6  2013     1     1        -4        12
##  7  2013     1     1        -5        19
##  8  2013     1     1        -3       -14
##  9  2013     1     1        -3        -8
## 10  2013     1     1        -2         8
## # ℹ 336,766 more rows

Using any_of(variables) ensures that if one of those columns (e.g., arr_delay) is missing in the dataset, your code will still run without errors.
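The difference from the stricter all_of() can be sketched on a small example. The tibble toy is hypothetical and deliberately lacks arr_delay; the tryCatch() wrapper is only there so the script keeps running after the expected error.

```r
library(dplyr)

# Hypothetical data frame that is missing arr_delay
toy <- tibble(year = 2013L, month = 1L, day = 1L, dep_delay = 2)
variables <- c("year", "month", "day", "dep_delay", "arr_delay")

# any_of() keeps the columns that exist and skips the missing one
kept <- toy |> select(any_of(variables)) |> names()

# all_of() is strict: a missing column raises an error
strict_failed <- tryCatch(
  {
    toy |> select(all_of(variables))
    FALSE
  },
  error = function(e) TRUE
)

kept           # "year" "month" "day" "dep_delay"
strict_failed  # TRUE
```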

5. Does the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?

flights |> select(contains("TIME"))

Yes, the result might be surprising: contains("TIME") returns dep_time, sched_dep_time, arr_time, sched_arr_time, air_time, and time_hour, even though the pattern is uppercase and every column name is lowercase. The reason is that the select helpers ignore case by default (ignore.case = TRUE).

To make a helper case-sensitive, pass ignore.case = FALSE. Writing the default out explicitly, as below, returns the same six columns:

flights |> select(contains("TIME", ignore.case = TRUE))

## # A tibble: 336,776 × 6
##    dep_time sched_dep_time arr_time sched_arr_time air_time time_hour          
##       <int>          <int>    <int>          <int>    <dbl> <dttm>             
##  1      517            515      830            819      227 2013-01-01 05:00:00
##  2      533            529      850            830      227 2013-01-01 05:00:00
##  3      542            540      923            850      160 2013-01-01 05:00:00
##  4      544            545     1004           1022      183 2013-01-01 05:00:00
##  5      554            600      812            837      116 2013-01-01 06:00:00
##  6      554            558      740            728      150 2013-01-01 05:00:00
##  7      555            600      913            854      158 2013-01-01 06:00:00
##  8      557            600      709            723       53 2013-01-01 06:00:00
##  9      557            600      838            846      140 2013-01-01 06:00:00
## 10      558            600      753            745      138 2013-01-01 06:00:00
## # ℹ 336,766 more rows

The two calls below return the same six column names, confirming that the uppercase pattern "TIME" matches the lowercase column names by default:

# Check which columns match "TIME" (all uppercase)
flights |> select(contains("TIME")) |> names()

## [1] "dep_time"       "sched_dep_time" "arr_time"       "sched_arr_time"
## [5] "air_time"       "time_hour"

# The same search with the default written out explicitly
flights |> select(contains("TIME", ignore.case = TRUE)) |> names()

## [1] "dep_time"       "sched_dep_time" "arr_time"       "sched_arr_time"
## [5] "air_time"       "time_hour"

With ignore.case = FALSE, contains("TIME") would select no columns, because every column name in flights is lowercase.
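The effect of switching off the default can be checked on a small example. This is a minimal sketch on a hypothetical three-column tibble named toy rather than the full flights data.

```r
library(dplyr)

# Hypothetical stand-in with lowercase column names, as in flights
toy <- tibble(dep_time = 517L, arr_time = 830L, carrier = "UA")

# Default behaviour: case is ignored, so "TIME" matches the lowercase names
default_match <- toy |> select(contains("TIME")) |> names()

# Case-sensitive matching: no column name contains uppercase "TIME"
strict_match <- toy |> select(contains("TIME", ignore.case = FALSE)) |> names()

default_match         # "dep_time" "arr_time"
length(strict_match)  # 0
```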

6. Rename air_time to air_time_min to indicate units of measurement and move it to the beginning of the data frame.

To rename air_time to air_time_min and move it to the beginning:

# Rename air_time and move it to the first column
flights |> 
  rename(air_time_min = air_time) |> 
  relocate(air_time_min) |> 
  select(1:6) |> 
  head()

## # A tibble: 6 × 6
##   air_time_min  year month   day dep_time sched_dep_time
##          <dbl> <int> <int> <int>    <int>          <int>
## 1          227  2013     1     1      517            515
## 2          227  2013     1     1      533            529
## 3          160  2013     1     1      542            540
## 4          183  2013     1     1      544            545
## 5          116  2013     1     1      554            600
## 6          150  2013     1     1      554            558

Explanation:

rename(air_time_min = air_time) renames the column to make clear that the unit is minutes.

relocate(air_time_min) moves it to the front, because relocate() places columns before all others by default.

select(1:6) and head() are used only to keep the displayed sample small.
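As an alternative sketch, the rename and the move can be done in a single select() call, since select() accepts new_name = old_name and everything() appends the remaining columns. The tibble toy below is a hypothetical stand-in for flights.

```r
library(dplyr)

# Hypothetical three-column stand-in for flights
toy <- tibble(year = 2013L, dep_time = 517L, air_time = 227)

# Rename air_time while selecting it first, then keep all remaining columns
moved <- toy |> select(air_time_min = air_time, everything())

names(moved)  # "air_time_min" "year" "dep_time"
```

The rename() plus relocate() version states the intent more explicitly; the select() version is shorter.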

7. Why doesn’t the following work, and what does the error mean?

flights |>
  select(tailnum) |>
  arrange(arr_delay)
#> Error in arrange():
#> ℹ In argument: ..1 = arr_delay.
#> Caused by error:
#> ! object ‘arr_delay’ not found

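Explanation of the error: select(tailnum) keeps only the tailnum column, so arr_delay no longer exists in the data frame when arrange(arr_delay) runs. The message object ‘arr_delay’ not found means that the column to sort by is not present at that point in the pipeline. Two possible fixes: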

# Fix: include arr_delay before arranging
flights |> 
  select(tailnum, arr_delay) |> 
  arrange(arr_delay) |> 
  head()

## # A tibble: 6 × 2
##   tailnum arr_delay
##   <chr>       <dbl>
## 1 N843VA        -86
## 2 N840VA        -79
## 3 N851UA        -75
## 4 N3KCAA        -75
## 5 N551AS        -74
## 6 N24212        -73

# Fix: Arrange before selecting
flights |> 
  arrange(arr_delay) |> 
  select(tailnum)

## # A tibble: 336,776 × 1
##    tailnum
##    <chr>  
##  1 N843VA 
##  2 N840VA 
##  3 N851UA 
##  4 N3KCAA 
##  5 N551AS 
##  6 N24212 
##  7 N3760C 
##  8 N806UA 
##  9 N805JB 
## 10 N855VA 
## # ℹ 336,766 more rows

In short, select(tailnum) drops every column except tailnum, so arrange(arr_delay) has nothing to sort by.

To fix this, we need to either: 1. Include arr_delay in the select() call, or 2. Call arrange() before select().

Flight Data Analysis Answers

Your Name

2025-05-26
