Yes, there was a flight on every day of 2013: after removing the duplicated dates, the output has 365 rows, one for each of the 365 days of the year.
6. Does it matter what order you used filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.
The order matters for the amount of work done, not for the final result: either way you end up with the rows that pass the filter, in sorted order.
If you filter first and then arrange, arrange() sorts only the filtered subset.
If you arrange first and then filter, arrange() sorts the entire dataset, and filter() then discards rows that have already been sorted.
The order affects computational efficiency:
Filtering first reduces the dataset size, so the sort that follows works on a smaller subset and is potentially faster.
Arranging first sorts the entire dataset, which can be computationally expensive, and the filter then throws part of that work away.
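The comparison above can be checked on a small example. This is a minimal sketch, assuming dplyr is installed; the table toy and its values are made up for illustration and are not taken from flights.

```r
library(dplyr)

# Small made-up table standing in for flights (values are illustrative only).
toy <- tibble(
  carrier   = c("UA", "AA", "UA", "DL", "AA", "UA"),
  dep_delay = c(12, -3, 45, 7, 30, -1)
)

# Filter first: arrange() only has to sort the rows that survive the filter.
filter_first <- toy |>
  filter(dep_delay > 0) |>
  arrange(dep_delay)

# Arrange first: every row is sorted, then some of the sorted rows are dropped.
arrange_first <- toy |>
  arrange(dep_delay) |>
  filter(dep_delay > 0)

# Compare the two results.
all.equal(filter_first, arrange_first)
```

On a table this small the difference in work is negligible; it would only become noticeable on much larger data.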
Exercise 3.3.5
1. Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
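dep_delay is the difference between dep_time and sched_dep_time, expressed in minutes. A later dep_time or an earlier sched_dep_time gives a larger dep_delay, and vice versa. Because both times are stored as clock values in HHMM format, the plain subtraction dep_time - sched_dep_time matches dep_delay only when the two times fall within the same hour.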
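A short sketch of this relationship, assuming dplyr is installed. The rows in toy are illustrative values in the HHMM layout, and hhmm_to_minutes() is a hypothetical helper written for this example, not a dplyr function.

```r
library(dplyr)

# Illustrative rows in the HHMM layout (made-up sample, not the full data).
toy <- tibble(
  dep_time       = c(517, 554, 1004),
  sched_dep_time = c(515, 600, 955),
  dep_delay      = c(2, -6, 9)
)

# Hypothetical helper: convert an HHMM clock value to minutes since midnight.
hhmm_to_minutes <- function(t) (t %/% 100) * 60 + t %% 100

checked <- toy |>
  mutate(
    # Plain subtraction treats 600 - 554 as 46 instead of 6 minutes.
    naive_diff   = dep_time - sched_dep_time,
    # Converting to minutes first gives a difference in real minutes.
    minutes_diff = hhmm_to_minutes(dep_time) - hhmm_to_minutes(sched_dep_time)
  )

checked
```

This sketch does not handle flights whose actual departure falls after midnight on the following day; those would need extra logic.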
3. What happens if you specify the name of the same variable multiple times in a select() call?
In dplyr, if we specify the same variable name multiple times in a select() call, only one copy of the variable will appear in the resulting data frame.
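This behavior can be seen on a small made-up tibble (a sketch, assuming dplyr is installed; toy is not part of the exercise data):

```r
library(dplyr)

toy <- tibble(x = 1:3, y = c("a", "b", "c"), z = c(TRUE, FALSE, TRUE))

# x is named three times but is kept only once, at its first position.
repeated <- toy |> select(x, y, x, x)

names(repeated)
```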
any_of() selects the columns whose names appear in a character vector and silently ignores any names that are not in the data. It is especially useful for removing variables from a data frame because calling it again does not cause an error. With flights, any_of(variables) selects columns based on the vector variables, which contains the column names “year”, “month”, “day”, “dep_delay”, and “arr_delay”. This keeps the selection code concise.
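A minimal sketch of both uses, assuming dplyr is installed. The one-row tibble toy is made up, and it deliberately lacks an arr_delay column to show how a missing name is handled.

```r
library(dplyr)

# Made-up one-row table; note that it has no arr_delay column.
toy <- tibble(year = 2013, month = 1, day = 1, dep_delay = 2, carrier = "UA")

# Same vector as in the exercise.
variables <- c("year", "month", "day", "dep_delay", "arr_delay")

# any_of() keeps the names that exist and skips the missing one.
kept <- toy |> select(any_of(variables))
names(kept)

# The negative form drops them, and can be applied twice without an error.
dropped <- toy |>
  select(-any_of(variables)) |>
  select(-any_of(variables))
names(dropped)
```

By contrast, all_of(variables) would raise an error here because arr_delay is missing from toy.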
5. Does the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?
No, the result is not surprising: the output contains every column whose name includes “time”. By default, contains() and the other select helpers ignore case. To change that, set the argument ignore.case = FALSE; the call then returns no columns because no column name in our dataset contains uppercase letters.
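The two settings can be compared on a small made-up tibble (a sketch, assuming dplyr is installed; toy only borrows a few column names from flights):

```r
library(dplyr)

toy <- tibble(dep_time = 517, arr_time = 830, air_time = 227, carrier = "UA")

# Default (ignore.case = TRUE): "TIME" matches the lowercase column names.
default_match <- toy |> select(contains("TIME"))
names(default_match)

# Case-sensitive: no column name contains uppercase "TIME".
strict_match <- toy |> select(contains("TIME", ignore.case = FALSE))
ncol(strict_match)
```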
The error occurs because arr_delay no longer exists once select() has kept only the tailnum column. To fix it, include arr_delay in the selection before arranging.
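Two possible fixes, sketched on a made-up tibble (assuming dplyr is installed; the tail numbers and delays are invented):

```r
library(dplyr)

# Made-up rows; tail numbers and delays are illustrative only.
toy <- tibble(
  tailnum   = c("N001", "N002", "N003"),
  arr_delay = c(33, 11, 20)
)

# Option 1: keep arr_delay in the selection so arrange() can still see it.
opt1 <- toy |>
  select(tailnum, arr_delay) |>
  arrange(arr_delay)

# Option 2: sort first, then drop everything except tailnum.
opt2 <- toy |>
  arrange(arr_delay) |>
  select(tailnum)
```

Option 2 is the one to use if the final output should contain tailnum only.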
1. Which carrier has the worst average delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))
`summarise()` has grouped output by 'carrier'. You can override using the
`.groups` argument.
# A tibble: 314 × 3
# Groups: carrier [16]
carrier dest avg_delay
<chr> <chr> <dbl>
1 UA STL 77.5
2 OO ORD 67
3 OO DTW 61
4 UA RDU 60
5 EV PBI 48.7
6 EV TYS 41.8
7 EV CAE 36.7
8 EV TUL 34.9
9 9E BGR 34
10 WN MSY 33.4
# ℹ 304 more rows
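The table ranks carrier-destination pairs rather than carriers. United Airlines (UA) has the worst single average delay (77.5 minutes on flights to STL), but it appears only 2 times in the top 10, whereas ExpressJet (EV) appears 4 times. Average delays by carrier highlight trends, but disentangling the effects of bad airports from those of bad carriers is challenging: carriers share some routes and not others, some carrier-destination combinations may have very few flights, and the data do not record operational strategies or mitigation efforts.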
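The confounding between carriers and airports can be illustrated with invented numbers. This is a sketch under stated assumptions: the carriers XX and YY, the destinations BAD and GOOD, and all delays are made up, and arr_delay is used as the delay measure only for the example. It assumes dplyr is installed.

```r
library(dplyr)

# Made-up delays: carrier XX flies only to the congested airport BAD,
# carrier YY flies to both BAD and GOOD.
toy <- tibble(
  carrier   = c("XX", "XX", "YY", "YY", "YY", "YY"),
  dest      = c("BAD", "BAD", "BAD", "GOOD", "GOOD", "GOOD"),
  arr_delay = c(40, 50, 45, 5, 0, 10)
)

# Carrier-level average: XX looks much worse than YY.
by_carrier <- toy |>
  group_by(carrier) |>
  summarize(avg_delay = mean(arr_delay, na.rm = TRUE), n = n())

# Carrier-by-destination average: at BAD both carriers are equally late,
# so the gap above reflects where XX flies rather than how it operates.
by_route <- toy |>
  group_by(carrier, dest) |>
  summarize(avg_delay = mean(arr_delay, na.rm = TRUE), n = n(), .groups = "drop")

by_carrier
by_route
```

The n column matters when reading such tables: an average built from one or two flights says little about a carrier or an airport.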
2. Find the flights that are most delayed upon departure from each destination.
flights |> group_by(dest) |> slice_max(dep_delay, n = 1, with_ties = FALSE) |> relocate(dest) |> arrange(desc(dep_delay))
3. How do delays vary over the course of the day? Illustrate your answer with a plot.
There are fewer delays before 5 am. A few delays are far longer than the rest; the longest is observed at 7 am. Delays increase slightly over the course of the day, and there are more delays at night.
ggplot(data = flights) + geom_point(mapping = aes(x = dep_time, y = dep_delay), na.rm = TRUE) + labs(x = "Departure Time", y = "Departure Delay (minutes)")
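A scatterplot of every flight can hide the overall trend, so one alternative is to plot the average delay per hour. The following is only a sketch, assuming dplyr and ggplot2 are installed; toy is a made-up sample that reuses the flights column names hour and dep_delay, and the same pipeline could be pointed at flights instead.

```r
library(dplyr)
library(ggplot2)

# Made-up sample with the same column names as flights.
toy <- tibble(
  hour      = c(6, 6, 12, 12, 18, 18),
  dep_delay = c(-2, 4, 5, NA, 20, 30)
)

# Average departure delay per hour, ignoring missing delays.
hourly <- toy |>
  group_by(hour) |>
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE))

# One line showing how the average changes during the day.
p <- ggplot(hourly, aes(x = hour, y = avg_delay)) +
  geom_line() +
  labs(x = "Scheduled departure hour", y = "Average departure delay (minutes)")

p
```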
It will create a 5x3 data frame (5 rows, 3 columns) in the environment with variables x, y, and z. x will be the numbers 1-5, y will be the characters a, b, a, a, b, and z will be the characters K, K, L, L, K, in corresponding order with x.
# A tibble: 5 × 3
# Groups: y [2]
x y z
<int> <chr> <chr>
1 1 a K
2 2 b K
3 3 a L
4 4 a L
5 5 b K
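For reference, a tibble matching the description above can be rebuilt as follows. The definition is reconstructed from the printed values, and the group_by(y) step is inferred from the "# Groups: y [2]" line in the output, so both are assumptions rather than the original code. The sketch assumes dplyr is installed.

```r
library(dplyr)

# Reconstructed from the printed output: x is 1-5, y and z are characters.
df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

# Grouping changes no values and no row order; it only records the groups.
grouped <- df |> group_by(y)

group_vars(grouped)
```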
b. Write down what you think the output will look like, then check if you were correct, and describe what arrange() does. Also comment on how it’s different from the group_by() in part (a)?
arrange() will order rows based on the value of the y column in a 5x3 tibble.
df |> arrange(y)
# A tibble: 5 × 3
x y z
<int> <chr> <chr>
1 1 a K
2 3 a L
3 4 a L
4 2 b K
5 5 b K
It sorts rows in ascending order (smallest to largest) by default; wrapping the column in desc() sorts them in descending order instead. It is different from group_by() in part (a) because it only reorders the rows and does not create groups from the variable values.
c. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does.
df |> group_by(y) |> summarize(mean_x = mean(x))
The output will produce a 2x2 tibble with the mean x for each group of y.
df |> group_by(y) |> summarize(mean_x = mean(x))
# A tibble: 2 × 2
y mean_x
<chr> <dbl>
1 a 2.67
2 b 3.5
The pipe (|>) passes the dataframe (df) to the next action (verb). Here the pipeline first groups the data by the variable y, and then calculates the mean of x within each group of y, returning one row per group. The mean value of x for group a (2.67) is lower than the mean value of x for group b (3.5).
d. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. Then, comment on what the message says.
The output will be a 3x3 tibble containing the mean x for each combination of y and z.
df |> group_by(y, z) |> summarize(mean_x = mean(x))
`summarise()` has grouped output by 'y'. You can override using the `.groups`
argument.
# A tibble: 3 × 3
# Groups: y [2]
y z mean_x
<chr> <chr> <dbl>
1 a K 1
2 a L 3.5
3 b K 3.5
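The pipeline groups the information in df by the values of y and z, then produces the mean x for each combination of y and z values. The message says that the result is still grouped by y: summarize() drops only the last grouping variable (z), and the .groups argument can be used to change that behavior.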
e. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. How is the output different from the one in part (d).
# A tibble: 3 × 3
y z mean_x
<chr> <chr> <dbl>
1 a K 1
2 a L 3.5
3 b K 3.5
The pipeline groups the information in df by the values of y and z, produces the mean x for each combination of y and z values, and then ungroups the data. The summary statistics are the same as in (d), but the output is no longer grouped: it retains the columns y, z, and mean_x without the “# Groups:” line.
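The difference between the grouped result in (d) and the ungrouped result in (e) can be checked directly. This is a sketch, assuming dplyr is installed; df is reconstructed from the printed values, and which of the two ungrouping approaches produced the output shown above is an assumption.

```r
library(dplyr)

# df reconstructed from the printed values.
df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

# As in part (d): summarize() peels off the last grouping variable, leaving y.
still_grouped <- df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x))
group_vars(still_grouped)

# Two ways to end with an ungrouped result.
dropped <- df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "drop")

ungrouped <- df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x)) |>
  ungroup()

group_vars(dropped)
```

Passing .groups = "drop" also silences the message, since the grouping of the result is then stated explicitly.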
f. Write down what you think the outputs will look like, then check if you were correct, and describe what each pipeline does. How are the outputs of the two pipelines different?
The first output will be a 3x3 tibble with the mean x for each combination of y and z (same as (d)).
The second output will be a 5x4 tibble with an extra column titled mean_x in addition to the variable x.
df |> group_by(y, z) |> summarize(mean_x = mean(x))
`summarise()` has grouped output by 'y'. You can override using the `.groups`
argument.
# A tibble: 3 × 3
# Groups: y [2]
y z mean_x
<chr> <chr> <dbl>
1 a K 1
2 a L 3.5
3 b K 3.5
The pipeline groups df by y and z and then produces the mean x.
df |> group_by(y, z) |> mutate(mean_x = mean(x))
# A tibble: 5 × 4
# Groups: y, z [3]
x y z mean_x
<int> <chr> <chr> <dbl>
1 1 a K 1
2 2 b K 3.5
3 3 a L 3.5
4 4 a L 3.5
5 5 b K 3.5
The pipeline groups df by y and z, then creates a new column, producing a 5x4 tibble in which every row carries the mean x for its combination of y and z.
The outputs differ because summarize() collapses each group to a single row, whereas mutate() keeps every original observation, including the x column, and adds mean_x as a new column.
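The relationship between the two outputs can be sketched as follows, assuming dplyr is installed; df is reconstructed from the printed values, and the left_join() step is added only to illustrate the link between the two results.

```r
library(dplyr)

# df reconstructed from the printed values.
df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

# summarize(): one row per (y, z) combination.
summarized <- df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "drop")

# mutate(): one row per original observation, with the group mean attached.
mutated <- df |>
  group_by(y, z) |>
  mutate(mean_x = mean(x)) |>
  ungroup()

c(summarize = nrow(summarized), mutate = nrow(mutated))

# Joining the summary back onto df gives the same column as mutate().
rejoined <- df |> left_join(summarized, by = c("y", "z"))
```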