6.Does it matter what order you used filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.
The result is the same either way, because filter() keeps the remaining rows in their existing order. Filtering first is more efficient, though: arrange() then only has to sort the rows that pass the filter instead of the whole data frame.
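As a rough check of this answer, the sketch below runs both orders on flights. It assumes the dplyr and nycflights13 packages are installed, and the 60-minute cutoff is an arbitrary value chosen only for illustration.

```r
library(dplyr)
library(nycflights13)

# Filter first: arrange() only has to sort the rows that pass the filter
filter_first <- flights |>
  filter(dep_delay > 60) |>
  arrange(dep_delay)

# Arrange first: arrange() sorts every row, and most are then discarded
arrange_first <- flights |>
  arrange(dep_delay) |>
  filter(dep_delay > 60)

# Both orders should give the same rows in the same order
identical(filter_first, arrange_first)
```

The only practical difference is how many rows arrange() has to sort.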
3.3.5 Problems
1.Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
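dep_delay is the difference between the actual and the scheduled departure time, in minutes: dep_delay = dep_time - sched_dep_time. Because dep_time and sched_dep_time are stored in HHMM format (for example, 517 means 5:17 am), they have to be converted to minutes before subtracting.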
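The relationship can be checked with a short sketch. The helper hhmm_to_minutes() and the column dep_delay_calc are hypothetical names introduced here for illustration; the code assumes dplyr and nycflights13 are installed.

```r
library(dplyr)
library(nycflights13)

# Hypothetical helper: convert an HHMM clock time (e.g. 517) to minutes since midnight
hhmm_to_minutes <- function(t) (t %/% 100) * 60 + t %% 100

delay_check <- flights |>
  mutate(
    dep_delay_calc = hhmm_to_minutes(dep_time) - hhmm_to_minutes(sched_dep_time)
  ) |>
  select(dep_time, sched_dep_time, dep_delay, dep_delay_calc)

# Share of flights where the recomputed delay matches dep_delay;
# flights that end up departing after midnight are off by 1440 minutes
mean(delay_check$dep_delay_calc == delay_check$dep_delay, na.rm = TRUE)
```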
2.Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
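The four columns can be selected by listing their names, by using select helpers such as starts_with() or matches(), or by passing a character vector of names to all_of() or any_of().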
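A few of these options are sketched below, assuming dplyr and nycflights13 are installed. The object names (by_name, by_prefix, wanted, and so on) are made up for this example.

```r
library(dplyr)
library(nycflights13)

# 1. Name the columns directly
by_name <- flights |> select(dep_time, dep_delay, arr_time, arr_delay)

# 2. Use prefix helpers
by_prefix <- flights |> select(starts_with("dep_"), starts_with("arr_"))

# 3. Pass a character vector of names
wanted <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
by_vector <- flights |> select(all_of(wanted))

# 4. Match the names with a regular expression
by_regex <- flights |> select(matches("^(dep|arr)_(time|delay)$"))

names(by_name)
```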
3.What happens if you specify the name of the same variable multiple times in a select() call?
The select() function keeps the variable only once, at the position where it is first named, and ignores the repeated mentions.
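A small check of this behaviour, assuming dplyr and nycflights13 are installed (dup is a made-up object name):

```r
library(dplyr)
library(nycflights13)

# dep_time is named twice but appears only once in the result
dup <- flights |> select(dep_time, arr_time, dep_time)
names(dup)
#> [1] "dep_time" "arr_time"
```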
4.What does the any_of() function do? Why might it be helpful in conjunction with this vector?
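The any_of() function selects whichever columns named in a character vector exist in the data and silently skips names that are missing, so no error occurs. (all_of(), by contrast, throws an error if any name is missing.) This is helpful with a vector of column names because the whole vector can be passed to select() in one step.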
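The difference between any_of() and all_of() can be sketched as follows. The contents of variables are an assumption made for this example, and "not_a_column" is a deliberately made-up name that does not exist in flights.

```r
library(dplyr)
library(nycflights13)

# Assumed vector of names; "not_a_column" is absent from flights
variables <- c("year", "month", "day", "dep_delay", "arr_delay", "not_a_column")

# any_of() keeps the names that exist and skips the missing one
any_cols <- flights |> select(any_of(variables))
names(any_cols)

# all_of() is strict: the missing name triggers an error, caught here
all_result <- tryCatch(
  flights |> select(all_of(variables)),
  error = function(e) "all_of() raised an error"
)
all_result
```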
5.Does the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?
7.Why doesn’t the following work, and what does the error mean?
# flights |>
#   select(tailnum) |>
#   arrange(arr_delay)
#> Error in `arrange()`:
#> ℹ In argument: `..1 = arr_delay`.
#> Caused by error:
#> ! object 'arr_delay' not found
arr_delay cannot be used in arrange() because, after select(tailnum), the tibble contains only the tailnum column. (I commented this code out and took a screenshot instead, so that the file would render.)
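Two possible ways to make the pipeline run are sketched below, assuming dplyr and nycflights13 are installed. Both are illustrations rather than the one intended fix, and the object names are made up.

```r
library(dplyr)
library(nycflights13)

# Option 1: sort while arr_delay is still present, then keep only tailnum
sorted_first <- flights |>
  arrange(arr_delay) |>
  select(tailnum)

# Option 2: keep both columns so arrange() can still see arr_delay
keep_both <- flights |>
  select(tailnum, arr_delay) |>
  arrange(arr_delay)
```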
3.5.7 Problems
1.Which carrier has the worst average delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))
All rows are shown instead of just the highest or lowest value
5.Explain what count() does in terms of the dplyr verbs you just learned. What does the sort argument to count() do?
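The count() function counts the number of rows in each category; it is shorthand for group_by() on the counted variables followed by summarize(n = n()). The sort argument controls the order of the result: with sort = TRUE, the rows are arranged in descending order of n, so the most common categories appear first.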
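The equivalence can be sketched on flights, assuming dplyr and nycflights13 are installed. Counting by carrier is just an example choice, and with_count and by_hand are made-up names.

```r
library(dplyr)
library(nycflights13)

# count() with sort = TRUE
with_count <- flights |> count(carrier, sort = TRUE)

# The same steps written out with the verbs from this chapter
by_hand <- flights |>
  group_by(carrier) |>
  summarize(n = n()) |>
  arrange(desc(n))

identical(with_count$carrier, by_hand$carrier)
```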
6.The parts below use a tiny data frame, df, with five rows and three columns (x, y, and z).
Write down what you think the output will look like, then check if you were correct, and describe what group_by() does.
The output will be a 5 by 3 data frame.
df |> group_by(y)
# A tibble: 5 × 3
# Groups:   y [2]
      x y     z
  <int> <chr> <chr>
1     1 a     K
2     2 b     K
3     3 a     L
4     4 a     L
5     5 b     K
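The group_by() function does not change the data itself. It marks the rows as belonging to groups defined by the chosen variable (here y, with 2 groups), so that later operations such as summarize() work group by group.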
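The grouping can be inspected directly. In this sketch, df is reconstructed from the printed output of this exercise, and grouped is a made-up name; only dplyr is assumed.

```r
library(dplyr)

# Data frame reconstructed from the printed output in this exercise
df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

grouped <- df |> group_by(y)

group_vars(grouped)  # names of the grouping variables
n_groups(grouped)    # number of groups

# The row order is unchanged; only the grouping information is added
identical(grouped$x, df$x)
```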
Describe what arrange() does and write down your prediction for the output’s appearance. Then, verify that your prediction was accurate. Additionally, what distinguishes it from the group_by() in portion (a)?
It appears to me that the observations in the 5 by 3 tibble will be sorted according to their values of y.
df |> arrange(y)
# A tibble: 5 × 3
      x y     z
  <int> <chr> <chr>
1     1 a     K
2     3 a     L
3     4 a     L
4     2 b     K
5     5 b     K
The arrange() function sorts the rows by the values of the given variable, in ascending order by default or in descending order with desc(). This differs from group_by(), which leaves the row order unchanged and only records the grouping.
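The contrast can be sketched as follows, with df reconstructed from the printed output of this exercise (the object names are made up; only dplyr is assumed).

```r
library(dplyr)

# Data frame reconstructed from the printed output in this exercise
df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

sorted <- df |> arrange(y)             # ascending order of y
sorted_desc <- df |> arrange(desc(y))  # descending order of y

sorted$x       # rows have been reordered
sorted_desc$x

# arrange() adds no grouping, whereas group_by() does
is_grouped_df(sorted)
is_grouped_df(df |> group_by(y))
```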
Describe what the pipeline performs and write down your prediction for the output’s appearance. Then, see whether you were right.
I predict that the result will be a 2 by 2 tibble.
df |> group_by(y) |> summarize(mean_x = mean(x))
# A tibble: 2 × 2
  y     mean_x
  <chr>  <dbl>
1 a       2.67
2 b       3.5
The pipeline first groups df by y and then uses summarize() to compute the mean of x within each group, giving one row per value of y.
Describe what the pipeline performs and write down your prediction for the output’s appearance. Then, see whether you were right. After that, discuss what the message says.
A 3 by 3 tibble, I believe, will be the result.
df |> group_by(y, z) |> summarize(mean_x = mean(x))
`summarise()` has grouped output by 'y'. You can override using the `.groups`
argument.
# A tibble: 3 × 3
# Groups:   y [2]
  y     z     mean_x
  <chr> <chr>  <dbl>
1 a     K        1
2 a     L        3.5
3 b     K        3.5
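The pipeline produced the mean of x for each combination of y and z. The message says that summarize() dropped only the last grouping variable (z), so the output is still grouped by y; the .groups argument can be used to change this.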
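The effect of the .groups argument mentioned in the message can be sketched as follows. Here df is reconstructed from the printed output of this exercise, and keep_y and keep_both are made-up names; only dplyr is assumed.

```r
library(dplyr)

# Data frame reconstructed from the printed output in this exercise
df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

# Spelling out the default behaviour ("drop_last") avoids the message
keep_y <- df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "drop_last")
group_vars(keep_y)     # still grouped by y

# "keep" retains both grouping variables
keep_both <- df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "keep")
group_vars(keep_both)  # grouped by y and z
```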
Describe what the pipeline performs and write down your prediction for the output’s appearance. Then, verify your prediction. What distinguishes the output from that in section (d)?
The outcome, in my opinion, will be the same as part D.
# A tibble: 3 × 3
  y     z     mean_x
  <chr> <chr>  <dbl>
1 a     K        1
2 a     L        3.5
3 b     K        3.5
The values are the same as in part (d), but this result is ungrouped: the "# Groups: y [2]" line is gone, whereas the output in part (d) was still grouped by y.
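The code for this part is not shown above. As an assumption, one way to get an ungrouped result like this one is to pass .groups = "drop" to summarize(); calling ungroup() afterwards gives the same thing. In the sketch, df is reconstructed from the printed output of this exercise and the object names are made up.

```r
library(dplyr)

# Data frame reconstructed from the printed output in this exercise
df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

# Assumed pipeline: drop all grouping after summarizing
dropped <- df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "drop")
group_vars(dropped)  # empty: no grouping variables remain

# Equivalent alternative: summarize, then ungroup() explicitly
ungrouped <- df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "drop_last") |>
  ungroup()
```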
Describe the functions of each pipeline and write down your predictions for the outputs. Then, verify your predictions. What differences exist between the two pipelines’ outputs?
It appears that a 3 by 3 tibble will be produced by the first code, and a 3 by 4 tibble by the second.
df |> group_by(y, z) |> summarize(mean_x = mean(x))
`summarise()` has grouped output by 'y'. You can override using the `.groups`
argument.
# A tibble: 3 × 3
# Groups:   y [2]
  y     z     mean_x
  <chr> <chr>  <dbl>
1 a     K        1
2 a     L        3.5
3 b     K        3.5
df |> group_by(y, z) |> mutate(mean_x = mean(x))
# A tibble: 5 × 4
# Groups:   y, z [3]
      x y     z     mean_x
  <int> <chr> <chr>  <dbl>
1     1 a     K        1
2     2 b     K        3.5
3     3 a     L        3.5
4     4 a     L        3.5
5     5 b     K        3.5
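The two outputs differ because summarize() collapses each group into a single row, giving a 3 by 3 tibble, while mutate() keeps all five original rows and adds mean_x as a new column, giving a 5 by 4 tibble in which every row carries the mean of its own group. The mutate() result is also still grouped by both y and z.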
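The difference in shape can be checked with a short sketch. Here df is reconstructed from the printed output of this exercise, the object names are made up, and the grouping is dropped at the end only to keep the comparison simple.

```r
library(dplyr)

# Data frame reconstructed from the printed output in this exercise
df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

# summarize(): one row per (y, z) group
summarized <- df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "drop")

# mutate(): every original row is kept, with the group mean attached
mutated <- df |>
  group_by(y, z) |>
  mutate(mean_x = mean(x)) |>
  ungroup()

dim(summarized)  # 3 rows, 3 columns
dim(mutated)     # 5 rows, 4 columns
```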