suppressPackageStartupMessages(library(nycflights13))
suppressPackageStartupMessages(library(tidyverse))

1. Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

To convert a departure time to minutes since midnight, integer-divide dep_time by 100 to get the hours since midnight, multiply by 60, and add the remainder of dep_time divided by 100. For example, 1504 represents 15:04 (or 3:04 PM), which is 904 minutes after midnight. To generalize this approach, we need a way to split out the hour-digits from the minute-digits. Dividing by 100 and discarding the remainder using the integer division operator, %/%, gives us the following.

1504 %/% 100
[1] 15

Instead of %/%, we could also use / along with trunc() or floor(), but round() would not work because it would round the hour up for times late in the hour (for example, 1559 / 100 rounds to 16). To get the minutes, instead of discarding the remainder of the division by 100, we only want the remainder. So we use the modulo operator, %%.

1504 %% 100
[1] 4
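To illustrate the earlier remark about trunc(), floor(), and round(), here is a small sketch using 1559 (15:59) as a hypothetical check value that is not taken from the data. For the non-negative times used here, trunc() and floor() agree with %/%, while round() does not.

```r
# Hypothetical check value: 1559 represents 15:59.
t <- 1559
t / 100         # 15.59
trunc(t / 100)  # 15
floor(t / 100)  # 15
round(t / 100)  # 16, the hour is rounded up
t %/% 100       # 15
```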

Now, we can combine the hours (multiplied by 60 to convert them to minutes) and minutes to get the number of minutes after midnight.

1504 %/% 100 * 60 + 1504 %% 100
[1] 904

There is one remaining issue. Midnight is represented by 2400, which would correspond to 1440 minutes since midnight, but it should correspond to 0. After converting all the times to minutes after midnight, x %% 1440 will convert 1440 to zero while keeping all the other times the same.
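As a quick check of this adjustment, the sketch below applies the conversion to 2400 with and without the final %% 1440, and to 1504 to confirm that an ordinary time is unaffected. The variable names are only illustrative.

```r
# Midnight written as 2400 maps to 1440 minutes before the adjustment ...
midnight_raw <- 2400 %/% 100 * 60 + 2400 %% 100
midnight_raw        # 1440
# ... and to 0 after taking the result modulo 1440.
midnight_raw %% 1440  # 0
# Times before midnight are already below 1440, so they do not change.
afternoon <- (1504 %/% 100 * 60 + 1504 %% 100) %% 1440
afternoon           # 904
```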

Now we will put it all together. The following code creates a new data frame flights_times with columns dep_time_mins and sched_dep_time_mins. These columns convert dep_time and sched_dep_time, respectively, to minutes since midnight.

flights_times <- mutate(flights,
  dep_time_mins = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
  sched_dep_time_mins = (sched_dep_time %/% 100 * 60 +
    sched_dep_time %% 100) %% 1440
)
# view only relevant columns
select(
  flights_times, dep_time, dep_time_mins, sched_dep_time,
  sched_dep_time_mins
)

When we introduce functions, you’ll understand this is precisely the sort of situation where writing a function will allow us to avoid copying and pasting code. We could define a function time2mins(), which converts a vector of times from the format used in flights to minutes since midnight.

time2mins <- function(x) {
  (x %/% 100 * 60 + x %% 100) %% 1440
}

Using time2mins, the previous code simplifies to the following.

flights_times <- mutate(flights,
  dep_time_mins = time2mins(dep_time),
  sched_dep_time_mins = time2mins(sched_dep_time)
)
# show only the relevant columns
select(
  flights_times, dep_time, dep_time_mins, sched_dep_time,
  sched_dep_time_mins
)
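Because time2mins() is built from vectorized arithmetic, it can be checked on a small vector before applying it to the whole data frame. The sketch below uses made-up inputs, including a missing value, since dep_time is missing for some flights; the NA simply propagates to the result.

```r
# Same definition as above, repeated so the example is self-contained.
time2mins <- function(x) {
  (x %/% 100 * 60 + x %% 100) %% 1440
}

# Hypothetical inputs: 15:04, midnight written as 2400, 00:01, and a missing value.
example_times <- c(1504, 2400, 1, NA)
time2mins(example_times)  # 904 0 1 NA
```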

2. Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?

I expect that air_time is the difference between the arrival (arr_time) and departure times (dep_time). In other words, air_time = arr_time - dep_time.

To check this relationship, I’ll first need to convert the times to a form more amenable to arithmetic operations using the same calculations as the previous exercise.

flights_airtime <-
  mutate(flights,
    dep_time = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
    arr_time = (arr_time %/% 100 * 60 + arr_time %% 100) %% 1440,
    air_time_diff = air_time - arr_time + dep_time
  )

So, does air_time = arr_time - dep_time? If so, there should be no flights with non-zero values of air_time_diff.

nrow(filter(flights_airtime, air_time_diff != 0))
[1] 327150

It turns out that there are many flights for which air_time != arr_time - dep_time. Other than data errors, I can think of two reasons why air_time would not equal arr_time - dep_time.

  1. The flight passes midnight, so arr_time < dep_time. In these cases, the difference in airtime should be off by 24 hours (1,440 minutes).

  2. The flight crosses time zones, and the total air time will be off by hours (multiples of 60). All flights in flights departed from New York City and are domestic flights in the US. This means that flights will all be to the same or more westerly time zones. Given the time zones in the US, the differences due to time zone should be 60 minutes (Central), 120 minutes (Mountain), 180 minutes (Pacific), 240 minutes (Alaska), or 300 minutes (Hawaii).
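The first reason can be handled with the same modulo trick used for midnight. The sketch below uses a hypothetical overnight flight (not a row from flights) and assumes that both times are in the same time zone and that the flight lasts less than 24 hours.

```r
# Hypothetical overnight flight: departs 23:30, arrives 00:45 local time.
dep_mins <- (2330 %/% 100 * 60 + 2330 %% 100) %% 1440  # 1410
arr_mins <- (45 %/% 100 * 60 + 45 %% 100) %% 1440      # 45
# The naive difference is negative because the clock wrapped around.
arr_mins - dep_mins            # -1365
# In R, %% takes the sign of the divisor, so this yields the elapsed minutes.
(arr_mins - dep_mins) %% 1440  # 75
```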

Both of these explanations have clear patterns that I would expect to see if they were true. In particular, in both cases, since time zones and crossing midnight only affect the hour part of the time, all values of air_time_diff should be divisible by 60. I’ll visually check this hypothesis by plotting the distribution of air_time_diff. If those two explanations are correct, the distribution of air_time_diff should comprise only spikes at multiples of 60.

ggplot(flights_airtime, aes(x = air_time_diff)) +
  geom_histogram(binwidth = 1)

This is not the case. While the distribution of air_time_diff has modes at multiples of 60 as hypothesized, it shows that there are many flights in which the difference between air time and local arrival and departure times is not divisible by 60.

Let’s also look at flights with Los Angeles as a destination. The discrepancy should be 180 minutes.

ggplot(filter(flights_airtime, dest == "LAX"), aes(x = air_time_diff)) +
  geom_histogram(binwidth = 1)

To fix these time-zone issues, I would want to convert all the times to a date-time to handle overnight flights, and from local time to a common time zone, most likely UTC, to handle flights crossing time-zones. The tzone column of nycflights13::airports gives the time-zone of each airport. We’ll talk a lot more about dates and times later.
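As a rough illustration of why a common time reference helps, the sketch below uses base R date-times rather than the tools introduced later. The flight is hypothetical, and the time zone names are standard Olson identifiers of the kind stored in the tzone column.

```r
# Hypothetical flight: leaves New York at 09:00 and lands in Los Angeles
# at 12:15, both given as local clock times.
dep <- as.POSIXct("2013-01-01 09:00", tz = "America/New_York")
arr <- as.POSIXct("2013-01-01 12:15", tz = "America/Los_Angeles")
# POSIXct values store an absolute instant, so the difference accounts
# for the three-hour offset between the two time zones.
elapsed <- difftime(arr, dep, units = "mins")
elapsed
# Subtracting the local clock times directly would give only 195 minutes.
```

The same subtraction also handles flights that cross midnight, because the date is part of each value.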

But that still leaves the other differences unexplained. So what else might be going on? There seem to be too many problems for this to be data entry problems, so I’m probably missing something. So, I’ll reread the documentation to make sure that I understand the definitions of arr_time, dep_time, and air_time. The documentation contains a link to the source of the flights data, https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236. This documentation shows that the flights data does not contain the variables TaxiIn, TaxiOff, WheelsIn, and WheelsOff. It appears that the air_time variable refers to flight time, which is defined as the time between wheels-off (take-off) and wheels-in (landing). But the flight time does not include time spent on the runway taxiing to and from gates. With this new understanding of the data, I now know that the relationship between air_time, arr_time, and dep_time is air_time <= arr_time - dep_time, supposing that arr_time and dep_time are in the same time zone.

4. Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().

The dplyr package provides multiple functions for ranking, which differ in how they handle tied values: row_number(), min_rank(), and dense_rank(). To see how they work, let’s create a data frame with duplicate values in a vector and see how the ranking functions handle ties.

rankme <- tibble(
  x = c(10, 5, 1, 5, 5)
)
rankme <- mutate(rankme,
  x_row_number = row_number(x),
  x_min_rank = min_rank(x),
  x_dense_rank = dense_rank(x)
)
arrange(rankme, x)

The function row_number() assigns each element a unique value. The result is equivalent to the index (or row) number of each element after sorting the vector, hence its name.

The min_rank() and dense_rank() functions assign tied values the same rank, but differ in how they assign values to the next rank. For each set of tied values, the min_rank() function assigns a rank equal to the number of values less than that tied value plus one. In contrast, the dense_rank() function assigns a rank equal to the number of distinct values less than that tied value plus one. To see the difference between dense_rank() and min_rank(), compare the values of rankme$x_min_rank and rankme$x_dense_rank for x = 10 (5 and 3, respectively).

If I had to choose one for presenting rankings to someone else, I would use min_rank() since its results correspond to the most common usage of rankings in sports or other competitions. In the code below, I use all three functions, but since there are no ties in the top 10 flights, the results don’t differ.

flights_delayed <- mutate(flights,
  dep_delay_min_rank = min_rank(desc(dep_delay)),
  dep_delay_row_number = row_number(desc(dep_delay)),
  dep_delay_dense_rank = dense_rank(desc(dep_delay))
)
flights_delayed <- filter(
  flights_delayed,
  !(dep_delay_min_rank > 10 | dep_delay_row_number > 10 |
    dep_delay_dense_rank > 10)
)
flights_delayed <- arrange(flights_delayed, dep_delay_min_rank)
print(select(
  flights_delayed, month, day, carrier, flight, dep_delay,
  dep_delay_min_rank, dep_delay_row_number, dep_delay_dense_rank
),
n = Inf
)

In addition to the functions covered here, the rank() function provides several more ways of ranking elements.
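As a brief sketch of what base R's rank() offers, the code below applies several of its ties.method options to the same values used in rankme. The "min" and "first" methods correspond to min_rank() and row_number(), respectively.

```r
x <- c(10, 5, 1, 5, 5)
rank(x, ties.method = "min")      # 5 2 1 2 2, same as min_rank(x)
rank(x, ties.method = "average")  # 5 3 1 3 3, the default method
rank(x, ties.method = "max")      # 5 4 1 4 4
rank(x, ties.method = "first")    # 5 2 1 3 4, same as row_number(x)
```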

5. What does 1:3 + 1:10 return? Why?

The code given in the question returns the following.

1:3 + 1:10
longer object length is not a multiple of shorter object length
 [1]  2  4  6  5  7  9  8 10 12 11

This is equivalent to the following.

c(1 + 1, 2 + 2, 3 + 3, 1 + 4, 2 + 5, 3 + 6, 1 + 7, 2 + 8, 3 + 9, 1 + 10)
 [1]  2  4  6  5  7  9  8 10 12 11

When adding two vectors of different lengths, R recycles the shorter vector’s values to get vectors of the same length.

The code also produces a warning that the length of the longer vector is not a multiple of the length of the shorter vector. A warning is provided since often, but not always, this indicates a bug in the code.
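The recycling can be made explicit with rep_len(), which repeats a vector up to a given length. The sketch below shows that the explicit version gives the same result, and that no warning is raised when the longer length is an exact multiple of the shorter one. The variable names are only illustrative.

```r
# Recycle 1:3 by hand to length 10.
recycled <- rep_len(1:3, length.out = 10)
recycled  # 1 2 3 1 2 3 1 2 3 1

# The explicit and implicit versions give the same result.
manual <- recycled + 1:10
automatic <- suppressWarnings(1:3 + 1:10)
identical(manual, automatic)  # TRUE

# No warning here, because 10 is a multiple of 2.
1:2 + 1:10
```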

6. What trigonometric functions does R provide?

All trigonometric functions are described in a single help page, named Trig. You can open the documentation for these functions with ?Trig or by using ? with any of the functions, for example: ?sin.

R provides functions for the three primary trigonometric functions: sine (sin()), cosine (cos()), and tangent (tan()). The input angles to all these functions are in radians.

x <- seq(-3, 7, by = 1 / 2)
sin(pi * x)
 [1] -3.673819e-16 -1.000000e+00  2.449213e-16  1.000000e+00 -1.224606e-16
 [6] -1.000000e+00  0.000000e+00  1.000000e+00  1.224606e-16 -1.000000e+00
[11] -2.449213e-16  1.000000e+00  3.673819e-16 -1.000000e+00 -4.898425e-16
[16]  1.000000e+00  6.123032e-16 -1.000000e+00 -7.347638e-16  1.000000e+00
[21]  8.572244e-16
cos(pi * x)
 [1] -1.000000e+00  3.061516e-16  1.000000e+00 -1.836910e-16 -1.000000e+00
 [6]  6.123032e-17  1.000000e+00  6.123032e-17 -1.000000e+00 -1.836910e-16
[11]  1.000000e+00  3.061516e-16 -1.000000e+00 -4.286122e-16  1.000000e+00
[16]  5.510729e-16 -1.000000e+00 -2.449890e-15  1.000000e+00 -9.803627e-16
[21] -1.000000e+00
tan(pi * x)
 [1]  3.673940e-16 -3.266248e+15  2.449294e-16 -5.443746e+15  1.224647e-16
 [6] -1.633124e+16  0.000000e+00  1.633124e+16 -1.224647e-16  5.443746e+15
[11] -2.449294e-16  3.266248e+15 -3.673940e-16  2.333034e+15 -4.898587e-16
[16]  1.814582e+15 -6.123234e-16  4.081778e+14 -7.347881e-16 -1.020058e+15
[21] -8.572528e-16
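Because these functions expect radians, angles given in degrees need to be converted first. The helper deg2rad() below is a hypothetical convenience function for illustration and is not part of base R.

```r
# Hypothetical helper: convert degrees to radians.
deg2rad <- function(deg) {
  deg * pi / 180
}

sin(deg2rad(30))  # approximately 0.5
cos(deg2rad(60))  # approximately 0.5
tan(deg2rad(45))  # approximately 1
```

The results are only approximately equal to the exact values because of floating-point arithmetic, which is also why the output above contains values such as 1.224606e-16 instead of 0.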

In the previous code, I used the variable pi. R provides the variable pi, which is set to the value of the mathematical constant \(\pi\).

pi
[1] 3.141593

Although R provides the pi variable, there is nothing preventing a user from changing its value. For example, I could redefine pi to 3.14 or any other value.

pi <- 3.14
pi
[1] 3.14
pi <- "Apple"
pi
[1] "Apple"

For that reason, if you are using the built-in pi variable in computations and are paranoid, you may want to always reference it as base::pi.

base::pi
[1] 3.141593

In the previous code block, since the angles were in radians, I wrote them as \(\pi\) times some number. Since it is often easier to write radians as multiples of \(\pi\), R provides some convenience functions that do that. The function sinpi(x) is equivalent to sin(pi * x). The functions cospi() and tanpi() are similarly defined for the cos and tan functions, respectively.

sinpi(x)
 [1]  0 -1  0  1  0 -1  0  1  0 -1  0  1  0 -1  0  1  0 -1  0  1  0
cospi(x)
 [1] -1  0  1  0 -1  0  1  0 -1  0  1  0 -1  0  1  0 -1  0  1  0 -1
tanpi(x)
NaNs produced
 [1]   0 NaN   0 NaN   0 NaN   0 NaN   0 NaN   0 NaN   0 NaN   0 NaN   0
[18] NaN   0 NaN   0

R also provides the inverse trigonometric functions arc-cosine (acos()), arc-sine (asin()), and arc-tangent (atan()). Their results are angles in radians.

x <- seq(-1, 1, by = 1 / 4)
acos(x)
[1] 3.1415927 2.4188584 2.0943951 1.8234766 1.5707963 1.3181161 1.0471976
[8] 0.7227342 0.0000000
asin(x)
[1] -1.5707963 -0.8480621 -0.5235988 -0.2526803  0.0000000  0.2526803
[7]  0.5235988  0.8480621  1.5707963
atan(x)
[1] -0.7853982 -0.6435011 -0.4636476 -0.2449787  0.0000000  0.2449787
[7]  0.4636476  0.6435011  0.7853982

Finally, R provides the function atan2(). Calling atan2(y, x) returns the angle between the x-axis and the vector from (0,0) to (x, y).

atan2(c(1, 0, -1, 0), c(0, 1, 0, -1))
[1]  1.570796  0.000000 -1.570796  3.141593
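The reason for having atan2() in addition to atan() is that the ratio y / x alone does not identify the quadrant. As a small sketch, consider the point (-1, -1), which lies in the third quadrant.

```r
# atan2() uses the signs of both arguments to pick the quadrant.
atan2(-1, -1)  # -2.356194, i.e. -3/4 of pi

# atan() only sees the ratio (-1) / (-1) = 1, so the quadrant is lost.
atan(-1 / -1)  # 0.7853982, i.e. 1/4 of pi
```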
