R for Data Science, Chapter 5

Library functions

library(rmarkdown)
library(knitr)
library(ggplot2)
library(tidyverse)

library(nycflights13)

5.2.4 Exercises

Okay, let’s start by looking at the data frame itself!

nycflights13::flights
flights <- nycflights13::flights

Okay, so now let us move on to the filtering!

1. Find all flights that

a. Had an arrival delay of two or more hours

Okay, for this one we just follow the filter() function introduced in 5.2. One thing to note: arr_delay is recorded in minutes, so two or more hours should be entered as 120 minutes.

filter(flights, arr_delay >= 120)

b. Flew to Houston (IAH or HOU)

There are two options for this question: you can either compare dest with == and join the two tests with |, or you can use the %in% operator, which checks whether each value appears in a set.

filter(flights, dest == "IAH" | dest == "HOU")

filter(flights, dest %in% c ("IAH", "HOU"))
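To see that the two forms pick the same rows, here is a minimal base R sketch on a made-up dest vector (the values are illustrative, not taken from flights). The one difference is how a missing value is treated, which does not matter inside filter() because it drops both NA and FALSE.

```r
# Toy destination codes, including one missing value (hypothetical data).
dest <- c("IAH", "JFK", "HOU", NA)

# Option 1: two equality tests joined with "or".
with_or <- dest == "IAH" | dest == "HOU"

# Option 2: membership test against a set of codes.
with_in <- dest %in% c("IAH", "HOU")

# Both flag the same positions as TRUE.
which(with_or)
which(with_in)

# For the missing value, == gives NA while %in% gives FALSE.
with_or[4]
with_in[4]
```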

c. Were operated by United, American, or Delta

First thing I’m going to do is to see the various airline carriers to see what the different initials are.

unique(flights$carrier)
 [1] "UA" "AA" "B6" "DL" "EV" "MQ" "US" "WN" "VX" "FL" "AS" "9E" "F9" "HA" "YV" "OO"
airlines

It looks like I’m going to go for UA, AA, DL

filter (flights, carrier %in% c ("UA", "AA", "DL"))

d. Departed in summer (July, August, and September)

filter (flights, month %in% c ("July", "August", "September"))

That returns zero rows because month is stored as a number, not as a month name! There are also a lot of different ways to approach this question; below is one option.

filter (flights, month %in% c (7, 8, 9))

For this I went with c(), but I could just as easily have used 7:9. Because month is numeric, you can also use the comparison strategy from question 1a.

filter (flights, month >= 7, month <= 9)
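As a quick check that these approaches agree, here is a small base R sketch on a made-up month vector (illustrative values only). It also shows why the month names matched nothing.

```r
# Toy month numbers, stored as integers like the month column.
month <- c(6L, 7L, 8L, 9L, 10L)

# Matching against names finds nothing, because the data holds numbers.
by_name <- month %in% c("July", "August", "September")

# Three equivalent ways to pick July through September.
by_c       <- month %in% c(7, 8, 9)
by_range   <- month %in% 7:9
by_compare <- month >= 7 & month <= 9

by_name
by_c
```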

One way to confirm that the three months were all selected is to do a summary of this filter.

months.flights <- filter (flights, month >= 7, month <= 9)

summary(months.flights)
      year          month            day           dep_time    sched_dep_time   dep_delay          arr_time    sched_arr_time   arr_delay     
 Min.   :2013   Min.   :7.000   Min.   : 1.00   Min.   :   1   Min.   : 106   Min.   : -26.00   Min.   :   1   Min.   :   1   Min.   : -68.0  
 1st Qu.:2013   1st Qu.:7.000   1st Qu.: 8.00   1st Qu.: 904   1st Qu.: 905   1st Qu.:  -5.00   1st Qu.:1051   1st Qu.:1115   1st Qu.: -19.0  
 Median :2013   Median :8.000   Median :16.00   Median :1356   Median :1359   Median :  -1.00   Median :1523   Median :1550   Median :  -7.0  
 Mean   :2013   Mean   :7.979   Mean   :15.88   Mean   :1346   Mean   :1342   Mean   :  13.79   Mean   :1485   Mean   :1525   Mean   :   6.4  
 3rd Qu.:2013   3rd Qu.:9.000   3rd Qu.:23.00   3rd Qu.:1743   3rd Qu.:1729   3rd Qu.:  11.00   3rd Qu.:1931   3rd Qu.:1938   3rd Qu.:  13.0  
 Max.   :2013   Max.   :9.000   Max.   :31.00   Max.   :2400   Max.   :2359   Max.   :1014.00   Max.   :2400   Max.   :2359   Max.   :1007.0  
                                                NA's   :1878                  NA's   :1878      NA's   :2053                  NA's   :2267    
   carrier              flight       tailnum             origin              dest              air_time        distance         hour           minute    
 Length:86326       Min.   :   1   Length:86326       Length:86326       Length:86326       Min.   : 21.0   Min.   :  17   Min.   : 1.00   Min.   : 0.0  
 Class :character   1st Qu.: 583   Class :character   Class :character   Class :character   1st Qu.: 79.0   1st Qu.: 502   1st Qu.: 9.00   1st Qu.:10.0  
 Mode  :character   Median :1543   Mode  :character   Mode  :character   Mode  :character   Median :122.0   Median : 833   Median :13.00   Median :29.0  
                    Mean   :1981                                                            Mean   :146.2   Mean   :1054   Mean   :13.16   Mean   :26.8  
                    3rd Qu.:3395                                                            3rd Qu.:187.0   3rd Qu.:1400   3rd Qu.:17.00   3rd Qu.:45.0  
                    Max.   :6181                                                            Max.   :640.0   Max.   :4983   Max.   :23.00   Max.   :59.0  
                                                                                            NA's   :2267                                                 
   time_hour                  
 Min.   :2013-07-01 05:00:00  
 1st Qu.:2013-07-23 19:00:00  
 Median :2013-08-15 09:00:00  
 Mean   :2013-08-15 18:22:37  
 3rd Qu.:2013-09-07 16:00:00  
 Max.   :2013-09-30 23:00:00  
                              

e. Arrived more than two hours late, but didn’t leave late

filter (flights, arr_delay > 120, dep_delay == 0)

This works only if you ignore the rest of the data: some flights leave even before “on time,” which shows up as a negative dep_delay. Which just sounds annoying as a passenger. To count those flights as not leaving late, the condition has to be dep_delay <= 0.

unique(flights$dep_delay)
  [1]    2    4   -1   -6   -4   -5   -3   -2    0    1   -8    8   11    3   13   24   -7   -9    9   47   39  -10    5  101    7   71  -11  853    6   43
 [31]   23   59   12   14   15   21   25   29  -15  -13   32  144   10   34   16   18  134   96   30   41   55   37   26   77   22   40   57   70   56   35
 [61]   31  115   38   50   27  105  122   88   64  119   54   84   33   42   52   82   36  -14   91   62  103   74  290  260   61   63  131   19   46  129
 [91]  155  157  216   73  121  109   51   72  255   49  285  -12  141  192   83  116  379   NA  156   20   80   79  107   45  179   75   28  104   17  100
[121]   65  120  224   90  268  334   67  139   69   99  128  337   76   98  133   85  181   53   66  102   97   48  168  180  164  158  140  175  108  125
[151]  185   68  111   93   44  126  162  174   58   86  171   81  123  252   60  106   78  291  177  137  -17  110  118  114   89  145  208  288  -19  142
[181]  203  127   95  257   94  327  225  -16  117  163  202  151  152   87  293  112  178  366  188  148  143 1301  253 1126  196  385  307  167  360  -30
[211]  -18  241  221  -22  282  -20  213  315  153  135  138  599  149  176   92  195  154  113  183  220  172  229  186  193  266  246  170  124  502  204
[241]  212  214  207  231  187  130  308  281  227  146  205  132  222  242  259  238  173  292  147  275  189  -21  161  276  251  198  150  271  256  274
[271]  159  190  165  233  160  199  239  194  166  184  478  191  210  209  318  169  329  262  230  336  323  294  206  254  270  328  197  182  287  349
[301]  211  136  295  201  280  -27  217  235  265  250  237  234  248  228  245  243  240  279  306  232  218  219  278  342  378  223  364  226  247  200
[331]  272  324  352  286  317  702  297  316  289  390  387  373  -25  215  310  -23  311  322  798  413  -32  335  367  299  305  277  398  351  339  341
[361]  347  636  298  312  302  408  313  687  303  244  896  283  309  300  389  333  405  361  431  301  382  330  261  548  340  273  249  374  -43  321
[391]  264  368  263  236  825  660  284  392  503  845  432  849  356  296  486  314  420  -33  269  415  319  355  320  592  747  788  786  404  354  -24
[421]  430  470  332  258  376  346  348  393  394  383  406  911  359  371  331  800  443  345  326  545  384  325  440  437  639  960  510  414  350  267
[451]  427  761  797  753  812  381  357  423  375  434  878  397  391  504  471  533  494  369  401  475  410  466  386  613  419  380  446  388  343  447
[481] 1137  467  426  790  787  803  899  436  396  500  454  411  363  353  589  452  304  629  653  399  362  453  479  634  576  409  580  483  365 1005
[511]  898  536  370  344  372  520  -26  508  338  424  696  473  514  358  602  593 1014  422
filter (flights, arr_delay > 120, dep_delay <= 0)

f. Were delayed by at least an hour, but made up over 30 minutes in flight

There is no explicit column for time made up in flight. However, you can compare the departure delay with the arrival delay: if time was made up in the air, then the arrival delay will be smaller than the departure delay, and the difference between the two is the time made up.

filter (flights, dep_delay >= 60, dep_delay - arr_delay > 30)
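The arithmetic behind that filter can be worked through by hand. This is a minimal base R sketch with made-up delays (in minutes); the numbers are assumptions for illustration.

```r
# Hypothetical delays in minutes for four flights.
dep_delay <- c(60, 90, 75, 45)
arr_delay <- c(20, 70, 45, 0)

# Minutes made up in the air: how much smaller the arrival delay is.
made_up <- dep_delay - arr_delay

# Same two conditions as the filter: left at least an hour late
# and made up more than 30 minutes.
keep <- dep_delay >= 60 & made_up > 30

made_up
keep
```

Only the first flight qualifies: the third made up exactly 30 minutes, which is not "over 30", and the fourth was not delayed by a full hour.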

g. Departed between midnight and 6am (inclusive)

summary (flights$dep_time)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      1     907    1401    1349    1744    2400    8255 

This summary, alongside the tables earlier in this exercise, indicates that the departure and arrival times are stored as numbers in HHMM form. The max is 2400 (midnight) and the min is 1 (12:01am). Because midnight is recorded as 2400 rather than 0, it needs its own condition.

filter (flights, dep_time <= 600 | dep_time == 2400)

2. Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?

You can look this up through the help function

?between

between() is a shortcut for many of the range filters above: it is a "shortcut for x >= left & x <= right" and is used as between (x, left, right). It can replace any answer that checks a range, such as the months in question d. Let’s look at question g since we just talked about it. Note that between() only covers the stretch from 1 to 600, so midnight (recorded as 2400) still needs its own condition.

filter (flights, dep_time <= 600 | dep_time == 2400)
filter (flights, between (dep_time, 0, 600) | dep_time == 2400)
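Here is a small sketch of how between() behaves at the edges, using a made-up vector of departure times (the values are assumptions, chosen to hit the boundaries). It relies on dplyr being installed.

```r
library(dplyr)

# Hypothetical departure times in HHMM form, plus one missing value.
dep_time <- c(1, 555, 600, 601, 2359, 2400, NA)

# between() includes both ends; midnight (2400) is added separately.
early <- between(dep_time, 0, 600) | dep_time == 2400

early
```

The missing departure time comes out as NA, and filter() would drop that row.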

3. How many flights have a missing dep_time? What other variables are missing? What might these rows represent?

filter (flights, is.na (dep_time))

There are 8,255 flights that are missing a departure time. Each of these 8,255 flights is also missing its departure delay, arrival time, and arrival delay. What this indicates is that if one is missing, the others will be missing as well. Of course, that is an obvious observation. This all boils down to the likelihood that these flights were canceled. If a plane didn’t depart, it wouldn’t arrive.
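One way to see which variables are missing, and how often, is to count the NA values per column. This is a minimal base R sketch on a made-up data frame (toy is hypothetical); on the real data the same idea would be colSums(is.na(flights)).

```r
# Hypothetical mini version of flights: two flights never departed.
toy <- data.frame(
  dep_time  = c(517, NA, 533, NA),
  dep_delay = c(2, NA, 4, NA),
  arr_time  = c(830, NA, 850, NA),
  distance  = c(1400, 1416, 1089, 200)
)

# is.na() marks each missing cell; colSums() counts them per column.
na_counts <- colSums(is.na(toy))

na_counts
```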

4. Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)

NA ^ 0
[1] 1
NA | TRUE
[1] TRUE
NA & FALSE
[1] FALSE

NA ^ 0 comes out to 1 because anything raised to the 0 power is 1, so the answer is 1 no matter what number the missing value stands for.

NA | TRUE is not missing because an “or” is TRUE as soon as either side is TRUE. No matter what the missing value is (TRUE or FALSE), the result is TRUE.

NA & FALSE is not missing for basically the same reason, but in reverse. An “and” is FALSE as soon as either side is FALSE, so anything & FALSE is ultimately FALSE. No matter what the missing value is (TRUE or FALSE), the result is FALSE.

I could be completely wrong, but I believe the general rule is that an operation on NA is only missing when the answer depends on what the missing value is. If every possible value gives the same answer, R returns that answer. NA | FALSE would be missing, because TRUE | FALSE is TRUE while FALSE | FALSE is FALSE. NA * 0 is the tricky one: it looks like it should always be 0, but Inf * 0 is NaN, so the answer is not the same for every possible value and R returns NA.
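The cases can be checked side by side in base R. This sketch collects each result in a list (results is just a name chosen for illustration) so the known and unknown outcomes are easy to compare.

```r
# Each entry asks: is the answer the same whatever the NA stands for?
results <- list(
  or_true        = NA | TRUE,   # TRUE either way
  and_false      = NA & FALSE,  # FALSE either way
  or_false       = NA | FALSE,  # depends on the missing value
  and_true       = NA & TRUE,   # depends on the missing value
  power_zero     = NA^0,        # any number to the power 0 is 1
  times_zero     = NA * 0,      # not always 0, see the next entry
  inf_times_zero = Inf * 0      # NaN, the reason NA * 0 stays missing
)

str(results)
```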

5.3.1 Exercises

1. How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).

arrange (flights, dep_time) %>%
  tail()

So, by just using arrange() the missing values are not sorted to the start. They are always placed at the end. We want to bring the missing values to the start instead (keeping them in their original order, so that January appears first rather than September).

To do that, we take the hint from the question: is.na() returns TRUE for the missing values. We use it alongside the descending-order function (desc()) that is also described in chapter 5 of R for Data Science, so that the TRUE rows come first.

arrange (flights, desc(is.na(dep_time)))
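A possible extension, sketched here on a made-up tibble (toy and its id column are hypothetical), adds dep_time as a second sort key so the non-missing rows are also ordered after the missing ones.

```r
library(dplyr)

# Hypothetical flights: ids 2 and 4 have no departure time.
toy <- tibble(
  id       = 1:5,
  dep_time = c(900, NA, 517, NA, 1200)
)

# First key puts the missing rows on top; second key sorts the rest.
na_first <- arrange(toy, desc(is.na(dep_time)), dep_time)

na_first
```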

2a. Sort flights to find the most delayed flights.

This is a lot more straightforward. Just do what I did above, and forget about the is.na(). We don’t care about those missing numbers, because we only care about those poor people who had their flights delayed MOST due to weather, mechanical issues, or geese over the Hudson. Because we want the highest values, we use the descending order function.

arrange (flights, desc(dep_delay))

2b. Find the flights that left earliest.

Because we want the lowest numbers, we just ditch the desc.

arrange (flights, dep_delay)

This still baffles me. How can a flight leave early? I’d still be grabbing a cup of coffee from Dunkin.

3. Sort flights to find the fastest (highest speed) flights.

This kind of left me baffled for a bit, because how do you define “fastest”? Looking at the columns we already have, I went with air_time. Strictly speaking, that finds the shortest time in the air rather than the highest speed, because just because a flight was shorter doesn’t mean it was faster. Some people blow money to fly from BWI to Richmond, VA. That plane may take its time for all we know.

arrange (flights, air_time)
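A stricter reading of “highest speed” divides distance by air_time. In flights, distance is in miles and air_time is in minutes, so the ratio is miles per minute. The sketch below uses a made-up tibble (toy and its values are assumptions) to show that the two orderings can disagree.

```r
library(dplyr)

# Hypothetical flights: A is the shortest hop, C covers the most ground.
toy <- tibble(
  flight   = c("A", "B", "C"),
  distance = c(200, 1000, 2500),  # miles
  air_time = c(40, 150, 300)      # minutes
)

# Shortest time in the air first.
shortest <- arrange(toy, air_time)

# Highest average speed (miles per minute) first.
fastest <- arrange(toy, desc(distance / air_time))

shortest$flight
fastest$flight
```

On the real data the same call would be arrange (flights, desc(distance / air_time)).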

4a. Which flights traveled the farthest?

Just like above, you could interpret this question as being related to air time, but I decided to go with distance. Just to mix up what I was sorting.

arrange (flights, desc(distance))

4b. Which traveled the shortest?

arrange (flights, distance)

5.4.1 Exercises

1. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

Well, we can first do it the simple way. By name.

select(flights, dep_time, dep_delay, arr_time, arr_delay)

You can select the variables through their column numbers.

select (flights, 4, 6, 7, 9)

You can use starts_with()

select (flights, starts_with("dep_"), starts_with ("arr"))

You can use all_of() and any_of(), both of which we will talk more about for question 3.

select (flights, all_of (c("dep_time", "dep_delay", "arr_time", "arr_delay")))
select (flights, any_of (c("dep_time", "dep_delay", "arr_time", "arr_delay")))

For other options, and they are almost “endless,” you can look at ?select. Plus, there are other ways of approaching this.

?select

2. What happens if you include the name of a variable multiple times in a select() call?

select (flights, year, month, day, dep_time, dep_time)

It looks like it just skips over the repeated variable, since departure time was only displayed once.
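The same behavior can be confirmed by checking the column names of the result. This is a small sketch on a made-up tibble (toy is hypothetical).

```r
library(dplyr)

# Hypothetical one-row table with two of the flights columns.
toy <- tibble(year = 2013, dep_time = 517)

# dep_time is named twice but only appears once in the result.
picked <- names(select(toy, year, dep_time, dep_time))

picked
```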

3a. What does the any_of() function do?

?any_of

It allows you to select variables named in a character vector, like all_of(). The difference is that any_of() does not check that every name exists: names that are not in the data frame are silently ignored, where all_of() would raise an error. That makes any_of() great for negative selections, since you can drop columns without worrying whether they are actually there.

3b. Why might it be helpful in conjunction with this vector?

vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, any_of(vars))

It is useful because it cuts to the chase: you don’t have to retype the variable names every time you use them. By storing the names in a vector you streamline your code a bit, both for ease and for consistency, and any_of() keeps working even if one of the names is not in the data frame.
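The difference between any_of() and all_of() only shows up when a name is absent. In flights all five names in vars exist, so this sketch uses a made-up tibble (toy is hypothetical) that lacks day and arr_delay.

```r
library(dplyr)

# Hypothetical table missing two of the requested columns.
toy  <- tibble(year = 2013, month = 1, dep_delay = 2)
vars <- c("year", "month", "day", "dep_delay", "arr_delay")

# any_of() keeps the names it finds and ignores the rest.
picked <- select(toy, any_of(vars))

# all_of() is strict: a missing name is an error, caught here.
strict <- tryCatch(
  select(toy, all_of(vars)),
  error = function(e) "error"
)

# Negative selection: drop columns whether or not they exist.
dropped <- select(toy, -any_of(c("day", "year")))

names(picked)
strict
names(dropped)
```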

4a. Does the result of running the following code surprise you?

select(flights, contains("TIME"))

A bit? Looking more into the function afterwards, it makes plenty of sense. R is generally case sensitive, but contains() ignores case by default. In the case of this code, it matches any column in your data frame whose name contains “time,” no matter the case involved.

4b. How do the select helpers deal with case by default?

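By default the select helpers ignore case: starts_with(), ends_with(), contains(), and matches() all have ignore.case = TRUE. This is essentially in place to make sure that nothing is left out unintentionally. I believe it is just to make the user interface easier, and it helps alleviate any potential human error.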

4c. How can you change that default?

select(flights, contains("TIME", ignore.case = FALSE))
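To see the effect of the argument, here is a minimal sketch on a made-up tibble. The TIME_zone column is hypothetical and exists only so the case-sensitive match has something to find; flights has no upper-case column names.

```r
library(dplyr)

# Hypothetical table mixing lower-case and upper-case "time".
toy <- tibble(
  dep_time  = 517,
  arr_time  = 830,
  TIME_zone = "EST",
  carrier   = "UA"
)

# Default: case is ignored, so all three "time" columns match.
default_match <- names(select(toy, contains("TIME")))

# Case-sensitive: only the column with upper-case TIME matches.
strict_match <- names(select(toy, contains("TIME", ignore.case = FALSE)))

default_match
strict_match
```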
