library(rmarkdown)
library(knitr)
library(ggplot2)
library(tidyverse)
library(nycflights13)
Okay, let’s start by looking at the data frame itself!
nycflights13::flights
flights <- nycflights13::flights
Okay, so now let us move on to the filtering!
1. Find all flights that
a. Had an arrival delay of two or more hours
Okay, for this one we just use the filter() function introduced in section 5.2. One thing to note: arr_delay is recorded in minutes, so two or more hours should be entered as 120 minutes.
filter(flights, arr_delay >= 120)
b. Flew to Houston (IAH or HOU)
There are two options for this question: you can either chain two dest == comparisons with | (or), or you can use the matching operator %in%.
filter(flights, dest == "IAH" | dest == "HOU")
filter(flights, dest %in% c ("IAH", "HOU"))
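To convince myself the two versions really are the same, here is a minimal sketch on a made-up toy table (the toy data frame and its values are my own assumption, not rows from flights). It also shows why you can't just write == against a vector of codes.

```r
library(dplyr)

# Hypothetical toy data standing in for flights
toy <- tibble(
  flight = 1:5,
  dest   = c("IAH", "ORD", "HOU", "IAH", "LAX")
)

# Option 1: chain two comparisons with |
by_or <- filter(toy, dest == "IAH" | dest == "HOU")

# Option 2: ask whether dest is one of the listed codes
by_in <- filter(toy, dest %in% c("IAH", "HOU"))

same_rows <- identical(by_or, by_in)

# Pitfall: == compares element by element, %in% checks membership
pairwise   <- c("HOU", "IAH") == c("IAH", "HOU")    # FALSE FALSE
membership <- c("HOU", "IAH") %in% c("IAH", "HOU")  # TRUE TRUE
```

The %in% version is the one that scales: adding a third airport is one more string in the vector instead of another | clause.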
c. Were operated by United, American, or Delta
First thing I’m going to do is to see the various airline carriers to see what the different initials are.
unique(flights$carrier)
[1] "UA" "AA" "B6" "DL" "EV" "MQ" "US" "WN" "VX" "FL" "AS" "9E" "F9" "HA" "YV" "OO"
airlines
It looks like I’m going to go for UA, AA, DL
filter (flights, carrier %in% c ("UA", "AA", "DL"))
d. Departed in summer (July, August, and September)
filter (flights, month %in% c ("July", "August", "September"))
The reason that did not work is that the months are stored as numbers, not names! There are also a lot of different ways to approach this question; below is one option.
filter (flights, month %in% c (7, 8, 9))
For this I went with c, but I just as easily could have done 7:9. Because it is numerical, you can also use the strategy we did for question 1a.
filter (flights, month >= 7, month <= 9)
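As a quick check that these approaches agree, here is a small sketch on a made-up table (the toy data is hypothetical, chosen so months 6 and 10 sit just outside the range).

```r
library(dplyr)

# Hypothetical toy data: one flight per month from June to October
toy <- tibble(month = 6:10, flight = 101:105)

by_c     <- filter(toy, month %in% c(7, 8, 9))
by_colon <- filter(toy, month %in% 7:9)
by_range <- filter(toy, month >= 7, month <= 9)

# Month names never match the numeric codes, so this returns zero rows
by_name <- filter(toy, month %in% c("July", "August", "September"))

all_same <- identical(by_c, by_colon) && identical(by_colon, by_range)
```

Note that the month-name version does not throw an error. It just quietly returns an empty table, which is why it is worth looking at the row count.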
One way to confirm that the three months were all selected is to do a summary of this filter.
months.flights <- filter (flights, month >= 7, month <= 9)
summary(months.flights)
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
Min. :2013 Min. :7.000 Min. : 1.00 Min. : 1 Min. : 106 Min. : -26.00 Min. : 1 Min. : 1 Min. : -68.0
1st Qu.:2013 1st Qu.:7.000 1st Qu.: 8.00 1st Qu.: 904 1st Qu.: 905 1st Qu.: -5.00 1st Qu.:1051 1st Qu.:1115 1st Qu.: -19.0
Median :2013 Median :8.000 Median :16.00 Median :1356 Median :1359 Median : -1.00 Median :1523 Median :1550 Median : -7.0
Mean :2013 Mean :7.979 Mean :15.88 Mean :1346 Mean :1342 Mean : 13.79 Mean :1485 Mean :1525 Mean : 6.4
3rd Qu.:2013 3rd Qu.:9.000 3rd Qu.:23.00 3rd Qu.:1743 3rd Qu.:1729 3rd Qu.: 11.00 3rd Qu.:1931 3rd Qu.:1938 3rd Qu.: 13.0
Max. :2013 Max. :9.000 Max. :31.00 Max. :2400 Max. :2359 Max. :1014.00 Max. :2400 Max. :2359 Max. :1007.0
NA's :1878 NA's :1878 NA's :2053 NA's :2267
carrier flight tailnum origin dest air_time distance hour minute
Length:86326 Min. : 1 Length:86326 Length:86326 Length:86326 Min. : 21.0 Min. : 17 Min. : 1.00 Min. : 0.0
Class :character 1st Qu.: 583 Class :character Class :character Class :character 1st Qu.: 79.0 1st Qu.: 502 1st Qu.: 9.00 1st Qu.:10.0
Mode :character Median :1543 Mode :character Mode :character Mode :character Median :122.0 Median : 833 Median :13.00 Median :29.0
Mean :1981 Mean :146.2 Mean :1054 Mean :13.16 Mean :26.8
3rd Qu.:3395 3rd Qu.:187.0 3rd Qu.:1400 3rd Qu.:17.00 3rd Qu.:45.0
Max. :6181 Max. :640.0 Max. :4983 Max. :23.00 Max. :59.0
NA's :2267
time_hour
Min. :2013-07-01 05:00:00
1st Qu.:2013-07-23 19:00:00
Median :2013-08-15 09:00:00
Mean :2013-08-15 18:22:37
3rd Qu.:2013-09-07 16:00:00
Max. :2013-09-30 23:00:00
e. Arrived more than two hours late, but didn’t leave late
filter (flights, arr_delay > 120, dep_delay == 0)
This works only if you don’t look at the rest of the data: some flights leave even before “on time,” so their dep_delay is negative and == 0 misses them. Which just sounds annoying as a passenger.
unique(flights$dep_delay)
[1] 2 4 -1 -6 -4 -5 -3 -2 0 1 -8 8 11 3 13 24 -7 -9 9 47 39 -10 5 101 7 71 -11 853 6 43
[31] 23 59 12 14 15 21 25 29 -15 -13 32 144 10 34 16 18 134 96 30 41 55 37 26 77 22 40 57 70 56 35
[61] 31 115 38 50 27 105 122 88 64 119 54 84 33 42 52 82 36 -14 91 62 103 74 290 260 61 63 131 19 46 129
[91] 155 157 216 73 121 109 51 72 255 49 285 -12 141 192 83 116 379 NA 156 20 80 79 107 45 179 75 28 104 17 100
[121] 65 120 224 90 268 334 67 139 69 99 128 337 76 98 133 85 181 53 66 102 97 48 168 180 164 158 140 175 108 125
[151] 185 68 111 93 44 126 162 174 58 86 171 81 123 252 60 106 78 291 177 137 -17 110 118 114 89 145 208 288 -19 142
[181] 203 127 95 257 94 327 225 -16 117 163 202 151 152 87 293 112 178 366 188 148 143 1301 253 1126 196 385 307 167 360 -30
[211] -18 241 221 -22 282 -20 213 315 153 135 138 599 149 176 92 195 154 113 183 220 172 229 186 193 266 246 170 124 502 204
[241] 212 214 207 231 187 130 308 281 227 146 205 132 222 242 259 238 173 292 147 275 189 -21 161 276 251 198 150 271 256 274
[271] 159 190 165 233 160 199 239 194 166 184 478 191 210 209 318 169 329 262 230 336 323 294 206 254 270 328 197 182 287 349
[301] 211 136 295 201 280 -27 217 235 265 250 237 234 248 228 245 243 240 279 306 232 218 219 278 342 378 223 364 226 247 200
[331] 272 324 352 286 317 702 297 316 289 390 387 373 -25 215 310 -23 311 322 798 413 -32 335 367 299 305 277 398 351 339 341
[361] 347 636 298 312 302 408 313 687 303 244 896 283 309 300 389 333 405 361 431 301 382 330 261 548 340 273 249 374 -43 321
[391] 264 368 263 236 825 660 284 392 503 845 432 849 356 296 486 314 420 -33 269 415 319 355 320 592 747 788 786 404 354 -24
[421] 430 470 332 258 376 346 348 393 394 383 406 911 359 371 331 800 443 345 326 545 384 325 440 437 639 960 510 414 350 267
[451] 427 761 797 753 812 381 357 423 375 434 878 397 391 504 471 533 494 369 401 475 410 466 386 613 419 380 446 388 343 447
[481] 1137 467 426 790 787 803 899 436 396 500 454 411 363 353 589 452 304 629 653 399 362 453 479 634 576 409 580 483 365 1005
[511] 898 536 370 344 372 520 -26 508 338 424 696 473 514 358 602 593 1014 422
filter (flights, arr_delay > 120, dep_delay <= 0)
f. Were delayed by at least an hour, but made up over 30 minutes in flight
There is no explicit column for time made up in flight. However, you can compare the departure delay with the arrival delay. If time was made up in the air, then the arrival delay is smaller than the departure delay, and dep_delay - arr_delay is the number of minutes made up.
filter (flights, dep_delay >= 60, dep_delay - arr_delay > 30)
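The arithmetic is easier to see with numbers, so here is a minimal sketch with three made-up flights (the toy values and the made_up column name are my own assumptions for illustration).

```r
library(dplyr)

# Hypothetical toy flights
toy <- tibble(
  flight    = c("A", "B", "C"),
  dep_delay = c(90, 75, 30),   # minutes late leaving
  arr_delay = c(50, 60, -10)   # minutes late arriving
)

# Minutes recovered in the air: A = 40, B = 15, C = 40
toy <- mutate(toy, made_up = dep_delay - arr_delay)

# Delayed at least an hour AND made up more than 30 minutes
result <- filter(toy, dep_delay >= 60, made_up > 30)
```

Only flight A survives: B did not make up enough time, and C made up 40 minutes but was only 30 minutes late to begin with.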
g. Departed between midnight and 6am (inclusive)
summary (flights$dep_time)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1 907 1401 1349 1744 2400 8255
This summary, alongside the tables throughout this prompt, indicates that the departure and arrival times are stored as numbers in HHMM form. The max is 2400 (midnight) and the min is 1 (12:01am). Because midnight is recorded as 2400 rather than 0, it has to be included explicitly.
filter (flights, dep_time <= 600 | dep_time == 2400)
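To see how a midnight departure behaves, here is a small sketch on made-up departure times (the toy vector is hypothetical). A range check that stops at 600 leaves out a 2400 departure, so it needs its own condition.

```r
library(dplyr)

# Hypothetical departure times in HHMM form, including midnight and a missing value
toy <- tibble(dep_time = c(1, 559, 600, 601, 2359, 2400, NA))

# Range only: keeps 1, 559, 600 and drops the 2400 departure
range_only <- filter(toy, dep_time > 0, dep_time <= 600)

# Range plus an explicit midnight check: keeps 1, 559, 600, 2400
with_midnight <- filter(toy, dep_time <= 600 | dep_time == 2400)
```

The NA row is dropped in both cases, because filter() only keeps rows where the condition is TRUE.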
2. Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?
You can look this up through the help function
?between
between() is a shortcut for many of the filters above: per the help page, it is a "shortcut for x >= left & x <= right" and is called as between(x, left, right). It can replace any of the range checks above (months 7 through 9, for example). Let’s look at the range check from question g. between() is inclusive at both ends, but no dep_time is 0, so both lines return the same rows.
filter (flights, dep_time >0, dep_time <= 600)
filter (flights, between (dep_time, 0, 600))
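Here is a minimal sketch of what between() returns on a plain vector, compared with the long-hand version (the vector x is made up for illustration).

```r
library(dplyr)

# Hypothetical values, including both boundaries and a missing value
x <- c(0, 1, 600, 601, NA)

short_hand <- between(x, 1, 600)
long_hand  <- x >= 1 & x <= 600

# Both give FALSE TRUE TRUE FALSE NA
same <- identical(short_hand, long_hand)
```

Both ends are inclusive, and a missing value stays missing rather than turning into FALSE.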
3. How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
filter (flights, is.na (dep_time))
There are 8,255 flights missing a departure time. Each of these flights is also missing its departure delay, arrival time, and arrival delay. That makes sense: these were most likely canceled flights. If a plane didn’t depart, it wouldn’t arrive.
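Rather than scrolling through the filtered table, one way to count the missing values in every column at once is colSums(is.na(...)). This is a sketch on a made-up data frame (the toy rows are hypothetical); the same call should work on flights.

```r
# Hypothetical toy rows: two "canceled" flights with no departure or arrival info
toy <- data.frame(
  dep_time  = c(517, NA, 542, NA),
  dep_delay = c(2, NA, -1, NA),
  arr_time  = c(830, NA, 923, NA),
  origin    = c("EWR", "JFK", "LGA", "JFK")
)

# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
```

Columns that are known in advance, like origin, come back as 0, while the columns that only exist once a plane actually leaves share the same count.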
4. Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)
NA ^ 0
[1] 1
NA | TRUE
[1] TRUE
NA & FALSE
[1] FALSE
NA ^ 0 comes out to 1 because this code means x to the 0 power, and any number raised to the 0 power is 1, whatever the missing value happens to be.
NA | TRUE is TRUE rather than missing because | only needs one side to be TRUE. No matter what the missing value is (TRUE or FALSE), the result is TRUE.
NA & FALSE is FALSE for basically the same reason, but inverted: & needs both sides to be TRUE, so anything & FALSE is FALSE. No matter what the missing value is (TRUE or FALSE), the result is FALSE.
I could be completely wrong, but I believe the general rule is that an operation involving NA is only missing when the answer depends on what the missing value is. If every possible value gives the same answer, R returns that answer. NA | FALSE stays missing because TRUE | FALSE is TRUE but FALSE | FALSE is FALSE. NA * 0 is the tricky one: it looks like it should be 0, but Inf * 0 is NaN, so the answer is not the same for every possible value and R returns NA.
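The cases are easy to run side by side in base R. This sketch just stores each result in a variable (the variable names are my own) so the determined and undetermined cases can be compared.

```r
# Cases where the answer is the same whatever the missing value is
pow_zero  <- NA ^ 0       # 1
or_true   <- NA | TRUE    # TRUE
and_false <- NA & FALSE   # FALSE

# Cases where the answer depends on the missing value, so it stays NA
or_false  <- NA | FALSE   # NA
and_true  <- NA & TRUE    # NA

# The counterexample: not every number times 0 is 0
times_zero <- NA * 0      # NA
inf_zero   <- Inf * 0     # NaN
```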
1. How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).
arrange (flights, dep_time) %>%
tail()
So, by just using arrange() you are not sorting the missing values to the start: arrange() always puts them at the end.
To bring them to the start, we take the hint from the question. is.na(dep_time) is TRUE for the missing rows, and wrapping it in the re-ordering function desc() (also described in R for Data Science chapter 5) sorts those TRUE rows first.
arrange (flights, desc(is.na(dep_time)))
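Here is a minimal sketch of the difference on a made-up four-row table (the toy data is hypothetical). I also add dep_time as a second sort key, which is my own addition so the non-missing rows come out in time order.

```r
library(dplyr)

# Hypothetical toy data with two missing departure times
toy <- tibble(flight = 1:4, dep_time = c(900, NA, 515, NA))

# Default: missing values always go last (515, 900, NA, NA)
default_order <- arrange(toy, dep_time)

# is.na() is TRUE for missing rows; desc() puts TRUE before FALSE
na_first <- arrange(toy, desc(is.na(dep_time)), dep_time)
```

Note that arrange(toy, desc(dep_time)) would not help here: desc() flips the order of the real values, but the NA rows still land at the bottom.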
2a. Sort flights to find the most delayed flights.
This is a lot more straightforward. Just do what I did above, and forget about the is.na(). We don’t care about those missing values here, because we only care about those poor people whose flights were delayed the MOST due to weather, mechanical issues, or geese over the Hudson. Because we want the highest values, we use the descending order function.
arrange (flights, desc(dep_delay))
2b. Find the flights that left earliest.
Because we want the lowest numbers, we just ditch the desc.
arrange (flights, dep_delay)
This still baffles me. How can a flight leave early? I’d still be grabbing a cup of coffee from Dunkin.
3. Sort flights to find the fastest (highest speed) flights.
This kind of left me baffled for a bit, because how do you define “fastest”? There is no speed column, and air_time on its own only tells you how long a flight was in the air: just because a flight was shorter doesn’t mean it was faster. Some people blow money to fly from BWI to Richmond, VA. That plane may take its time for all we know. Since the question says highest speed, I divide distance by air_time and sort that in descending order.
arrange (flights, desc(distance / air_time))
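If “fastest” is read as highest average speed, it can be worked out as distance over air time. Here is a sketch with three made-up flights (the toy values and the mph column name are my own assumptions; in flights, distance is in miles and air_time is in minutes, hence the * 60).

```r
library(dplyr)

# Hypothetical toy flights
toy <- tibble(
  flight   = c("A", "B", "C"),
  distance = c(200, 1000, 2500),  # miles
  air_time = c(40, 150, 300)      # minutes
)

# Average speed in miles per hour: A = 300, B = 400, C = 500
by_speed <- toy %>%
  mutate(mph = distance / air_time * 60) %>%
  arrange(desc(mph))

# Sorting by air_time alone puts the slowest flight (A) first
by_air_time <- arrange(toy, air_time)
```

In this toy example the shortest flight is also the slowest one, which is exactly why sorting on air_time alone does not answer the question.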
4a. Which flights traveled the farthest?
Just like above, you could interpret this question as being related to air time, but I decided to go with distance. Just to mix up what I was sorting.
arrange (flights, desc(distance))
4b. Which traveled the shortest?
arrange (flights, distance)
1. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
Well, we can first do it the simple way. By name.
select(flights, dep_time, dep_delay, arr_time, arr_delay)
You can select the variables through their column numbers.
select (flights, 4, 6, 7, 9)
You can do starts_with ()
select (flights, starts_with("dep_"), starts_with ("arr"))
You can use all_of() and any_of(), both of which we will talk more about in question 3.
select (flights, all_of (c("dep_time", "dep_delay", "arr_time", "arr_delay")))
select (flights, any_of (c("dep_time", "dep_delay", "arr_time", "arr_delay")))
For other options, and they are almost “endless,” you can look at ?select. Plus, there are other ways of approaching this.
?select
2. What happens if you include the name of a variable multiple times in a select() call?
select (flights, year, month, day, dep_time, dep_time)
It looks like it just skips over the repeated variable, since departure time was only displayed once.
3a. What does the any_of() function do?
?any_of
It lets you select variables named in a character vector, like all_of(). The difference is that any_of() does not check for missing variables: all_of() throws an error if a name in the vector is not in the data frame, while any_of() silently skips it. That makes any_of() great for negative selections, since you can drop a column whether or not it is actually there.
3b. Why might it be helpful in conjunction with this vector?
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, any_of(vars))
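The difference from all_of() only shows up when a name in the vector is not in the data. Here is a minimal sketch on a made-up one-row table that has no arr_delay column (the toy table is hypothetical, and the tryCatch() wrapper is just my way of catching the error so the script keeps running).

```r
library(dplyr)

# Hypothetical toy data: note there is no arr_delay column
toy <- tibble(year = 2013, month = 1, day = 1, dep_delay = 2)

vars <- c("year", "month", "day", "dep_delay", "arr_delay")

# any_of() keeps the names it finds and silently skips arr_delay
picked <- select(toy, any_of(vars))

# all_of() is strict: a missing name is an error
strict <- tryCatch(
  select(toy, all_of(vars)),
  error = function(e) "error"
)
```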
It is useful because it cuts out having to retype the variable names every time you use them. By creating a vector you can keep your code shorter and more consistent, and with any_of() the select still works if one of the names in vars is missing from the data.
4a. Does the result of running the following code surprise you?
select(flights, contains("TIME"))
A bit? Looking more into the function afterwards, it makes plenty of sense. R is generally case sensitive, but contains() ignores case by default. In this code, it picks every column in the data frame whose name contains “time,” no matter the case involved.
4b. How do the select helpers deal with case by default?
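By default, the select helpers (starts_with(), ends_with(), contains(), and matches()) ignore case, because their ignore.case argument defaults to TRUE. I believe this is just to make the user interface easier, and it helps alleviate potential human error by making sure nothing is left out unintentionally.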
4c. How can you change that default?
select(flights, contains("TIME", ignore.case = FALSE))
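Here is a small sketch of both settings on a made-up table with lower-case column names (the toy table is hypothetical).

```r
library(dplyr)

# Hypothetical toy data with lower-case column names
toy <- tibble(dep_time = 517, air_time = 227, carrier = "UA")

# Default (ignore.case = TRUE): "TIME" matches dep_time and air_time
loose <- names(select(toy, contains("TIME")))

# Case-sensitive: no column name contains upper-case "TIME"
strict <- names(select(toy, contains("TIME", ignore.case = FALSE)))
```

With ignore.case = FALSE nothing matches, so select() returns a table with no columns instead of an error.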