library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.0      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata

1

The three histograms each obscure some features and reveal others. The first histogram suggests that the majority of flights are delayed by an hour or two. The second histogram shows a good number of early flights that were hidden in the first. The last histogram shows the true scale of the most frequent interval, which contains a number of flights in the 30,000s.
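The comparison can be reproduced with a sketch along these lines. It assumes the three histograms plot dep_delay from nycflights with the default bins, a bin width of 15 minutes, and a bin width of 150 minutes; those settings and the object names are assumptions, since the plotting code is not shown above.

```r
library(tidyverse)
library(openintro)

# Default binning (ggplot2 picks 30 bins and prints a message about it)
p_default <- ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram()

# Narrow bins: reveals detail near zero, including early departures
p_narrow <- ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)

# Wide bins: collapses most flights into a single tall bar
p_wide <- ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 150)

p_default
p_narrow
p_wide
```

Narrower bins expose more of the shape near zero, while wider bins hide it but make the overall scale easier to read.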

2

sfo_feb_flights <- filter(nycflights, month == 2, dest == "SFO")

count(sfo_feb_flights)
## # A tibble: 1 × 1
##       n
##   <int>
## 1    68

There are 68 total flights with these characteristics.

3

ggplot(data = sfo_feb_flights, aes(x = dep_delay)) + geom_histogram(bins = 50)

4

sfo_feb_flights %>%
  group_by(carrier) %>%
  summarise(
    median_dd = median(arr_delay),  # median of arrival delay
    iqr_dd = IQR(dep_delay),        # IQR of departure delay
    n_flights = n()
  )
## # A tibble: 5 × 4
##   carrier median_dd iqr_dd n_flights
##   <chr>       <dbl>  <dbl>     <int>
## 1 AA            5     32.8        10
## 2 B6          -10.5    3.5         6
## 3 DL          -15      6.5        19
## 4 UA          -10     13          21
## 5 VX          -22.5   16.8        12

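AA has the largest IQR (32.8 minutes) and is therefore the most variable carrier in this table. Because iqr_dd is computed from dep_delay, this describes the spread of departure delays rather than arrival delays.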
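To measure the variability of arrival delays directly, a minimal sketch is to take both the median and the IQR over arr_delay. The column names median_ad and iqr_ad are illustrative, and the data frame is rebuilt from nycflights so the block runs on its own. No output is shown because the block has not been run here.

```r
library(tidyverse)
library(openintro)

# February flights from NYC to SFO
sfo_feb_flights <- nycflights %>%
  filter(month == 2, dest == "SFO")

# Median and IQR both computed on arrival delay, most variable carrier first
arr_summary <- sfo_feb_flights %>%
  group_by(carrier) %>%
  summarise(
    median_ad = median(arr_delay),
    iqr_ad = IQR(arr_delay),
    n_flights = n()
  ) %>%
  arrange(desc(iqr_ad))

arr_summary
```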

5

Choosing the month with the lowest mean delay can be misleading if the data are left- or right-skewed. For example, if a very large number of flights leave early or only 0 to 10 minutes late, but an equal number are delayed by more than 10 minutes, the mean can be close to 0 even though you still face roughly a 50/50 chance of a delay. If the distribution is roughly symmetric, the mean is a good indicator. Otherwise, it is better to use the median.
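A small illustration of how skew separates the two statistics. The delay values are hypothetical, chosen only to make the effect obvious, and are not taken from nycflights.

```r
# Hypothetical departure delays in minutes:
# eight flights leave 5 minutes early, two are delayed by several hours.
delays <- c(rep(-5, 8), 240, 300)

mean_delay <- mean(delays)      # pulled upward by the two extreme delays
median_delay <- median(delays)  # reflects the typical flight

print(c(mean = mean_delay, median = median_delay))
# mean = 50, median = -5
```

Here the mean suggests a typical delay of 50 minutes even though eight of the ten flights left early, which is why the median is the safer summary for skewed delays.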

6

I would choose to fly out of LGA, as it has the highest on-time departure rate of the three airports.
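The rate behind this choice can be computed with a sketch like the one below. It assumes a flight counts as "on time" when dep_delay is under 5 minutes; that threshold and the names dep_type, ot_dep_rate, and origin_rates are assumptions, since the original calculation is not shown.

```r
library(tidyverse)
library(openintro)

origin_rates <- nycflights %>%
  # Assumed definition: departures less than 5 minutes late are on time
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed")) %>%
  group_by(origin) %>%
  summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
  arrange(desc(ot_dep_rate))

origin_rates
```

The airport in the first row has the highest share of on-time departures under this definition.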

7

# distance is in miles and air_time in minutes, so avg_speed is in miles per hour
nycflights <- nycflights %>%
  mutate(avg_speed = distance / air_time * 60)

8

ggplot(nycflights, aes(x = distance, y = avg_speed)) + geom_point()

I noticed that the greater the distance traveled, the higher the average speed usually is.

9

nycflights3 <- nycflights %>%
  filter(carrier %in% c("AA", "DL", "UA"))

ggplot(data = nycflights3, aes(x = dep_delay, y = arr_delay, color = carrier)) +
  geom_point()

It looks as though the cutoff point is a little less than half of the distance from 0 to 100 on the departure delay axis, so around a little more than an hour.
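The visual estimate can be checked numerically. The sketch below assumes "arriving on time" means arr_delay of 0 or less and finds the largest departure delay among such flights for the three carriers. The names on_time_cutoff and max_dep_delay_on_time are illustrative.

```r
library(tidyverse)
library(openintro)

on_time_cutoff <- nycflights %>%
  filter(carrier %in% c("AA", "DL", "UA")) %>%
  # Assumed definition: a flight arrived on time if arr_delay <= 0
  filter(arr_delay <= 0) %>%
  summarise(max_dep_delay_on_time = max(dep_delay))

on_time_cutoff
```

Because this is a maximum, a single unusual flight determines the result, so it is best read alongside the scatterplot rather than as a replacement for it.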