## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Some flights appear to leave before their scheduled departure time (a negative dep_delay), which is not apparent in the first histogram with the lower binwidth.
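As a rough illustration of how bin width controls what a histogram can show, the sketch below bins the same values two ways using base R's `hist()` with `plot = FALSE`. The `delays` vector is hypothetical (it is not taken from nycflights), and the break points are chosen only for this example. In ggplot2 the corresponding knob is the `binwidth` argument of `geom_histogram()`.

```r
# Hypothetical departure delays in minutes (not taken from nycflights).
delays <- c(-8, -5, -3, -2, 0, 1, 4, 12, 35, 140)

# Wide bins (150 minutes): early and slightly late flights share one bar.
wide <- hist(delays, breaks = seq(-75, 225, by = 150), plot = FALSE)

# Narrow bins (15 minutes): flights at or before schedule get their own bar.
narrow <- hist(delays, breaks = seq(-15, 150, by = 15), plot = FALSE)

print(wide$counts)    # one count per 150-minute bin
print(narrow$counts)  # one count per 15-minute bin
```

With the wide bins, nine of the ten flights land in a single bar, so early departures cannot be told apart from small delays. With the narrow bins, the first bar holds only the flights that left at or before their scheduled time.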
lax_flights <- nycflights %>%
  filter(dest == "LAX")
ggplot(data = lax_flights, aes(x = dep_delay)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## # A tibble: 1 x 3
## mean_dd median_dd n
## <dbl> <dbl> <int>
## 1 9.78 -1 1583
sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)
sfo_feb_flights %>% summarise(count = n())
## # A tibble: 1 x 1
## count
## <int>
## 1 68
68 flights departed for San Francisco (SFO) in February.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
sfo_feb_flights %>%
  summarise(median_ad = median(arr_delay), iqr_ad = IQR(arr_delay), n_flights = n())
## # A tibble: 1 x 3
## median_ad iqr_ad n_flights
## <dbl> <dbl> <int>
## 1 -11 23.2 68
The distribution is right-skewed and unimodal. Because a few outliers would inflate the mean and standard deviation, the median and IQR are the better summary statistics here. The median arrival delay is -11 minutes and the IQR is 23.25 minutes.
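To make the robustness argument concrete, here is a minimal base R sketch on a small hypothetical vector of arrival delays (the values are invented for illustration, not drawn from sfo_feb_flights). It compares the mean and median with and without one large outlier, and shows that `IQR()` is the distance between the 25th and 75th percentiles.

```r
# Hypothetical arrival delays in minutes; the last value is an outlier.
arr_delay <- c(-30, -20, -11, -5, 2, 15, 196)

center_mean   <- mean(arr_delay)    # pulled upward by the outlier
center_median <- median(arr_delay)  # middle value of the sorted data

# IQR() is the distance between the 25th and 75th percentiles.
q <- quantile(arr_delay, c(0.25, 0.75), names = FALSE)
spread_iqr <- IQR(arr_delay)

# The same statistics with the outlier removed.
trimmed <- arr_delay[arr_delay < 100]
trimmed_mean   <- mean(trimmed)
trimmed_median <- median(trimmed)

print(c(mean = center_mean, median = center_median, iqr = spread_iqr))
print(c(mean = trimmed_mean, median = trimmed_median))
```

Dropping the single outlier moves the mean by about 29 minutes but the median by only 3, which is the sense in which the median (and the IQR built from quartiles) resists a few extreme flights.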
sfo_feb_flights %>%
  group_by(carrier) %>%
  summarise(median_ad = median(arr_delay), iqr_ad = IQR(arr_delay), n_flights = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 4
## carrier median_ad iqr_ad n_flights
## <chr> <dbl> <dbl> <int>
## 1 AA 5 17.5 10
## 2 B6 -10.5 12.2 6
## 3 DL -15 22 19
## 4 UA -10 22 21
## 5 VX -22.5 21.2 12
Delta Air Lines (DL) and United Airlines (UA) have the most variable arrival delays: they share the largest IQR, 22 minutes.
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 2
## month mean_dd
## <int> <dbl>
## 1 7 20.8
## 2 6 20.4
## 3 12 17.4
## 4 4 14.6
## 5 3 13.5
## 6 5 13.3
## 7 8 12.6
## 8 2 10.7
## 9 1 10.2
## 10 9 6.87
## 11 11 6.10
## 12 10 5.88
nycflights %>%
  group_by(month) %>%
  summarise(median_dd = median(dep_delay)) %>%
  arrange(desc(median_dd))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 2
## month median_dd
## <int> <dbl>
## 1 12 1
## 2 6 0
## 3 7 0
## 4 3 -1
## 5 5 -1
## 6 8 -1
## 7 1 -2
## 8 2 -2
## 9 4 -2
## 10 11 -2
## 11 9 -3
## 12 10 -3
The median is the more robust measure because it is less susceptible to outliers. In this case, however, I think the outliers matter: the mean reflects the occasional very long delays, so ranking months by the mean gives us a better chance of avoiding most delays.
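The two rankings can disagree, and a small sketch shows why. The data frame below is hypothetical (three months with five invented delays each), and it uses base R's `tapply()` in place of the `group_by()` and `summarise()` pipeline so that it runs without extra packages.

```r
# Hypothetical departure delays in minutes for three months.
flights <- data.frame(
  month = rep(c(7, 10, 12), each = 5),
  dep_delay = c(-3, -1, 0, 2, 120,   # month 7: one very long delay
                -5, -3, -3, 0, 10,   # month 10: mostly early departures
                -1, 0, 1, 3, 20)     # month 12: slightly late throughout
)

# Per-month mean and median, analogous to summarise() after group_by(month).
mean_dd   <- tapply(flights$dep_delay, flights$month, mean)
median_dd <- tapply(flights$dep_delay, flights$month, median)

# Months ordered from worst to best under each statistic.
rank_by_mean   <- names(mean_dd)[order(mean_dd, decreasing = TRUE)]
rank_by_median <- names(median_dd)[order(median_dd, decreasing = TRUE)]

print(rank_by_mean)
print(rank_by_median)
```

In this toy data, month 7 is worst by mean because of its single 120-minute delay, while month 12 is worst by median. Which ranking to trust depends on whether the rare long delay is the thing to avoid.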
nycflights <- nycflights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
nycflights %>%
  group_by(origin) %>%
  summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
  arrange(desc(ot_dep_rate))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## origin ot_dep_rate
## <chr> <dbl>
## 1 LGA 0.728
## 2 JFK 0.694
## 3 EWR 0.637
Based only on the on-time departure rate, LaGuardia (LGA) would be the logical choice.
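The on-time rate is just the share of flights whose delay falls under the 5-minute cutoff. The sketch below reproduces that calculation in base R on a small hypothetical data frame (eight invented flights, two airports). It relies on the fact that the mean of a logical vector is the proportion of TRUE values, which is equivalent to `sum(dep_type == "on time") / n()`.

```r
# Hypothetical flights; dep_delay is in minutes.
flights <- data.frame(
  origin = c("LGA", "LGA", "LGA", "LGA", "EWR", "EWR", "EWR", "EWR"),
  dep_delay = c(-2, 0, 3, 40, -1, 5, 18, 60)
)

# Same rule as above: strictly less than 5 minutes counts as on time,
# so a delay of exactly 5 minutes is classified as delayed.
flights$dep_type <- ifelse(flights$dep_delay < 5, "on time", "delayed")

# Proportion of on-time departures per origin airport.
ot_dep_rate <- tapply(flights$dep_type == "on time", flights$origin, mean)

print(sort(ot_dep_rate, decreasing = TRUE))
```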
## # A tibble: 6 x 18
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int>
## 1 2013 6 30 940 15 1216 -4 VX N626VA 407
## 2 2013 5 7 1657 -3 2104 10 DL N3760C 329
## 3 2013 12 8 859 -1 1238 11 DL N712TW 422
## 4 2013 5 14 1841 -4 2122 -34 DL N914DL 2391
## 5 2013 7 21 1102 -3 1230 -8 9E N823AY 3652
## 6 2013 1 1 1817 -3 2008 3 AA N3AXAA 353
## # ... with 8 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, dep_type <chr>, avg_speed <dbl>
Average speed increases as distance increases, but the relationship between avg_speed and distance appears to be non-linear.
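The code that created avg_speed is not shown here, so the following is only a sketch under an assumption: that avg_speed is distance in miles divided by air time in hours, with air_time recorded in minutes. The three flights are hypothetical values picked to keep the arithmetic easy to follow.

```r
# Hypothetical flights: distance in miles, air_time in minutes.
flights <- data.frame(
  distance = c(200, 1000, 2475),
  air_time = c(50, 150, 330)
)

# Assumed definition: miles per hour = miles / (minutes / 60).
flights$avg_speed <- flights$distance / (flights$air_time / 60)

print(flights)
```

Under this assumption the speeds come out to roughly 240, 400, and 450 mph. Speed rises with distance in these invented numbers, but by less for each additional mile, which is one shape a non-linear relationship could take.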
airlines <- c("AA", "DL", "UA")
selectedFlights <- nycflights %>%
  filter(carrier %in% airlines)
qplot(x = dep_delay, y = arr_delay, data = selectedFlights, color = carrier)
George Cruz