Background
You just started your internship at a big firm in New York, and your manager gave you an extensive file of flights that departed JFK, LGA, or EWR in 2013. The data (nycflights13::flights) is available in R via install.packages("nycflights13") followed by library(nycflights13). Your manager wants you to answer the following questions:
If I am leaving before noon, which two airlines do you recommend at each airport (JFK, LGA, EWR) that will have the lowest delay time at the 75th percentile?
Which origin airport is best to minimize my chances of a late arrival when I am using Delta Airlines?
Which destination airport is the worst (you decide on the metric for worst) airport for arrival time?
install.packages("nycflights13")
library(nycflights13)
library(tidyverse)
## -- Attaching packages --------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.1 v purrr 0.3.2
## v tibble 2.1.1 v dplyr 0.8.0.1
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
If I am leaving before noon, which two airlines do you recommend at each airport (JFK, LGA, EWR) that will have the lowest delay time at the 75th percentile?
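– SkyWest Airlines Inc. According to the count plot below, the number of delayed flights, n(), is lower for SkyWest Airlines Inc. than for any other carrier. The plot pools all departure times and all three origin airports, so it compares delay counts rather than the 75th percentile of delay at each airport.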
library(nycflights13)
nycflights13::flights
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
nycflights13::airlines
## # A tibble: 16 x 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
nycflights13::airports
## # A tibble: 1,458 x 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <int> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New_~
## 2 06A Moton Field Municipa~ 32.5 -85.7 264 -6 A America/Chic~
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chic~
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/New_~
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/New_~
## 6 0A9 Elizabethton Municip~ 36.4 -82.2 1593 -5 A America/New_~
## 7 0G6 Williams County Airp~ 41.5 -84.5 730 -5 A America/New_~
## 8 0G7 Finger Lakes Regiona~ 42.9 -76.8 492 -5 A America/New_~
## 9 0P2 Shoestring Aviation ~ 39.8 -76.6 1000 -5 U America/New_~
## 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/Los_~
## # ... with 1,448 more rows
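# Count plot of arrival delay (minutes) by carrier; point size shows how many flights share a value
ggplot(flights, aes(x = carrier, y = arr_delay)) +
  geom_count() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))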
## Warning: Removed 9430 rows containing non-finite values (stat_sum).
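The 75th percentile asked for in the question can also be computed directly instead of being read off a plot. The following is a minimal sketch, not the method used above. It assumes that "leaving before noon" means sched_dep_time < 1200 and that "delay time" means dep_delay; both are interpretations, and the helper name top_carriers_p75 is hypothetical.

```r
library(dplyr)

# Hypothetical helper: for each origin airport, return the n_top carriers with
# the lowest 75th-percentile departure delay among morning flights.
# Assumptions: "before noon" = sched_dep_time < 1200; delay = dep_delay.
top_carriers_p75 <- function(flights_df, n_top = 2) {
  flights_df %>%
    filter(sched_dep_time < 1200, !is.na(dep_delay)) %>%
    group_by(origin, carrier) %>%
    summarise(
      p75_delay = quantile(dep_delay, 0.75, names = FALSE),
      n_flights = n()
    ) %>%
    ungroup() %>%
    arrange(origin, p75_delay) %>%   # lowest percentile first within each origin
    group_by(origin) %>%
    slice(seq_len(n_top)) %>%        # keep the first n_top rows per origin
    ungroup()
}

# Run on the real data only if the package is installed
if (requireNamespace("nycflights13", quietly = TRUE)) {
  print(top_carriers_p75(nycflights13::flights))
}
```

The n_flights column is kept so that carriers with very few morning flights at an airport can be spotted before they are recommended.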
Which origin airport is best to minimize my chances of a late arrival when I am using Delta Airlines?
Using the number of delayed flights as the metric, the best origin airport for Delta Air Lines is Newark Liberty International Airport (EWR).
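A raw count of delayed flights depends on how many Delta flights each airport handles, so a proportion may be a fairer comparison. The sketch below is one possible approach, not the analysis used for the answer above. It assumes a flight is "late" when arr_delay > 0, and the helper name late_share_by_origin is hypothetical.

```r
library(dplyr)

# Hypothetical helper: share of a carrier's flights that arrive late, by origin.
# Assumption: "late" means arr_delay > 0; flights with missing arr_delay are dropped.
late_share_by_origin <- function(flights_df, carrier_code = "DL") {
  flights_df %>%
    filter(carrier == carrier_code, !is.na(arr_delay)) %>%
    group_by(origin) %>%
    summarise(
      n_flights  = n(),
      share_late = mean(arr_delay > 0)
    ) %>%
    arrange(share_late)              # best (lowest share) origin first
}

# Run on the real data only if the package is installed
if (requireNamespace("nycflights13", quietly = TRUE)) {
  print(late_share_by_origin(nycflights13::flights))
}
```

Changing the threshold (for example arr_delay > 15) only requires editing the expression inside mean().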
Which destination airport is the worst (you decide on the metric for worst) airport for arrival time?
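Using the number of delayed flights as the metric, the worst airport for Delta Air Lines is LaGuardia Airport (LGA). This comparison is among the three origin airports, since the plot below groups Delta flights by origin.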
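# Keep only Delta Air Lines flights (carrier code "DL")
Delta <- subset(flights, carrier == "DL")
# Count plot of arrival delay (minutes) by origin airport
ggplot(Delta, aes(x = origin, y = arr_delay)) +
  geom_count() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))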
## Warning: Removed 452 rows containing non-finite values (stat_sum).
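The destination question can be approached by summarising arrival delay per dest. The following is a minimal sketch under stated assumptions: "worst" is taken to mean the highest mean arr_delay, and destinations with fewer than min_flights recorded arrivals are excluded so that a handful of flights cannot dominate. The metric, the threshold, and the helper name worst_destinations are all illustrative choices, not part of the original analysis.

```r
library(dplyr)

# Hypothetical helper: rank destination airports by mean arrival delay.
# Assumptions: "worst" = highest mean arr_delay; destinations with fewer than
# min_flights non-missing arrivals are dropped.
worst_destinations <- function(flights_df, min_flights = 100, n_worst = 5) {
  flights_df %>%
    filter(!is.na(arr_delay)) %>%
    group_by(dest) %>%
    summarise(
      n_flights      = n(),
      mean_arr_delay = mean(arr_delay),
      share_late     = mean(arr_delay > 0)
    ) %>%
    filter(n_flights >= min_flights) %>%
    arrange(desc(mean_arr_delay)) %>%  # worst destination first
    head(n_worst)
}

# Run on the real data only if the package is installed
if (requireNamespace("nycflights13", quietly = TRUE)) {
  print(worst_destinations(nycflights13::flights))
}
```

The dest codes in the result can be matched to airport names by joining against nycflights13::airports on the faa column.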