The nycflights data set contains 16 variables for a random sample of 32,735 flights that departed from 3 New York City airports in 2013. The categorical variables are month, origin, destination, and carrier. The quantitative variables are departure delay, air time, distance, hour, and minute. The other variables include tail number, departure time, and day of month.
Of the flights leaving New York City area airports, the distribution of departure delays is unimodal and skewed right. The median departure delay is -2 minutes, which means that a typical flight leaves 2 minutes early. The middle 50% of flights range from departing 5 minutes early to departing 11 minutes late. There were 13 flights with delays of over 400 minutes. One flight left about 1300 minutes late (over 21 hours late!).
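The figures quoted above (median, quartiles, and the count of extreme delays) are the kind of values a single `summarise()` call can produce. The following is a minimal sketch rather than the code actually used for this report: it assumes a departure-delay column named `dep_delay`, and it runs on a small made-up data frame (`toy_flights`, hypothetical) so that it is self-contained. Swapping in `nycflights` should give the real numbers.

```r
library(dplyr)

# Made-up delays in minutes (NOT the real nycflights sample); they only keep
# this sketch runnable on its own.
toy_flights <- data.frame(
  dep_delay = c(-6, -5, -3, -2, -2, 0, 4, 11, 45, 420)
)

dep_summary <- toy_flights %>%
  summarise(
    median_dd  = median(dep_delay),
    q1_dd      = quantile(dep_delay, 0.25, names = FALSE),  # lower quartile
    q3_dd      = quantile(dep_delay, 0.75, names = FALSE),  # upper quartile
    n_over_400 = sum(dep_delay > 400),                      # extreme delays
    n          = n()
  )

dep_summary
```

The interval from `q1_dd` to `q3_dd` is the "middle 50%" described in the text.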
# Create a new data frame of SFO flights in February
sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)
From our sample, 68 flights went from New York City area airports to San Francisco International Airport in February.
Of the 68 sampled flights that left the New York metro area for San Francisco, the distribution of departure delays is unimodal and skewed right, with at least 3 outliers on the high end. The center of the distribution (the median) is -2, which indicates a flight that leaves 2 minutes early. The middle 50% of flights leave between 5 minutes early and 9 minutes late, an IQR of 14 minutes.
## # A tibble: 1 × 3
## mean_dd median_dd n
## <dbl> <dbl> <int>
## 1 10.5 -2 68
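A one-row summary with a mean, a median, and a count, like the one above, can be obtained by filtering and then summarising. This is a sketch under assumptions, not the original chunk: the column name `dep_delay` is assumed, and `toy_flights` is a made-up stand-in for `nycflights`.

```r
library(dplyr)

# Made-up rows, for illustration only.
toy_flights <- data.frame(
  dest      = c("SFO", "SFO", "SFO", "LAX", "SFO", "SFO"),
  month     = c(2, 2, 2, 2, 3, 2),
  dep_delay = c(-5, -2, 40, 10, 0, -1)
)

toy_summary <- toy_flights %>%
  filter(dest == "SFO", month == 2) %>%   # keep February flights to SFO
  summarise(
    mean_dd   = mean(dep_delay),
    median_dd = median(dep_delay),
    n         = n()
  )

toy_summary
```

In this toy case the single 40-minute delay lifts the mean to 8 while the median stays at -1.5, the same pattern as in the real output.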
## # A tibble: 5 × 4
## carrier median_arrd iqr_arrd n_flights
## <chr> <dbl> <dbl> <int>
## 1 AA 5 17.5 10
## 2 B6 -10.5 12.2 6
## 3 DL -15 22 19
## 4 UA -10 22 21
## 5 VX -22.5 21.2 12
Based on the IQR, Delta Air Lines (DL) and United Airlines (UA) had the most variable arrival delays, each with an IQR of 22 minutes.
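A per-carrier table of this shape is what a grouped summary produces. The sketch below assumes an arrival-delay column named `arr_delay` and reuses the output column names `median_arrd`, `iqr_arrd`, and `n_flights` from the table; the data frame `toy_sfo` is made up.

```r
library(dplyr)

# Made-up arrival delays in minutes, for illustration only.
toy_sfo <- data.frame(
  carrier   = c("AA", "AA", "AA", "DL", "DL", "DL", "DL"),
  arr_delay = c(-4, 5, 20, -30, -15, -10, 12)
)

carrier_summary <- toy_sfo %>%
  group_by(carrier) %>%
  summarise(
    median_arrd = median(arr_delay),
    iqr_arrd    = IQR(arr_delay),   # spread of the middle 50% of flights
    n_flights   = n()
  )

carrier_summary
```

Adding `arrange(desc(iqr_arrd))` to the end of the pipeline would put the most variable carrier first.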
The mean has the advantage of accounting for every value (flight) in the data. However, because the distribution is right-skewed, the mean is pulled higher, possibly making it less representative of a typical flight. The median represents the middle of the values and is resistant to skew, but it does not reflect the size of the outlying values.
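The trade-off can be seen on a small made-up vector of delays (hypothetical values, not taken from the sample): dropping a single large delay moves the mean substantially but leaves the median where it was.

```r
# Made-up departure delays in minutes; the last value plays the outlier.
delays <- c(-5, -3, -2, -2, 0, 9, 80)

mean_all   <- mean(delays)     # 11
median_all <- median(delays)   # -2

# Same flights without the one very late departure.
trimmed        <- delays[delays < 60]
mean_trimmed   <- mean(trimmed)     # -0.5
median_trimmed <- median(trimmed)   # -2

c(mean_all = mean_all, mean_trimmed = mean_trimmed,
  median_all = median_all, median_trimmed = median_trimmed)
```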
## # A tibble: 12 × 2
## month mean_dd
## <int> <dbl>
## 1 7 20.8
## 2 6 20.4
## 3 12 17.4
## 4 4 14.6
## 5 3 13.5
## 6 5 13.3
## 7 8 12.6
## 8 2 10.7
## 9 1 10.2
## 10 9 6.87
## 11 11 6.10
## 12 10 5.88
## # A tibble: 12 × 2
## month median_dd
## <int> <dbl>
## 1 12 1
## 2 6 0
## 3 7 0
## 4 3 -1
## 5 5 -1
## 6 8 -1
## 7 1 -2
## 8 2 -2
## 9 4 -2
## 10 11 -2
## 11 9 -3
## 12 10 -3
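The two monthly rankings above differ only in the summary statistic and the sort key. One possible way to build both columns in a single pass is sketched below, assuming a `dep_delay` column and using made-up data (`toy_flights`); it is not necessarily how the tables in this report were generated.

```r
library(dplyr)

# Made-up delays for two months, for illustration only.
toy_flights <- data.frame(
  month     = c(1, 1, 1, 7, 7, 7),
  dep_delay = c(-3, -2, 20, -1, 0, 64)
)

monthly <- toy_flights %>%
  group_by(month) %>%
  summarise(
    mean_dd   = mean(dep_delay),
    median_dd = median(dep_delay)
  ) %>%
  arrange(desc(mean_dd))   # use desc(median_dd) to rank by median instead

monthly
```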
## # A tibble: 3 × 2
## origin ot_dep_rate
## <chr> <dbl>
## 1 LGA 0.728
## 2 JFK 0.694
## 3 EWR 0.637
I would choose to fly out of LaGuardia Airport (LGA), as it has the highest on-time departure rate (about 73%).
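An on-time departure rate per airport can be computed by labelling each flight and then taking the share of "on time" labels within each origin. The sketch below makes two assumptions that the text does not confirm: the delay column is named `dep_delay`, and a flight counts as on time when it departs less than 5 minutes late (`on_time_cutoff`, a hypothetical name and value). The data are made up.

```r
library(dplyr)

# ASSUMPTION: the report does not state its cutoff; 5 minutes is illustrative.
on_time_cutoff <- 5

# Made-up departure delays in minutes, for illustration only.
toy_flights <- data.frame(
  origin    = c("LGA", "LGA", "LGA", "LGA", "EWR", "EWR", "EWR", "EWR"),
  dep_delay = c(-4, 0, 3, 30, -2, 12, 45, 8)
)

ot_rates <- toy_flights %>%
  mutate(dep_type = ifelse(dep_delay < on_time_cutoff, "on time", "delayed")) %>%
  group_by(origin) %>%
  summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
  arrange(desc(ot_dep_rate))

ot_rates
```

Changing `on_time_cutoff` changes the rates, so any comparison between airports should use the same cutoff throughout.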
# Average speed in miles per hour: distance is in miles, air_time is in minutes
nycflights <- nycflights %>%
  mutate(avg_speed = distance / air_time * 60)
The relationship between average speed and distance looks roughly logarithmic. Below 500 miles, average speed rises quickly as distance increases. However, at distances over 1000 miles, average speed levels off and stays in the range of about 300-550 miles per hour.
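As a quick check of the unit conversion behind `avg_speed`, the same formula can be applied to a few made-up flights (hypothetical values): dividing miles by minutes gives miles per minute, and multiplying by 60 converts that to miles per hour.

```r
library(dplyr)

# Made-up flights, for illustration only.
toy_flights <- data.frame(
  distance = c(200, 1000, 2500),   # miles
  air_time = c(60, 120, 300)       # minutes
)

toy_speed <- toy_flights %>%
  mutate(avg_speed = distance / air_time * 60)   # miles per hour

toy_speed$avg_speed   # 200, 500, 500 mph
```

The short hop averages a much lower speed than the two longer flights, which is consistent with the pattern described above.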
According to the graph below, it looks as if a flight can depart up to about 60 minutes late and still arrive on time, for the most part.
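One way to check a cutoff like this numerically, rather than by eye, is to bin flights by departure delay and compute the share that arrive on time in each bin. The sketch below is an assumption-laden illustration: it assumes columns named `dep_delay` and `arr_delay`, treats an arrival delay of 0 minutes or less as on time, picks arbitrary bin edges, and runs on made-up data, so its output says nothing about the real sample.

```r
library(dplyr)

# Made-up delays in minutes, for illustration only.
toy_flights <- data.frame(
  dep_delay = c(-5, 10, 25, 50, 55, 70, 90, 120),
  arr_delay = c(-20, -8, -2, -1, 6, 15, 60, 110)
)

arrival_by_bin <- toy_flights %>%
  mutate(
    # Bins are right-closed: (-Inf, 0], (0, 30], (30, 60], (60, Inf)
    dep_bin = cut(
      dep_delay,
      breaks = c(-Inf, 0, 30, 60, Inf),
      labels = c("0 or less", "1-30", "31-60", "over 60")
    )
  ) %>%
  group_by(dep_bin) %>%
  summarise(
    share_on_time_arrival = mean(arr_delay <= 0),  # ASSUMPTION: <= 0 is on time
    n = n()
  )

arrival_by_bin
```

Run on `nycflights`, the bin where the on-time share drops off sharply would indicate roughly how late a flight can depart and still be expected to arrive on time.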