The Data

The nycflights data set contains 16 variables for a random sample of 32735 flights that departed from 3 different New York City airports in 2013. The categorical variables are month, origin, destination, and carrier. The quantitative variables are departure delay, air time, distance, hour, and minute. The other variables include tail number, departure time, and day of month.

Analysis

Departure Delays
Exercise 1: Distribution of Departure Delays

Of the flights leaving New York City area airports, the distribution of departure delays is unimodal and skewed right. The median departure delay is -2 minutes, which means the typical flight leaves 2 minutes early. The middle 50% of flights range from departing 5 minutes early to departing 11 minutes late. There were 13 flights with delays of over 400 minutes, and one flight left about 1300 minutes late (over 21 hours late!).
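The summary statistics quoted above can be computed along the following lines. This is only a minimal sketch: `toy_flights` is a small, made-up data frame standing in for nycflights (only the column name dep_delay is taken from the data), so the numbers it produces are not the ones reported here.

```r
library(dplyr)

# Hypothetical stand-in for nycflights; the values are invented for illustration.
toy_flights <- tibble(dep_delay = c(-6, -5, -3, -2, -2, 0, 4, 11, 45, 420))

dep_summary <- toy_flights %>%
  summarise(
    median_dd  = median(dep_delay),                          # center, robust to the long right tail
    q1_dd      = quantile(dep_delay, 0.25, names = FALSE),   # lower end of the middle 50%
    q3_dd      = quantile(dep_delay, 0.75, names = FALSE),   # upper end of the middle 50%
    mean_dd    = mean(dep_delay),                            # pulled upward by extreme delays
    n_over_400 = sum(dep_delay > 400)                        # number of very long delays
  )

dep_summary
```

In this toy data the single 420-minute delay lifts the mean far above the median, which is the same pattern that motivates reporting the median for the real data.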

Exercise 2: How many flights headed to SFO in February?

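library(dplyr)  # provides %>% and filter()

# Create a new data frame of SFO-bound flights in February
sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)
nrow(sfo_feb_flights)  # number of sampled flights to SFO in February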

From our sample, 68 flights went from New York City area airports to San Francisco International Airport in February.

Exercise 3: Describe the distribution of the arrival delays of these flights using a histogram and appropriate summary statistics.

Of the 68 sampled flights that left the New York metro area bound for San Francisco, the distribution is unimodal and skewed right, with at least 3 outliers on the high end. The center of the distribution (the median) is -2 minutes, which indicates a flight that leaves 2 minutes early. The middle 50% of flights leave between 5 minutes early and 9 minutes late, an IQR of 14 minutes.

## # A tibble: 1 × 3
##   mean_dd median_dd     n
##     <dbl>     <dbl> <int>
## 1    10.5        -2    68

Exercise 4: After learning to create groups, calculate the median and interquartile range for arr_delay of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?

## # A tibble: 5 × 4
##   carrier median_arrd iqr_arrd n_flights
##   <chr>         <dbl>    <dbl>     <int>
## 1 AA              5       17.5        10
## 2 B6            -10.5     12.2         6
## 3 DL            -15       22          19
## 4 UA            -10       22          21
## 5 VX            -22.5     21.2        12

Based on the IQR, Delta Air Lines (DL) and United Airlines (UA) tied for the most variable arrival delays, each with an IQR of 22 minutes.
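A per-carrier table like the one above can be built by grouping and then summarising. The sketch below uses an invented data frame, `toy_sfo`, in place of sfo_feb_flights; the output column names mirror the table, but the data values are hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for sfo_feb_flights; the values are invented.
toy_sfo <- tibble(
  carrier   = c("AA", "AA", "AA", "AA", "UA", "UA", "UA", "UA"),
  arr_delay = c(-10, 0, 10, 20, -40, -10, 5, 60)
)

by_carrier <- toy_sfo %>%
  group_by(carrier) %>%
  summarise(
    median_arrd = median(arr_delay),  # typical arrival delay per carrier
    iqr_arrd    = IQR(arr_delay),     # spread of the middle 50% per carrier
    n_flights   = n()                 # group size, useful for judging reliability
  ) %>%
  arrange(desc(iqr_arrd))             # most variable carrier first

by_carrier
```

Sorting by the IQR puts the most variable carrier in the first row. Keeping n_flights in the output is a reminder that an IQR computed from only a handful of flights is a rough estimate.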

Departure Delays by Month

Exercise 5: After learning to group, summarize and arrange: Suppose you really dislike departure delays and you want to schedule your travel in a month that minimizes your potential departure delay leaving NYC. One option is to choose the month with the lowest mean departure delay. Another option is to choose the month with the lowest median departure delay. What are the pros and cons of these two choices?

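The mean has the advantage of accounting for every value (flight) in the data. However, because the distribution is skewed right, the mean is pulled upward by a few extreme delays, which can make it less representative of a typical flight. The median represents the middle of the values and is not affected by skew, but it also ignores how large the outlying delays are. In the tables below, October has the lowest mean departure delay (5.88 minutes), and September and October share the lowest median (-3 minutes).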

## # A tibble: 12 × 2
##    month mean_dd
##    <int>   <dbl>
##  1     7   20.8 
##  2     6   20.4 
##  3    12   17.4 
##  4     4   14.6 
##  5     3   13.5 
##  6     5   13.3 
##  7     8   12.6 
##  8     2   10.7 
##  9     1   10.2 
## 10     9    6.87
## 11    11    6.10
## 12    10    5.88

## # A tibble: 12 × 2
##    month median_dd
##    <int>     <dbl>
##  1    12         1
##  2     6         0
##  3     7         0
##  4     3        -1
##  5     5        -1
##  6     8        -1
##  7     1        -2
##  8     2        -2
##  9     4        -2
## 10    11        -2
## 11     9        -3
## 12    10        -3
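
Rankings like the two above come from grouping by month, summarising, and sorting. The sketch below uses invented data (`toy_flights` is hypothetical; only the column names month and dep_delay follow the data set) to show how one extreme delay can change the ranking by mean while leaving the ranking by median untouched.

```r
library(dplyr)

# Hypothetical data: month 10 contains one extreme 300-minute delay.
toy_flights <- tibble(
  month     = c(9, 9, 9, 10, 10, 10),
  dep_delay = c(-3, -2, 4, -4, -3, 300)
)

monthly <- toy_flights %>%
  group_by(month) %>%
  summarise(
    mean_dd   = mean(dep_delay),
    median_dd = median(dep_delay)
  )

by_mean   <- monthly %>% arrange(desc(mean_dd))    # worst month by mean first
by_median <- monthly %>% arrange(desc(median_dd))  # worst month by median first

by_mean
by_median
```

Here month 10 looks worst by mean because of the single long delay, yet it looks best by median, which illustrates the trade-off between the two choices.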

On-Time Departure Rate for NYC Airports

Exercise 6: Which NYC airport would you choose to fly out of? You should also visualize the distribution of the on-time departure rate across the three airports using a segmented bar plot.

## # A tibble: 3 × 2
##   origin ot_dep_rate
##   <chr>        <dbl>
## 1 LGA          0.728
## 2 JFK          0.694
## 3 EWR          0.637

I would choose to fly out of LaGuardia Airport (LGA), as it has the highest on-time departure rate (about 72.8%), ahead of JFK (69.4%) and Newark (63.7%).
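
An on-time rate per airport can be computed by first labelling each flight and then taking the share of on-time flights within each origin. The sketch below makes two assumptions that this report does not state: a flight counts as "on time" if it departs fewer than 5 minutes late, and `toy_flights` is an invented data frame used in place of nycflights.

```r
library(dplyr)

# Hypothetical data; only the column names origin and dep_delay follow the data set.
toy_flights <- tibble(
  origin    = c("LGA", "LGA", "LGA", "LGA", "EWR", "EWR", "EWR", "EWR"),
  dep_delay = c(-3, 0, 2, 30, -1, 10, 25, 60)
)

ot_rates <- toy_flights %>%
  # Assumed threshold: fewer than 5 minutes late counts as on time.
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed")) %>%
  group_by(origin) %>%
  summarise(ot_dep_rate = mean(dep_type == "on time")) %>%  # share of on-time flights
  arrange(desc(ot_dep_rate))                                 # best airport first

ot_rates
```

For the segmented bar plot, ggplot2's geom_bar() with origin on the x axis and dep_type as the fill is one option; setting position = "fill" scales each bar to proportions so the airports are directly comparable.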

More Practice

Exercise 7: Mutate the data frame so that it includes a new variable, avg_speed, that contains the average speed traveled by the plane for each flight (in mph). Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that air_time is given in minutes.

# Average speed in mph: miles divided by hours (air_time is in minutes)
nycflights <- nycflights %>%
  mutate(avg_speed = distance / air_time * 60)
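
As a quick sanity check on the unit conversion, the same formula can be applied to a single made-up flight. The distance and air time below are hypothetical values chosen for easy arithmetic.

```r
# Hypothetical flight: 2475 miles flown in 330 minutes of air time.
distance <- 2475   # miles
air_time <- 330    # minutes

# Dividing by minutes gives miles per minute; multiplying by 60 gives miles per hour.
avg_speed <- distance / air_time * 60
avg_speed
```

The flight covers 7.5 miles per minute, which is 450 mph, a plausible cruising speed.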

Exercise 8: Make a scatterplot of avg_speed vs. distance. Describe the relationship between average speed and distance.

The relationship between average speed and distance is nonlinear and roughly logarithmic in shape. Below 500 miles, average speed rises quickly as distance increases. At distances over 1000 miles, however, average speed levels off and stays within a range of about 300-550 miles per hour.

Exercise 9: What is the cutoff point for departure delays at which you can still expect to get to your destination on time?

According to the graph below, it looks as if a flight can depart up to about 60 minutes late and still, for the most part, arrive on time.
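
A numeric cross-check is one way to back up a cutoff read off a plot. The sketch below is not the method used in this report: it bins flights by departure delay and computes the share that arrive on time in each bin. The data frame `toy_flights`, the bin edges, and the definition of "on time" as arr_delay <= 0 are all assumptions made for illustration.

```r
library(dplyr)

# Hypothetical data; only the column names dep_delay and arr_delay follow the data set.
toy_flights <- tibble(
  dep_delay = c(-5, 0, 5, 10, 20, 30, 50, 70, 90, 120),
  arr_delay = c(-20, -15, -8, -2, 5, -1, 20, 55, 80, 115)
)

on_time_by_bin <- toy_flights %>%
  # Assumed bin edges; cut() uses right-closed intervals by default.
  mutate(dep_bin = cut(dep_delay,
                       breaks = c(-Inf, 0, 15, 60, Inf),
                       labels = c("<=0", "1-15", "16-60", ">60"))) %>%
  group_by(dep_bin) %>%
  summarise(
    share_on_time = mean(arr_delay <= 0),  # assumed definition of arriving on time
    n             = n()
  )

on_time_by_bin
```

The cutoff would be the largest departure-delay bin in which most flights still arrive on time; with real data, narrower bins would give a sharper estimate.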