In this lab we explore flights, specifically a random sample of domestic flights that departed from the three major New York City airports in 2013.

We will generate simple graphical and numerical summaries of data on these flights and explore delay times. You can find the instructions for this lab here

# 1. load the library "tidyverse"
library(tidyverse)

# 2. use the read_csv() function to read the dataset
nycflights <- read_csv("data/nycflights.csv")
glimpse(nycflights)
## Rows: 32,735
## Columns: 16
## $ year      <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month     <dbl> 6, 5, 12, 5, 7, 1, 12, 8, 9, 4, 6, 11, 4, 3, 10, 1, 2, 8, 10…
## $ day       <dbl> 30, 7, 8, 14, 21, 1, 9, 13, 26, 30, 17, 22, 26, 25, 21, 23, …
## $ dep_time  <dbl> 940, 1657, 859, 1841, 1102, 1817, 1259, 1920, 725, 1323, 940…
## $ dep_delay <dbl> 15, -3, -1, -4, -3, -3, 14, 85, -10, 62, 5, 5, -2, 115, -4, …
## $ arr_time  <dbl> 1216, 2104, 1238, 2122, 1230, 2008, 1617, 2032, 1027, 1549, …
## $ arr_delay <dbl> -4, 10, 11, -34, -8, 3, 22, 71, -8, 60, -4, -2, 22, 91, -6, …
## $ carrier   <chr> "VX", "DL", "DL", "DL", "9E", "AA", "WN", "B6", "AA", "EV", …
## $ tailnum   <chr> "N626VA", "N3760C", "N712TW", "N914DL", "N823AY", "N3AXAA", …
## $ flight    <dbl> 407, 329, 422, 2391, 3652, 353, 1428, 1407, 2279, 4162, 20, …
## $ origin    <chr> "JFK", "JFK", "JFK", "JFK", "LGA", "LGA", "EWR", "JFK", "LGA…
## $ dest      <chr> "LAX", "SJU", "LAX", "TPA", "ORF", "ORD", "HOU", "IAD", "MIA…
## $ air_time  <dbl> 313, 216, 376, 135, 50, 138, 240, 48, 148, 110, 50, 161, 87,…
## $ distance  <dbl> 2475, 1598, 2475, 1005, 296, 733, 1411, 228, 1096, 820, 264,…
## $ hour      <dbl> 9, 16, 8, 18, 11, 18, 12, 19, 7, 13, 9, 13, 8, 20, 12, 20, 6…
## $ minute    <dbl> 40, 57, 59, 41, 2, 17, 59, 20, 25, 23, 40, 20, 9, 54, 17, 24…

Exercise 1

Question: Experiment with different binwidths to adjust your histogram in a way that will show its important features. You may also want to try using the + scale_x_log10(). What features are revealed now that were obscured in the original histogram?

ggplot(data = nycflights, 
       aes(x = dep_delay)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = nycflights, aes(x = dep_delay)) + geom_histogram(binwidth = 15)

ggplot(data = nycflights, aes(x = dep_delay)) + geom_histogram(binwidth = 150)

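Answer: Smaller binwidths reveal more detail, such as where the delays concentrate and how long the right tail is. Larger binwidths smooth the distribution; at binwidth = 150 most flights collapse into very few bars, so those features are obscured.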

Exercise 2

Question: Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many flights meet these criteria?

sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)
meet_crit <- nrow(sfo_feb_flights)

Answer: 68 flights meet these criteria.

Exercise 3

Question: Describe the distribution of the arrival delays of flights headed to SFO in February, using an appropriate histogram and summary statistics.

ggplot(data = sfo_feb_flights, aes(x = arr_delay)) +
  geom_histogram(binwidth = 5)

summary(sfo_feb_flights$arr_delay)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -66.00  -21.25  -11.00   -4.50    2.00  196.00

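Answer: The distribution is right-skewed: a few long delays (up to 196 minutes) pull the mean (-4.5 minutes) above the median (-11 minutes). Separately, the negative median and a third quartile of only 2 minutes show that most flights arrived early or close to on time. Because of the skew, the median and IQR describe this distribution better than the mean.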
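
The claim about early arrivals can be checked numerically rather than read off the histogram. The following is a minimal sketch, assuming a made-up vector `toy_delays` that stands in for `sfo_feb_flights$arr_delay`; the values are illustrative and do not come from the dataset.

```r
# Hypothetical arrival delays in minutes (illustrative values only)
toy_delays <- c(-30, -22, -15, -11, -8, -3, 2, 10, 45, 196)

# Right skew: a few large delays pull the mean above the median
mean_delay   <- mean(toy_delays)
median_delay <- median(toy_delays)
mean_delay > median_delay

# Share of flights that arrived early or on time (arr_delay <= 0)
share_early <- mean(toy_delays <= 0)
share_early
```

On the real data, the same two checks would be run on `sfo_feb_flights$arr_delay`.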

Exercise 4

Question: Calculate the median and interquartile range for arr_delays of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?

sfo_feb_flights %>%
  group_by(carrier) %>%
  summarise(
    median_ad = median(arr_delay),
    iqr_ad = IQR(arr_delay),
    n_flights = n()
  ) %>%
  arrange(desc(iqr_ad))

Answer: DL and UA have the most variable arrival delays, judged by the IQR.
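
When two carriers share the largest IQR, sorting alone leaves the tie to be spotted by eye. One way to report every tied carrier is to filter on the maximum IQR. This is a sketch under stated assumptions: `toy_flights` is a hypothetical data frame with invented values, used only so the example runs on its own.

```r
library(dplyr)

# Hypothetical flights (illustrative values only)
toy_flights <- tibble(
  carrier   = c("AA", "AA", "AA", "DL", "DL", "DL", "UA", "UA", "UA"),
  arr_delay = c(-5, 0, 5, -20, 0, 20, -20, 0, 20)
)

most_variable <- toy_flights %>%
  group_by(carrier) %>%
  summarise(iqr_ad = IQR(arr_delay), n_flights = n()) %>%
  # Keep every carrier whose IQR equals the maximum, so ties are reported
  filter(iqr_ad == max(iqr_ad))

most_variable
```

Keeping `n_flights` in the summary is deliberate: an IQR computed from only a handful of flights is a weak basis for comparing carriers.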

Exercise 5

Question: Create a list of origin airports and their rate of on-time departure. Then visualize the distribution of on-time departure rate across the three airports using a segmented bar plot (see below). If you could select an airport based on on-time departure percentage, which NYC airport would you choose to fly out of? Hint: For the segmented bar plot, you will need to map the aesthetic arguments as follows: x = origin, fill = dep_type and a geom_bar() layer. Create three plots, one with a geom_bar() layer, one with geom_bar(position = "fill") and the third with geom_bar(position = "dodge"). Explain the difference between the three results.

nycflights <- nycflights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))

nycflights %>%
  group_by(origin) %>%
  summarise(on_time_dep_rate = sum(dep_type == "on time") / n()) %>%
  arrange(desc(on_time_dep_rate))

ggplot(data = nycflights, aes(x = origin, fill = dep_type)) +
  geom_bar()

ggplot(data = nycflights, aes(x = origin, fill = dep_type)) +
  geom_bar(position = "fill")

ggplot(data = nycflights, aes(x = origin, fill = dep_type)) +
  geom_bar(position = "dodge")

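Answer: I would select LGA, because it has the highest on-time departure rate of the three airports. The three plots differ in how the bars are placed: geom_bar() stacks the counts of on-time and delayed flights for each airport, position = "fill" rescales each stacked bar to proportions so the rates can be compared directly, and position = "dodge" places the on-time and delayed counts side by side.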
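
The rate per airport is the mean of a logical condition, which gives the same result as `sum(...) / n()`. The sketch below illustrates this on a hypothetical `toy_flights` data frame (invented values, not the lab data), using the same cutoff of less than 5 minutes as above.

```r
library(dplyr)

# Hypothetical departures (illustrative values only)
toy_flights <- tibble(
  origin    = rep(c("EWR", "JFK", "LGA"), each = 4),
  dep_delay = c(-2, 10, 30, 8,   -5, 0, 12, 25,   -1, 4, 1, 20)
)

on_time_rates <- toy_flights %>%
  mutate(dep_type = if_else(dep_delay < 5, "on time", "delayed")) %>%
  group_by(origin) %>%
  # mean() of a logical vector is the proportion of TRUE values
  summarise(on_time_dep_rate = mean(dep_type == "on time")) %>%
  arrange(desc(on_time_dep_rate))

on_time_rates
```

These are the proportions that the `position = "fill"` plot displays as bar segments.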

Exercise 6

Question: Mutate the data frame so that it includes a new variable that contains the average speed, avg_speed traveled by the plane for each flight (in mph, or if you are brave, in km/h). Now make a scatter plot of distance vs. avg_speed. Think carefully which of the two variables is the predictor (on the x-axis) and which is the outcome variable (on the y-axis) and explain why you made this choice. Describe the relationship between average speed and distance.

nycflights <- nycflights %>%
  mutate(avg_speed = distance / (air_time / 60))

ggplot(nycflights, aes(x = distance, y = avg_speed)) +
  geom_point()

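Answer: Distance is the predictor (x-axis) and average speed is the outcome (y-axis), because the distance of a route is fixed in advance while the speed achieved depends on it. There is a positive relationship between the two: the farther an airplane flies, the higher its average speed. The relationship is approximately linear.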
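
The question also allows speed in km/h. The conversion can be sketched as follows, assuming distance is in miles and air_time is in minutes, as the mph calculation above implies. The data frame `toy_flights` and the column name `avg_speed_kmh` are hypothetical names introduced for illustration.

```r
library(dplyr)

# Hypothetical flights: distance in miles, air_time in minutes
toy_flights <- tibble(
  distance = c(300, 1000, 2400),
  air_time = c(60, 150, 300)
)

miles_to_km <- 1.609344  # one international mile in kilometres

toy_speeds <- toy_flights %>%
  mutate(
    avg_speed     = distance / (air_time / 60),  # miles per hour
    avg_speed_kmh = avg_speed * miles_to_km      # kilometres per hour
  )

toy_speeds
```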

Exercise 7

Question: Replicate the following plot and determine what is the cutoff point for the latest departure delay where you can still have a chance to arrive at your destination on time.

# Keep only AA, DL and UA flights that arrived on time or early
aa_da_ua <- nycflights %>%
  filter(carrier %in% c("AA", "DL", "UA"), arr_delay <= 0)

ggplot(aa_da_ua, aes(x = dep_delay, y = arr_delay, color = carrier)) +
  geom_point() +
  xlab("Departure delay (in minutes)") +
  ylab("Arrival delay (in minutes)")

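Answer: About 64 minutes. This is the largest departure delay among the AA, DL and UA flights in the plot that still arrived on time (arr_delay <= 0).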
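
Reading the cutoff off a scatter plot is imprecise, so it may help to compute it as well. This is a minimal sketch on a hypothetical `toy_flights` data frame (invented values, so the number it yields is not the lab's answer): it takes the largest departure delay among the three carriers' flights that still arrived on time.

```r
library(dplyr)

# Hypothetical flights (illustrative values only)
toy_flights <- tibble(
  carrier   = c("AA", "DL", "UA", "AA", "B6", "DL"),
  dep_delay = c(10, 40, 55, 90, 120, -3),
  arr_delay = c(-5, 0, -1, 30, -10, -20)
)

cutoff <- toy_flights %>%
  # Same filter as the plot: three carriers, arrived on time or early
  filter(carrier %in% c("AA", "DL", "UA"), arr_delay <= 0) %>%
  summarise(max_dep_delay = max(dep_delay)) %>%
  pull(max_dep_delay)

cutoff
```

Applied to `nycflights`, the same pipeline would give the exact value behind the answer read from the plot.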