Lesson 2

In this lab we explore flights, specifically a random sample of domestic flights that departed from the three major New York City airports in 2013.

We will generate simple graphical and numerical summaries of data on these flights and explore delay times. You can find the instructions for this lab here.

# 1. load the library "tidyverse"
library(tidyverse)

# 2. use the read_csv() function to read the dataset

nycflights <- read_csv('data/nycflights.csv')

Exercise 1

Question: Experiment with different binwidths to adjust your histogram in a way that will show its important features. You may also want to try adding + scale_x_log10(). What features are revealed now that were obscured in the original histogram? Note: When using the log scale, you may need to experiment with bin widths that are smaller than 1, such as binwidth = 0.1 or even less!

# Write your code to create a histogram of delays

glimpse(nycflights)

## Rows: 32,735
## Columns: 16
## $ year      <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month     <dbl> 6, 5, 12, 5, 7, 1, 12, 8, 9, 4, 6, 11, 4, 3, 10, 1, 2, 8, 10…
## $ day       <dbl> 30, 7, 8, 14, 21, 1, 9, 13, 26, 30, 17, 22, 26, 25, 21, 23, …
## $ dep_time  <dbl> 940, 1657, 859, 1841, 1102, 1817, 1259, 1920, 725, 1323, 940…
## $ dep_delay <dbl> 15, -3, -1, -4, -3, -3, 14, 85, -10, 62, 5, 5, -2, 115, -4, …
## $ arr_time  <dbl> 1216, 2104, 1238, 2122, 1230, 2008, 1617, 2032, 1027, 1549, …
## $ arr_delay <dbl> -4, 10, 11, -34, -8, 3, 22, 71, -8, 60, -4, -2, 22, 91, -6, …
## $ carrier   <chr> "VX", "DL", "DL", "DL", "9E", "AA", "WN", "B6", "AA", "EV", …
## $ tailnum   <chr> "N626VA", "N3760C", "N712TW", "N914DL", "N823AY", "N3AXAA", …
## $ flight    <dbl> 407, 329, 422, 2391, 3652, 353, 1428, 1407, 2279, 4162, 20, …
## $ origin    <chr> "JFK", "JFK", "JFK", "JFK", "LGA", "LGA", "EWR", "JFK", "LGA…
## $ dest      <chr> "LAX", "SJU", "LAX", "TPA", "ORF", "ORD", "HOU", "IAD", "MIA…
## $ air_time  <dbl> 313, 216, 376, 135, 50, 138, 240, 48, 148, 110, 50, 161, 87,…
## $ distance  <dbl> 2475, 1598, 2475, 1005, 296, 733, 1411, 228, 1096, 820, 264,…
## $ hour      <dbl> 9, 16, 8, 18, 11, 18, 12, 19, 7, 13, 9, 13, 8, 20, 12, 20, 6…
## $ minute    <dbl> 40, 57, 59, 41, 2, 17, 59, 20, 25, 23, 40, 20, 9, 54, 17, 24…

lax_flights <- nycflights %>%
  filter(dest == "LAX")

ggplot(data = lax_flights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15, fill = 'lightblue', color = 'darkblue') +
  ggtitle("Departure Delays for Flights to LAX") +
  labs(x = 'Departure Delay (min)', y = 'Number of flights to LAX')

Answer: Adjusting the binwidth gives a clearer view of the shape of the departure delay distribution.
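
The log-scale suggestion in the question can be sketched as follows. This is a minimal illustration on made-up values: the toy_flights data frame is hypothetical and is not the lab data. It assumes that non-positive delays are dropped first, because log10 is undefined for zero and negative values.

```r
library(dplyr)
library(ggplot2)

# Hypothetical delays in minutes; not taken from nycflights
toy_flights <- tibble(dep_delay = c(-5, -2, 0, 1, 3, 8, 15, 40, 120, 300))

# Keep only strictly positive delays so the log10 transform is defined
late_only <- toy_flights %>%
  filter(dep_delay > 0)

# On a log10 axis the binwidth is measured in log10 units,
# so values below 1 are typical
p <- ggplot(late_only, aes(x = dep_delay)) +
  geom_histogram(binwidth = 0.25, fill = 'lightblue', color = 'darkblue') +
  scale_x_log10() +
  labs(x = 'Departure Delay (min, log scale)', y = 'Number of flights')

p
```

Note that filtering changes which flights are shown, so a plot like this describes late departures only, not the full distribution.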

Exercise 2

Question: Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many flights meet these criteria? Try using the function nrow() inside of inline code in your answer, and knit your file to see that your text shows the answer correctly.

# Insert code for Exercise 2 here

sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)

num_flights <- nrow(sfo_feb_flights)

Answer: There were 68 flights from NY to San Francisco in February.
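
The inline-code approach mentioned in the question might look like the sketch below. The toy_flights and toy_sfo_feb data frames are hypothetical stand-ins, used only so the example runs on its own.

```r
library(dplyr)

# Hypothetical sample standing in for nycflights
toy_flights <- tibble(
  dest  = c("SFO", "SFO", "LAX", "SFO", "ORD"),
  month = c(2, 2, 2, 7, 2)
)

toy_sfo_feb <- toy_flights %>%
  filter(dest == "SFO", month == 2)

# In the .Rmd text, the count can be embedded as inline code, e.g.
#   There were `r nrow(toy_sfo_feb)` flights to SFO in February.
# The same sentence built explicitly:
sentence <- paste("There were", nrow(toy_sfo_feb), "flights to SFO in February.")
print(sentence)
```

Writing the count as inline code keeps the text in sync with the data if the filter or the dataset changes.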

Exercise 3

Question: Describe the distribution of the arrival delays of flights headed to SFO in February, using an appropriate histogram and summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution.

# Insert code for Exercise 3 here

# Summarise arrival delays, the variable the question asks about
sfo_feb_flights %>%
  summarise(median_ad = median(arr_delay), iqr_ad = IQR(arr_delay), n_flights = n())

sfo_feb_flights %>%
  ggplot(aes(x = arr_delay)) +
  geom_histogram(binwidth = 15, fill = 'lightblue', color = 'darkblue') +
  ggtitle("Distribution of Arrival Delays for Flights to SFO in February") +
  labs(x = "Arrival Delay (min)", y = "Frequency")

Answer: The distribution is right-skewed, so the median and IQR are more appropriate summaries than the mean and standard deviation.
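
The hint that the summary statistics should depend on the shape of the distribution can be illustrated with a small sketch. The toy_delays vector is hypothetical; it only shows how one very late flight affects the mean and standard deviation far more than the median and IQR.

```r
# Hypothetical arrival delays (minutes) with one very late flight
toy_delays <- c(-10, -5, 0, 5, 10, 200)

mean_delay   <- mean(toy_delays)    # pulled upward by the 200-minute delay
median_delay <- median(toy_delays)  # does not depend on how extreme the largest value is
sd_delay     <- sd(toy_delays)      # inflated by the same outlier
iqr_delay    <- IQR(toy_delays)     # spread of the middle half of the data

c(mean = mean_delay, median = median_delay, sd = sd_delay, iqr = iqr_delay)
```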

Exercise 4

Question: Calculate the median and interquartile range for arr_delays of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?

# Insert code for the Exercise here

sfo_feb_flights %>%
  group_by(carrier) %>%
  summarise(median_ad = median(arr_delay), iqr_ad = IQR(arr_delay), n_flights = n())

ggplot(data = sfo_feb_flights, aes(x = arr_delay)) +
  geom_histogram(binwidth = 5, fill = 'pink', color = 'violet') +
  labs(x = 'Arrival Delay (min)', y = 'Number of flights')

Answer: AA has the most variable arrival delays.
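
One way to read off the most variable carrier is to sort the grouped summary by IQR. The sketch below uses a hypothetical toy_flights data frame with made-up carrier codes "A" and "B", not the lab data.

```r
library(dplyr)

# Hypothetical arrival delays for two made-up carriers
toy_flights <- tibble(
  carrier   = c("A", "A", "A", "A", "B", "B", "B", "B"),
  arr_delay = c(-10, 0, 10, 40, -2, 0, 1, 3)
)

by_carrier <- toy_flights %>%
  group_by(carrier) %>%
  summarise(median_ad = median(arr_delay), iqr_ad = IQR(arr_delay), n_flights = n()) %>%
  arrange(desc(iqr_ad))  # largest IQR, i.e. most variable, first

by_carrier
```

Keeping n_flights in the summary is useful because an IQR computed from only a handful of flights is not very reliable.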

Exercise 5

Question: Create a list of origin airports and their rate of on-time departure. Then visualize the distribution of on-time departure rate across the three airports using a segmented bar plot (see below). If you could select an airport based on on-time departure percentage, which NYC airport would you choose to fly out of? Hint: For the segmented bar plot, you will need to map the aesthetic arguments as follows: x = origin, fill = dep_type and a geom_bar() layer. Create three plots, one with geom_bar() layer, one with geom_bar(position = "fill") and the third with geom_bar(position = "dodge"). Explain the difference between the three results.

# Insert code for the Exercise here

nycflights <- nycflights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))

nycflights %>%
  group_by(origin) %>%
  summarise(on_time_dep_rate = sum(dep_type == "on time") / n()) %>%
  arrange(desc(on_time_dep_rate))

nycflights %>%
  ggplot(aes(x = origin, fill = dep_type)) +
  geom_bar() +
  ggtitle("Distribution of Delayed and On-Time Flights") +
  labs(x = "Place of origin", y = "Amount of flights", fill='Types of departure')

nycflights %>%
  ggplot(aes(x = origin, fill = dep_type)) +
  geom_bar(position = "fill") +
  ggtitle("Proportion of Delayed and On-Time Flights") +
  labs(x = "Place of origin", y = "Proportion of flights", fill = 'Types of departure')

nycflights %>%
  ggplot(aes(x = origin, fill = dep_type)) +
  geom_bar(position='dodge') +
  ggtitle("Distribution of Delayed and On-Time Flights") +
  labs(x = "Place of origin", y = "Amount of flights", fill='Types of departure')
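
The difference between the three bar positions can be seen by computing the heights they draw. By default geom_bar() stacks the raw counts, position = "fill" rescales every bar to a total height of 1 so the segments show proportions, and position = "dodge" places the raw counts side by side. The sketch below uses a hypothetical toy_flights data frame, not the lab data.

```r
library(dplyr)

# Hypothetical flights: four from EWR, two from JFK
toy_flights <- tibble(
  origin   = c("EWR", "EWR", "EWR", "EWR", "JFK", "JFK"),
  dep_type = c("on time", "on time", "on time", "delayed", "on time", "delayed")
)

bar_heights <- toy_flights %>%
  count(origin, dep_type) %>%    # n: heights used by the stacked and dodged bars
  group_by(origin) %>%
  mutate(prop = n / sum(n)) %>%  # prop: heights used by position = "fill"
  ungroup()

bar_heights
```

Because the airports handle different numbers of flights, the proportions are the fairer basis for comparing on-time performance.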

Exercise 6

Question: Mutate the data frame so that it includes a new variable that contains the average speed, avg_speed, traveled by the plane for each flight (in mph). Now make a scatterplot of avg_speed vs. distance. Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that air_time is given in minutes. You will need to use geom_point(). Think carefully which of the two variables is the predictor (on the x-axis) and which is the outcome variable (on the y-axis) and explain why you made this choice. Describe the relationship between average speed and distance.

# Insert code for the Exercise here

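nycflights <- nycflights %>%
  mutate(
    distance_km = distance * 1.609344,  # miles to km; not used in the mph calculation
    air_time_h = air_time / 60,         # air_time is recorded in minutes
    avg_speed = distance / air_time_h   # miles per hour
  )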

nycflights %>%
  ggplot(aes(x = distance, y = avg_speed)) +
  geom_point(color = 'blue') +
  scale_x_log10() +
  scale_y_log10() +
  ggtitle('Average speed vs distance') +
  labs(x = 'Distance (miles)', y = 'Average speed (mph)')

Answer: Distance is the predictor (x-axis) because it is fixed by the route, and average speed is the outcome (y-axis). The general trend is that average speed increases with distance.
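
As a worked check of the speed formula, the first flight in the glimpse() output above (JFK to LAX, distance 2475 miles, air_time 313 minutes) can be computed by hand. The variable names here are illustrative only.

```r
# Values from the first row of the glimpse() output
distance_miles <- 2475
air_time_min   <- 313

# Convert minutes to hours before dividing
air_time_hours <- air_time_min / 60
avg_speed_mph  <- distance_miles / air_time_hours

round(avg_speed_mph, 1)  # 474.4
```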

Exercise 7

Question: Replicate the following plot and determine what is the cutoff point for the latest departure delay where you can still have a chance to arrive at your destination on time. Hint: The data frame plotted only contains flights from American Airlines, Delta Airlines, and United Airlines, and the points are colored by carrier. To determine the cut off point, try scaling the x-axis and the y-axis on the logarithmic scale. You can also filter the data, so that you plot only data where arr_delay <= 0.

# Insert code for the Exercise here

nycflights %>%
  dplyr::filter(carrier %in% c('AA', 'DL', 'UA')) %>%
  dplyr::filter(arr_delay <= 0) %>%
  ggplot(aes(x = dep_delay, y = arr_delay, color = carrier)) +
  geom_point(size = 0.5) +
  labs(x = 'Departure Delay (min)', y = 'Arrival Delay (min)', color = 'Airline') +
  ggtitle('Departure Delay vs. Arrival Delay')

Answer: The graph suggests that a departure delay of more than 25 minutes makes an on-time arrival very unlikely. Beyond 50 minutes, it is almost impossible.
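
The cutoff can also be estimated numerically rather than read off the plot. The sketch below takes the largest departure delay among flights that still arrived on time. It uses a hypothetical toy_flights data frame, so the number it produces says nothing about the lab data.

```r
library(dplyr)

# Hypothetical flights; not taken from nycflights
toy_flights <- tibble(
  carrier   = c("AA", "AA", "DL", "DL", "UA", "UA", "B6"),
  dep_delay = c(-3, 20, 45, 5, 12, 90, 60),
  arr_delay = c(-10, -1, 0, -20, 4, 70, -5)
)

cutoff <- toy_flights %>%
  filter(carrier %in% c("AA", "DL", "UA"), arr_delay <= 0) %>%  # on-time arrivals only
  summarise(latest_on_time_dep_delay = max(dep_delay))

cutoff
```

A maximum is driven by a single flight, so it is best read alongside the scatterplot: the plot shows how rare on-time arrivals become as the departure delay grows.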