library(tidyverse)
library(openintro)
library(dplyr)
library(ggplot2)
library(lubridate)

We first preview the data to understand its contents and the data type of each variable.

data(nycflights)
glimpse(nycflights)
## Rows: 32,735
## Columns: 16
## $ year      <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month     <int> 6, 5, 12, 5, 7, 1, 12, 8, 9, 4, 6, 11, 4, 3, 10, 1, 2, 8, 10…
## $ day       <int> 30, 7, 8, 14, 21, 1, 9, 13, 26, 30, 17, 22, 26, 25, 21, 23, …
## $ dep_time  <int> 940, 1657, 859, 1841, 1102, 1817, 1259, 1920, 725, 1323, 940…
## $ dep_delay <dbl> 15, -3, -1, -4, -3, -3, 14, 85, -10, 62, 5, 5, -2, 115, -4, …
## $ arr_time  <int> 1216, 2104, 1238, 2122, 1230, 2008, 1617, 2032, 1027, 1549, …
## $ arr_delay <dbl> -4, 10, 11, -34, -8, 3, 22, 71, -8, 60, -4, -2, 22, 91, -6, …
## $ carrier   <chr> "VX", "DL", "DL", "DL", "9E", "AA", "WN", "B6", "AA", "EV", …
## $ tailnum   <chr> "N626VA", "N3760C", "N712TW", "N914DL", "N823AY", "N3AXAA", …
## $ flight    <int> 407, 329, 422, 2391, 3652, 353, 1428, 1407, 2279, 4162, 20, …
## $ origin    <chr> "JFK", "JFK", "JFK", "JFK", "LGA", "LGA", "EWR", "JFK", "LGA…
## $ dest      <chr> "LAX", "SJU", "LAX", "TPA", "ORF", "ORD", "HOU", "IAD", "MIA…
## $ air_time  <dbl> 313, 216, 376, 135, 50, 138, 240, 48, 148, 110, 50, 161, 87,…
## $ distance  <dbl> 2475, 1598, 2475, 1005, 296, 733, 1411, 228, 1096, 820, 264,…
## $ hour      <dbl> 9, 16, 8, 18, 11, 18, 12, 19, 7, 13, 9, 13, 8, 20, 12, 20, 6…
## $ minute    <dbl> 40, 57, 59, 41, 2, 17, 59, 20, 25, 23, 40, 20, 9, 54, 17, 24…

Exercise 1: Look carefully at these three histograms. How do they compare? Are features revealed in one that are obscured in another?

The three histograms differ in the number, width, and height of their bars. The histogram with the smallest bin width (15) has the most bars, and each bar is narrower and shorter. The opposite is true for the histogram with the largest bin width (150), which has the fewest, widest, and tallest bars. The bin width of 15 reveals the most detail, but it also shows more of the noise in the data, which makes meaningful patterns harder to discern. With the bin width of 150, on the other hand, important features of the distribution are obscured. The histogram where the bin width is not specified (the default of 30 bins) seems to give the best visual representation of the distribution of the data.

ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)

ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 150)
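To make the comparison concrete, here is a minimal sketch that estimates how many bars each bin width produces by dividing the range of dep_delay by the bin width. The names delay_range and approx_bins are mine, and the counts are only approximate because geom_histogram() may place bin edges slightly differently.

```r
library(openintro)

data(nycflights)

# Width of the full range of departure delays, in minutes
delay_range <- diff(range(nycflights$dep_delay, na.rm = TRUE))

# Approximate number of bins needed to cover that range for each bin width
binwidths <- c(15, 150)
approx_bins <- ceiling(delay_range / binwidths)
names(approx_bins) <- paste0("binwidth_", binwidths)

approx_bins
```

The smaller bin width needs roughly ten times as many bars to cover the same range, which is why it shows more detail and more noise.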

Exercise 2: Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many flights meet these criteria?

There were 68 flights headed to SFO in February.

count_of_sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)  %>%
  summarise("Number of SFO flights" = n())

count_of_sfo_feb_flights

Exercise 3: Describe the distribution of the arrival delays of these flights using a histogram and appropriate summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution.

The distribution of arrival delays is shown below. The distribution seems to be slightly skewed to the left, with some outlying arrival delays. For this subset of the data, the median is a better measure of center because it is robust to outliers and to skewness. The IQR is also calculated because it is robust to outliers, even though it does not address skewness.

sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)

ggplot(sfo_feb_flights, aes(arr_delay)) +
  geom_histogram(bins = 30)

sfo_feb_flights %>%
  summarise(
    median_ad = median(arr_delay), 
    iqr_ad = IQR(arr_delay), 
    n = n()
    )
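One quick way to double-check the direction of the skew read off the histogram is to compare the mean with the median and to look at the extremes. This is a sketch of that check, not part of the original exercise, and the name shape_check is mine. A mean above the median usually points to a longer right tail, and a mean below the median to a longer left tail.

```r
library(dplyr)
library(openintro)

data(nycflights)

# Same subset as above: flights to SFO in February
sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)

# Compare the center and the extremes of the arrival delays
shape_check <- sfo_feb_flights %>%
  summarise(
    min_ad    = min(arr_delay),
    median_ad = median(arr_delay),
    mean_ad   = mean(arr_delay),
    max_ad    = max(arr_delay)
  )

shape_check
```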

Exercise 4: Calculate the median and interquartile range for arr_delays of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?

Below are the median and IQR of the SFO arrival delays, grouped by carrier. UA has the most variable arrival delays.

sfo_feb_flights %>%
  group_by(carrier) %>%
  summarise(
    median_ad = median(arr_delay), 
    iqr_ad = IQR(arr_delay),
    n = n()
    )
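To make the most variable carrier easier to spot, the same summary can be sorted by IQR in descending order. This is a minimal sketch, and the name sfo_feb_by_carrier is mine. If two carriers have the same IQR, the sort alone does not separate them, so the top rows should be compared directly.

```r
library(dplyr)
library(openintro)

data(nycflights)

# Per-carrier summary of arrival delays for SFO flights in February,
# sorted so that the carrier with the largest IQR comes first
sfo_feb_by_carrier <- nycflights %>%
  filter(dest == "SFO", month == 2) %>%
  group_by(carrier) %>%
  summarise(
    median_ad = median(arr_delay),
    iqr_ad = IQR(arr_delay),
    n = n()
  ) %>%
  arrange(desc(iqr_ad))

sfo_feb_by_carrier
```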

Exercise 5: Suppose you really dislike departure delays and you want to schedule your travel in a month that minimizes your potential departure delay leaving NYC. One option is to choose the month with the lowest mean departure delay. Another option is to choose the month with the lowest median departure delay. What are the pros and cons of these two choices?

Each choice, the month with the lowest mean departure delay or the month with the lowest median departure delay, has its own pros and cons. Here’s a breakdown of each option:

Choosing the Month with the Lowest Mean Departure Delay:

Pros:

Reflects Overall Performance: The mean departure delay considers all the data points and gives you an idea of the average delay throughout the month. This can be useful for getting a general sense of what to expect.

Cons:

Affected by Outliers: The mean can be heavily influenced by extreme values or outliers. A few instances of very long delays can significantly skew the mean, making it not very representative of the typical experience.

May Not Capture Variability: It doesn’t tell you how delays are distributed throughout the month. You could have days with minimal delay and other days with substantial delays, and the mean may not reveal this variation.

Choosing the Month with the Lowest Median Departure Delay:

Pros:

Resistant to Outliers: The median is less affected by outliers compared to the mean. If there are a few days with exceptionally long delays, the median will remain relatively stable and represent the middle value in the dataset.

Represents the Typical Flight: Half of the flights in a month depart with a delay at or below the median, so the month with the lowest median is the month in which a typical flight is delayed the least. Because the median ignores how large the extreme delays are, it says little about worst-case delays.

Cons:

Less Reflective of Average Experience: While the median is more robust to outliers, it doesn’t provide information about the average experience. There could be a month with a low median delay but a relatively high number of days with moderate delays.
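The two options can be compared side by side by computing both statistics for each month. This is a minimal sketch using the same nycflights data, and the names monthly_delays, mean_dd, and median_dd are mine. Months where the mean sits well above the median are months where a few long delays pull the mean up.

```r
library(dplyr)
library(openintro)

data(nycflights)

# Mean and median departure delay for each month, lowest mean first
monthly_delays <- nycflights %>%
  group_by(month) %>%
  summarise(
    mean_dd = mean(dep_delay),
    median_dd = median(dep_delay)
  ) %>%
  arrange(mean_dd)

monthly_delays
```

Sorting by median_dd instead of mean_dd shows whether the two criteria pick the same month.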

Exercise 6: If you were selecting an airport simply based on on-time departure percentage, which NYC airport would you choose to fly out of?

Based on the on-time departure rates below, and assuming that a departure delay of less than 5 minutes counts as on time, LGA has the highest proportion of on-time departures. I would therefore choose to fly out of LGA.

# add a departure type variable
nycflights <- nycflights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))

# group flights by airport and calculate on-time departure rate.
nycflights %>%
  group_by(origin) %>%
  summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
  arrange(desc(ot_dep_rate))

Exercise 7: Mutate the data frame so that it includes a new variable that contains the average speed, avg_speed, traveled by the plane for each flight (in mph). Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that air_time is given in minutes.

The average speed for each flight in mph is calculated by first dividing the air time by 60 to convert it from minutes to hours, and then dividing the distance by that air time in hours. The calculation and a preview of the new data frame are shown below.

nycflights <- nycflights %>%  
              mutate(avg_speed  = (distance/(air_time/60)))

glimpse(nycflights)
## Rows: 32,735
## Columns: 18
## $ year      <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month     <int> 6, 5, 12, 5, 7, 1, 12, 8, 9, 4, 6, 11, 4, 3, 10, 1, 2, 8, 10…
## $ day       <int> 30, 7, 8, 14, 21, 1, 9, 13, 26, 30, 17, 22, 26, 25, 21, 23, …
## $ dep_time  <int> 940, 1657, 859, 1841, 1102, 1817, 1259, 1920, 725, 1323, 940…
## $ dep_delay <dbl> 15, -3, -1, -4, -3, -3, 14, 85, -10, 62, 5, 5, -2, 115, -4, …
## $ arr_time  <int> 1216, 2104, 1238, 2122, 1230, 2008, 1617, 2032, 1027, 1549, …
## $ arr_delay <dbl> -4, 10, 11, -34, -8, 3, 22, 71, -8, 60, -4, -2, 22, 91, -6, …
## $ carrier   <chr> "VX", "DL", "DL", "DL", "9E", "AA", "WN", "B6", "AA", "EV", …
## $ tailnum   <chr> "N626VA", "N3760C", "N712TW", "N914DL", "N823AY", "N3AXAA", …
## $ flight    <int> 407, 329, 422, 2391, 3652, 353, 1428, 1407, 2279, 4162, 20, …
## $ origin    <chr> "JFK", "JFK", "JFK", "JFK", "LGA", "LGA", "EWR", "JFK", "LGA…
## $ dest      <chr> "LAX", "SJU", "LAX", "TPA", "ORF", "ORD", "HOU", "IAD", "MIA…
## $ air_time  <dbl> 313, 216, 376, 135, 50, 138, 240, 48, 148, 110, 50, 161, 87,…
## $ distance  <dbl> 2475, 1598, 2475, 1005, 296, 733, 1411, 228, 1096, 820, 264,…
## $ hour      <dbl> 9, 16, 8, 18, 11, 18, 12, 19, 7, 13, 9, 13, 8, 20, 12, 20, 6…
## $ minute    <dbl> 40, 57, 59, 41, 2, 17, 59, 20, 25, 23, 40, 20, 9, 54, 17, 24…
## $ dep_type  <chr> "delayed", "on time", "on time", "on time", "on time", "on t…
## $ avg_speed <dbl> 474.4409, 443.8889, 394.9468, 446.6667, 355.2000, 318.6957, …
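As a sanity check on the formula, the first flight in the preview can be worked by hand: its distance is 2475 and its air_time is 313 minutes. The variable names below are mine, and the values are copied from the output above.

```r
# Values for the first flight shown in the glimpse() output
distance_first <- 2475   # distance
air_time_first <- 313    # air time in minutes

# Convert minutes to hours, then divide distance by hours
hours_first <- air_time_first / 60
avg_speed_first <- distance_first / hours_first

round(avg_speed_first, 4)
# 474.4409, matching the first avg_speed value in the preview
```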

Exercise 8: Make a scatterplot of avg_speed vs. distance. Describe the relationship between average speed and distance. Hint: Use geom_point().

The scatter plot shows that average speed is lower for shorter distances, likely because a relatively larger proportion of the flight is spent on takeoff and landing. For longer distances the average speed is higher, because more of the flight is spent at cruise. The average speed plateaus as distance increases, once flights spend most of their time at cruise speed.

nycflights <- nycflights %>%  
              mutate(avg_speed  = (distance/(air_time/60)))

# avg_speed vs. distance: distance on the x-axis, average speed on the y-axis
ggplot(nycflights, 
       aes(x = distance, 
           y = avg_speed)) +
  geom_point() + 
  labs( 
       title = "Relationship between average speed and distance to destination",
       x = "Total Distance (miles)",
       y = "Average Speed (mph)"
       ) +
  theme(plot.title = element_text(hjust = 0.5,
                                  size = 14)
        ) 

Exercise 9: Replicate the following plot. Hint: The data frame plotted only contains flights from American Airlines, Delta Airlines, and United Airlines, and the points are colored by carrier. Once you replicate the plot, determine (roughly) what the cutoff point is for departure delays where you can still expect to get to your destination on time.

To replicate the plot, we use the filter function with the %in% operator to select only the rows where the carrier variable is AA, DL, or UA. We then call ggplot with dep_delay, arr_delay, and carrier as the x, y, and color arguments of the aes function, and add geom_point to draw the scatterplot.

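Assuming that an arrival delay of less than 5 minutes counts as an on-time arrival, the table below the graph shows the cutoff departure delay for each destination, that is, the largest departure delay among flights that departed late but still arrived on time. The table is computed from all carriers in nycflights. The third column gives the number of flights that meet these criteria.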

AA_DL_UA_flights <- nycflights %>%
  filter(carrier %in% c("AA", "DL", "UA"))
 

ggplot(AA_DL_UA_flights, 
       aes(dep_delay, arr_delay, color = carrier)) + 
  geom_point() 

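# flights that departed late (dep_delay > 0) but still arrived on time (arr_delay < 5)
ontime_nyc <- nycflights %>%
  select(dest, arr_delay, dep_delay) %>%
  group_by(dest) %>%
  filter(arr_delay < 5 & dep_delay > 0)

# largest such departure delay per destination, with the number of flights
max_dep_delay <- ontime_nyc %>%
  summarise(cutoff_dep_delay = max(dep_delay), n = n()) %>%
  arrange(desc(cutoff_dep_delay))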
  

max_dep_delay
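Since the plot itself only covers AA, DL, and UA, an alternative reading of the exercise is to compute the cutoff for those three carriers only. The sketch below does that per carrier under the same assumption that an arrival delay of less than 5 minutes is on time. The name three_carrier_cutoff is mine. The maximum is driven by a single flight per carrier, so it is a rough upper bound rather than a typical cutoff.

```r
library(dplyr)
library(openintro)

data(nycflights)

# Late departures that still arrived on time, for the three plotted carriers
three_carrier_cutoff <- nycflights %>%
  filter(
    carrier %in% c("AA", "DL", "UA"),
    arr_delay < 5,
    dep_delay > 0
  ) %>%
  group_by(carrier) %>%
  summarise(cutoff_dep_delay = max(dep_delay), n = n()) %>%
  arrange(desc(cutoff_dep_delay))

three_carrier_cutoff
```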