Question 1

glimpse(flights)
## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
?flights

The data were originally collected from RITA and the Bureau of Transportation Statistics. Each row represents a single flight that departed from a New York City airport in 2013. There are 336,776 rows and 19 columns in the data. Each column is a variable that records one attribute of a flight; for example, dep_delay records the departure delay in minutes. Some data are missing: after I applied drop_na(), rows were removed.
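One way to check for missing data directly, rather than inferring it from drop_na(), is to count the NA values in each column. The following is a minimal sketch on a small made-up tibble; toy_flights and its values are hypothetical stand-ins for flights, and the code assumes dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for `flights`: four flights, two with missing values.
toy_flights <- tibble(
  origin    = c("EWR", "JFK", "LGA", "EWR"),
  dep_delay = c(2, NA, -5, 10),
  arr_delay = c(11, NA, NA, 4)
)

# Number of missing values in each column.
na_counts <- toy_flights %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

# Number of rows drop_na() removes (rows with at least one NA in any column).
rows_removed <- nrow(toy_flights) - nrow(drop_na(toy_flights))

na_counts
rows_removed
```

Replacing toy_flights with flights should give the same kind of per-column breakdown for the real data.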

Question 2

flights %>%
  ggplot(aes(x = dep_delay)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 8255 rows containing non-finite values (stat_bin).

The distribution of departure delays is unimodal and right-skewed, with one large peak around 0. No outliers are apparent in the histogram, and most flights departed within an hour of their scheduled time, whether early or late.
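A histogram with the default 30 bins can hide a thin right tail, so a numeric check may help back up statements about outliers and about most delays falling within an hour. The following is a rough sketch on made-up delays; toy_delays and its values are hypothetical, and the code assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical delays in minutes (negative = early departure); not real data.
toy_delays <- tibble(dep_delay = c(-10, -3, 0, 5, 20, 45, 90, 400, NA))

delay_summary <- toy_delays %>%
  summarize(
    # Share of non-missing delays within 60 minutes of schedule, early or late.
    share_within_hour = mean(abs(dep_delay) <= 60, na.rm = TRUE),
    # Largest observed delay, ignoring missing values.
    max_delay = max(dep_delay, na.rm = TRUE)
  )

delay_summary
```

Running the same summarize() call on flights would show how large the longest delay is compared with the bulk of the distribution. Passing a binwidth argument to geom_histogram(), as the stat_bin() message suggests, may also make the tail easier to see.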

Question 3

flights %>%
  ggplot(aes(x = origin)) +
  geom_bar()

The airport with the most flights is EWR.
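The bar heights can be hard to compare by eye when they are close, so a table of counts is one way to confirm the reading of the plot. The following is a minimal sketch, assuming dplyr is installed; toy_flights is a hypothetical stand-in for flights.

```r
library(dplyr)

# Hypothetical stand-in for `flights`: one row per flight.
toy_flights <- tibble(origin = c("EWR", "JFK", "EWR", "LGA", "EWR", "JFK"))

# Count flights per origin airport, largest count first.
origin_counts <- toy_flights %>%
  count(origin, sort = TRUE)

origin_counts
```

The first row of the result is the airport with the most flights.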

Question 4

flights_no_nas <- flights %>%
  drop_na()
flights_no_nas %>%
  group_by(origin) %>%
  summarize(mean(dep_delay))
## # A tibble: 3 × 2
##   origin `mean(dep_delay)`
##   <chr>              <dbl>
## 1 EWR                 15.0
## 2 JFK                 12.0
## 3 LGA                 10.3
flights_no_nas %>%
  group_by(origin) %>%
  summarize(sd(dep_delay))
## # A tibble: 3 × 2
##   origin `sd(dep_delay)`
##   <chr>            <dbl>
## 1 EWR               41.2
## 2 JFK               38.8
## 3 LGA               39.9
flights %>%
  ggplot(aes(x = dep_delay)) +
  geom_histogram() +
  facet_wrap(~origin)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 8255 rows containing non-finite values (stat_bin).

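The distribution of departure delays at each airport is unimodal and right-skewed, with a large peak around 0. However, the mean delays differ: EWR’s mean is the largest at 15.009 minutes, JFK’s mean is 12.023 minutes, and LGA’s mean is the smallest at 10.286 minutes. The standard deviations are similar across airports (41.2, 38.8, and 39.9 minutes, respectively).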
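The mean and standard deviation can also be computed in a single summarize() call with named columns, which keeps the comparison in one table. The following is a minimal sketch on made-up data, assuming dplyr is installed; toy_flights and the column names mean_delay, sd_delay, median_delay, and n_used are hypothetical choices.

```r
library(dplyr)

# Hypothetical stand-in for `flights`; values are made up for illustration.
toy_flights <- tibble(
  origin    = c("EWR", "EWR", "JFK", "JFK", "LGA", "LGA", "LGA"),
  dep_delay = c(10, 20, 0, 12, NA, 6, 2)
)

delay_by_origin <- toy_flights %>%
  group_by(origin) %>%
  summarize(
    mean_delay   = mean(dep_delay, na.rm = TRUE),
    sd_delay     = sd(dep_delay, na.rm = TRUE),
    median_delay = median(dep_delay, na.rm = TRUE),
    # Number of flights with a recorded departure delay.
    n_used       = sum(!is.na(dep_delay))
  )

delay_by_origin
```

Because the distributions are right-skewed, the median is included as a center that is less sensitive to the long tail. Note that na.rm = TRUE ignores only missing dep_delay values, whereas drop_na() removes rows with a missing value in any column, so results on flights may differ slightly from the output above.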

Question 5

flights %>%
  drop_na() %>%
  ggplot(aes(x = dep_delay, 
             y = arr_delay, 
             color = origin,
             shape = origin)) +
  geom_point(alpha = 0.5) +
  labs(x = "Departure Delay (minutes)",
       y = "Arrival Delay (minutes)",
       title = "Arrival delay vs. departure delay in NYC airports",
       color = "Origin Airport",
       shape = "Origin Airport") +
  scale_color_manual(values = c("#999999", "#E69F00", "#56B4E9")) +
  theme_bw()

Citations: For question #5 I used these two websites to help write my code.

“GGPLOT2 Colors : How to Change Colors Automatically and Manually?” STHDA, http://www.sthda.com/english/wiki/ggplot2-colors-how-to-change-colors-automatically-and-manually.

“Function Reference.” Function Reference • ggplot2, https://ggplot2.tidyverse.org/reference/index.html.