SAMPLE TAKE HOME EXAM 1

Questions

Question 1

Question 1 (5 points) - What are the ten most common destinations for flights from NYC airports in 2013? Make a table that lists these in descending order of frequency and shows the number of flights heading to each airport.

flights %>% 
  count(dest, name = "number_flights_dest", sort = TRUE) %>% 
  slice_head(n = 10)

## # A tibble: 10 x 2
##    dest  number_flights_dest
##    <chr>               <int>
##  1 ORD                 17283
##  2 ATL                 17215
##  3 LAX                 16174
##  4 BOS                 15508
##  5 MCO                 14082
##  6 CLT                 14064
##  7 SFO                 13331
##  8 FLL                 12055
##  9 MIA                 11728
## 10 DCA                  9705

Question 2

Which airlines have the most flights departing from NYC airports in 2013? Make a table that lists these in descending order of frequency and shows the number of flights for each airline. In your narrative mention the names of the airlines as well. Hint: You can use the airlines dataset to look up the airline name based on carrier code.

inner_join(flights, airlines) %>% 
  count(name, name = "number_flights_airline", sort = TRUE)

## Joining, by = "carrier"

## # A tibble: 16 x 2
##    name                        number_flights_airline
##    <chr>                                        <int>
##  1 United Air Lines Inc.                        58665
##  2 JetBlue Airways                              54635
##  3 ExpressJet Airlines Inc.                     54173
##  4 Delta Air Lines Inc.                         48110
##  5 American Airlines Inc.                       32729
##  6 Envoy Air                                    26397
##  7 US Airways Inc.                              20536
##  8 Endeavor Air Inc.                            18460
##  9 Southwest Airlines Co.                       12275
## 10 Virgin America                                5162
## 11 AirTran Airways Corporation                   3260
## 12 Alaska Airlines Inc.                           714
## 13 Frontier Airlines Inc.                         685
## 14 Mesa Airlines Inc.                             601
## 15 Hawaiian Airlines Inc.                         342
## 16 SkyWest Airlines Inc.                           32

United Air Lines Inc. had the most flights departing from NYC airports in 2013 (58,665), followed by JetBlue Airways (54,635), ExpressJet Airlines Inc. (54,173), Delta Air Lines Inc. (48,110), and American Airlines Inc. (32,729).

Question 3

Consider only flights that have non-missing arrival delay information. Your answer should include the name of the carrier in addition to the carrier code and the values asked. a. Which carrier had the highest mean arrival delay? b. Which carrier had the lowest mean arrival delay?

# First check whether arr_delay actually contains missing values:
# flights %>% arrange(desc(is.na(arr_delay)))
# na.rm = TRUE drops flights with missing arrival delay from each carrier's mean
flights %>%
  group_by(carrier) %>%
  summarise(delay = mean(arr_delay, na.rm = TRUE)) %>%
  inner_join(airlines) %>%
  arrange(desc(delay)) %>%
  slice_head(n = 1)

## `summarise()` ungrouping output (override with `.groups` argument)

## Joining, by = "carrier"

## # A tibble: 1 x 3
##   carrier delay name                  
##   <chr>   <dbl> <chr>                 
## 1 F9       21.9 Frontier Airlines Inc.
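
Part b asks for the carrier with the lowest mean arrival delay as well. A minimal sketch of how both extremes could be returned in one table, using `slice_max()` and `slice_min()` from dplyr. The `toy_flights` and `toy_airlines` tables are made-up stand-ins (an assumption for illustration), not the nycflights13 data, so the values are not the exam answer.

```r
library(dplyr)

# Toy stand-ins for flights and airlines (made-up values, not nycflights13 data)
toy_flights <- tibble(
  carrier   = c("AA", "AA", "BB", "BB", "CC", "CC"),
  arr_delay = c(10, 20, -5, NA, 30, 40)
)
toy_airlines <- tibble(
  carrier = c("AA", "BB", "CC"),
  name    = c("Airline A", "Airline B", "Airline C")
)

mean_delays <- toy_flights %>%
  filter(!is.na(arr_delay)) %>%   # keep only flights with arrival delay information
  group_by(carrier) %>%
  summarise(delay = mean(arr_delay), .groups = "drop") %>%
  inner_join(toy_airlines, by = "carrier")

# Stack the carrier with the highest mean delay and the one with the lowest
extremes <- bind_rows(
  highest = slice_max(mean_delays, delay, n = 1),
  lowest  = slice_min(mean_delays, delay, n = 1),
  .id = "which"
)
extremes
```

Swapping `toy_flights` and `toy_airlines` for `flights` and `airlines` should give the answers to parts a and b in a single table.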

Question 4

What was the mean temperature at the origin airport on the day with the highest departure delay? Your answer should include the name of origin airport, the date with the highest departure delay, and the mean temperature on that day.

worst_day <- flights %>% 
  group_by(year, month, day) %>%
  summarise(max_dep_delay = max(dep_delay, na.rm = TRUE)) %>% 
  arrange(desc(max_dep_delay)) %>% 
  ungroup() %>% 
  slice_head(n = 1)

## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)

# semi_join() keeps only the rows of x (weather) that have a match in y (worst_day)
weather_most_delayed <-
  semi_join(weather, worst_day,
            by = c("year", "month", "day")) %>% 
  group_by(origin) %>% 
  summarise(mean_temp = mean(temp, na.rm = TRUE))

## `summarise()` ungrouping output (override with `.groups` argument)

weather_most_delayed

## # A tibble: 3 x 2
##   origin mean_temp
##   <chr>      <dbl>
## 1 EWR         42.1
## 2 JFK         42.7
## 3 LGA         44.9
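
The question also asks for the origin airport and the date of the worst delay, while the table above reports all three origins. A hedged sketch of one way to keep the origin and date of the most delayed flight and restrict the weather to that airport; `toy_flights` and `toy_weather` are invented tables (assumptions for illustration) that only mimic the column names of `flights` and `weather`.

```r
library(dplyr)

# Invented data that mimics the relevant columns of flights and weather
toy_flights <- tibble(
  year = 2013, month = c(1, 1, 2), day = c(9, 9, 3),
  origin = c("JFK", "LGA", "EWR"),
  dep_delay = c(500, 20, NA)
)
toy_weather <- tibble(
  year = 2013, month = c(1, 1, 1, 2), day = c(9, 9, 9, 3),
  origin = c("JFK", "JFK", "LGA", "EWR"),
  temp = c(40, 44, 50, 30)
)

# The single flight with the largest departure delay, keeping its origin and date
worst_flight <- toy_flights %>%
  slice_max(dep_delay, n = 1, with_ties = FALSE) %>%
  select(year, month, day, origin, dep_delay)

# Mean temperature at that origin on that date
worst_day_temp <- worst_flight %>%
  inner_join(toy_weather, by = c("year", "month", "day", "origin")) %>%
  group_by(origin, year, month, day) %>%
  summarise(mean_temp = mean(temp, na.rm = TRUE), .groups = "drop")
worst_day_temp
```

Joining on `origin` in addition to the date is the design choice here: it yields one row with the airport, the date, and the mean temperature, which is the shape the question asks for.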

Question 5

Consider breaking the day into four time intervals: 12:01am-6am, 6:01am-12pm, 12:01pm-6pm, 6:01pm-12am. a. Calculate the proportion of flights that are delayed at departure at each of these time intervals. b. Comment on how the likelihood of being delayed changes throughout the day.

I will consider a flight delayed if it departs more than 2 hours late.

# a
flights_intervals <- flights %>%
  mutate(
    interval = case_when(
      dep_time >= 1 & dep_time <= 600 ~ "first_interval",
      dep_time >= 601 & dep_time <= 1200 ~ "second_interval",
      dep_time >= 1201 & dep_time <= 1800 ~ "third_interval",
      dep_time >= 1801 & dep_time <= 2400 ~ "fourth_interval"
    ) %>%
      factor(levels = c("first_interval", "second_interval", "third_interval", "fourth_interval"))
  )

#checking that I did this correctly 
#flights_intervals %>% count(interval)
#flights %>% ggplot(aes(dep_time))+geom_histogram()

flights_intervals %>% 
  group_by(interval) %>% 
  count(delayed_more_2h = dep_delay > 120) %>%
  add_count(interval, wt = n, name = "flights_interval") %>% 
  mutate(prop = n / flights_interval * 100) %>% 
  drop_na()

## # A tibble: 8 x 5
## # Groups:   interval [4]
##   interval        delayed_more_2h      n flights_interval   prop
##   <fct>           <lgl>            <int>            <int>  <dbl>
## 1 first_interval  FALSE             8672             9344 92.8  
## 2 first_interval  TRUE               672             9344  7.19 
## 3 second_interval FALSE           121395           122082 99.4  
## 4 second_interval TRUE               687           122082  0.563
## 5 third_interval  FALSE           118562           120738 98.2  
## 6 third_interval  TRUE              2176           120738  1.80 
## 7 fourth_interval FALSE            70169            76357 91.9  
## 8 fourth_interval TRUE              6188            76357  8.10

# b: flag each flight as delayed (1) or not (0)
flights_delayed <- flights %>% 
  # treat cancelled flights (missing dep_delay) and delays of 2+ hours as one level
  mutate(delay = ifelse(dep_delay >= 120 | is.na(dep_delay), 1, 0))

# only about 5% of flights are delayed by 2+ hours or cancelled
flights_delayed %>% count(delay) %>% mutate(prop = n / sum(n))

## # A tibble: 2 x 3
##   delay      n   prop
##   <dbl>  <int>  <dbl>
## 1     0 318633 0.946 
## 2     1  18143 0.0539

flights_delayed %>% 
  filter(delay == 1) %>% 
  group_by(hour) %>% 
  summarize(n_delays = n()) %>%
  ggplot(aes(x = hour, y = n_delays)) +
  geom_point() +
  geom_line(col = "blue")

## `summarise()` ungrouping output (override with `.groups` argument)

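Most of the delays (here, flights delayed by at least 2 hours or cancelled) occur around 7 pm. The table in part a points the same way: the share of flights delayed by more than 2 hours is lowest between 6:01am and 12pm (0.56%), rises between 12:01pm and 6pm (1.80%), and is highest between 6:01pm and 12am (8.10%) and between 12:01am and 6am (7.19%).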
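
The proportion of delayed flights per interval can also be computed directly as the mean of a logical condition. The following is a small sketch under stated assumptions: `toy_flights` is invented data, the interval labels are my own, and `cut()` with right-closed breaks is one possible way to reproduce the four intervals from the question.

```r
library(dplyr)

# Invented departure times (HHMM) and delays in minutes
toy_flights <- tibble(
  dep_time  = c(30, 545, 700, 1130, 1300, 1745, 1900, 2359),
  dep_delay = c(150, 10, 0, 5, 130, 20, 200, 180)
)

delay_by_interval <- toy_flights %>%
  mutate(interval = cut(
    dep_time,
    breaks = c(0, 600, 1200, 1800, 2400),   # (0,600], (600,1200], (1200,1800], (1800,2400]
    labels = c("12:01am-6am", "6:01am-12pm", "12:01pm-6pm", "6:01pm-12am")
  )) %>%
  group_by(interval) %>%
  # mean() of a logical vector is the proportion of TRUE values
  summarise(prop_delayed = mean(dep_delay > 120, na.rm = TRUE), .groups = "drop")
delay_by_interval
```

This gives one row per interval, which makes the change in likelihood over the day easier to read than separate TRUE and FALSE rows.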

Question 6

Find the flight with the longest air time. a. How long is this flight? b. What city did it fly to? c. How many seats does the plane that flew this flight have?

# Take the longest flight (by distance) once and read its fields from that row
longest_flight <- flights %>% 
  arrange(desc(distance)) %>% 
  slice_head(n = 1)
longest_distance <- longest_flight %>% pull(distance)
longest_destination <- longest_flight %>% pull(dest)
longest_origin <- longest_flight %>% pull(origin)
# Look up the seat count for tail number N380HA in planes
longest_seats <- planes %>% filter(tailnum == "N380HA") %>% pull(seats)

glue('The longest flight is {longest_origin} to {longest_destination}, which is {longest_distance} miles and has {longest_seats} seats.')

## The longest flight is JFK to HNL, which is 4983 miles and has 377 seats.
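
The question asks about air time, while the code above ranks flights by distance; the two need not pick the same flight. A minimal sketch of ranking by `air_time` instead and looking up the seats and destination through joins. All three `toy_*` tables are invented for illustration (an assumption), with column names borrowed from `flights`, `planes`, and `airports`.

```r
library(dplyr)

# Invented stand-ins for flights, planes, and airports
toy_flights <- tibble(
  tailnum  = c("N1", "N2", "N3"),
  dest     = c("AAA", "BBB", "CCC"),
  distance = c(4983, 4963, 2500),
  air_time = c(600, 650, NA)
)
toy_planes <- tibble(tailnum = c("N1", "N2"), seats = c(300, 250))
toy_airports <- tibble(faa = c("AAA", "BBB"), name = c("Airport A", "Airport B"))

longest_by_air_time <- toy_flights %>%
  slice_max(air_time, n = 1, with_ties = FALSE) %>%   # flight with the longest air time
  left_join(toy_planes, by = "tailnum") %>%           # seats of the plane that flew it
  left_join(toy_airports, by = c("dest" = "faa"))     # name of the destination airport
longest_by_air_time
```

In this toy data the longest-distance flight (N1) and the longest-air-time flight (N2) differ, which is why the choice of ranking variable matters.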

Question 7

Question 7 (15 pts) - The airports data frame contains information on a large number of primarily American airports. These data include location information for these airports in the form of latitude and longitude coordinates. In this question we limit our focus to the Contiguous United States. Visualize and describe the distribution of the longitudes of airports in the Contiguous United States. What does this tell you about the geographical distribution of these airports? Hint: You will first need to limit your analysis to the Contiguous United States. This Wikipedia article can help, but you’re welcome to use other resources as well. Make sure to cite whatever resource you use.

I use the tzone information in the airports data and this Wikipedia article on time zone names.

flights_latlon <- flights %>%
    inner_join(select(airports, origin = faa, origin_lat = lat, origin_lon = lon),
               by = "origin"
    ) %>%
    right_join(select(airports, dest = faa, dest_lat = lat, dest_lon = lon,tzone),
               by = "dest"
    )

# Checking that I am looking within the Contiguous United States
flights_latlon %>%
  filter(!tzone %in% c("Pacific/Honolulu", "America/Anchorage")) %>% 
  ggplot(aes(
    x = origin_lon, xend = dest_lon,
    y = origin_lat, yend = dest_lat
  )) +
  borders("state") +
  geom_segment(arrow = arrow(length = unit(0.1, "cm"))) +
  coord_quickmap() +
  labs(y = "Latitude", x = "Longitude")

## Warning: Removed 1102 rows containing missing values (geom_segment).

flights_latlon %>% 
    filter(!tzone%in% c("Pacific/Honolulu","America/Anchorage"))  %>% 
    select(dest_lon) %>% 
    ggplot()+
    geom_freqpoly(aes(dest_lon), binwidth=1)

The frequency polygon of destination longitudes in the Contiguous United States suggests that the middle states have fewer airports than the coasts.
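
Because `flights_latlon` has one row per flight, the plot above weights each airport by its number of flights. A hedged sketch of the same idea with one row per airport, so that each airport counts once; `toy_airports` is invented data (an assumption) with the `faa`, `lon`, and `tzone` columns of `airports`, and the bin width of 5 degrees is an arbitrary choice.

```r
library(dplyr)
library(ggplot2)

# Invented airports: one row per airport, as in the airports data frame
toy_airports <- tibble(
  faa   = c("A1", "A2", "A3", "A4", "A5"),
  lon   = c(-122.4, -118.2, -95.3, -74.0, -157.9),
  tzone = c("America/Los_Angeles", "America/Los_Angeles", "America/Chicago",
            "America/New_York", "Pacific/Honolulu")
)

# Drop Hawaii and Alaska by time zone, as in the filter used above
contiguous <- toy_airports %>%
  filter(!tzone %in% c("Pacific/Honolulu", "America/Anchorage"))

# Distribution of airport longitudes, each airport counted once
lon_plot <- ggplot(contiguous, aes(x = lon)) +
  geom_histogram(binwidth = 5) +
  labs(x = "Longitude", y = "Number of airports")
lon_plot
```

With the real data, replacing `toy_airports` by `airports` would show the distribution of airports rather than of flights.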

Question 8

Recreate the plot included below using the flights data. Once you have created the visualization, in no more than one paragraph, describe what you think the point of this visualization might be. Hint: The visualization uses the variable arrival, which is not included in the flights data frame. You will have to create arrival yourself, it is a categorical variable that is equal to “ontime” when arr_delay <= 0 and “delayed” when arr_delay > 0.

flights %>% 
  filter(dest %in% c("PHL", "RDU")) %>% 
  filter(month == 12) %>% 
  mutate(
    Arrival = case_when(
      arr_delay <= 0 ~ "ontime",
      arr_delay > 0 ~ "delayed"
    )
  ) %>% 
  filter(!is.na(Arrival)) %>% 
  ggplot(aes(x = Arrival, y = dep_delay, color = dest)) +
  geom_boxplot() +
  facet_grid(dest ~ origin) +
  labs(
    title = "On time performance of NYC flights",
    subtitle = "December 2013",
    y = "Departure delay", 
    color = "Destination"
  ) +
  coord_cartesian(ylim = c(0, 200))

Extra Credit

[Enter code and narrative here.]