Assignment 7

Author

Theresa Anderson

Open the assign07.qmd file and complete the exercises.

This is somewhat open-ended assignment.

The file domestic_flights_jan_2016.csv is nearly the same as the me_flights.csv file we work with in the temporal data chapter except it has two additional columns for destination city and state and it is for all domestic flights reported for on-time performance in the US for January 2016. You’ll find it helpful to recreate all of the calculated fields we create in the unit.
You are to write a report uploaded to RPubs that compares what you consider interesting metrics for a select group of carriers, airports, or states. You may not parrot the queries from the text or the practice questions. Your report must contain at least two questions that you ask about the flights data. Your answers to those questions must also contain a visualization of the data or a table along with a specific answer in the narrative.


The csv file contains 445,827 observations. You’ll want to subset the data to the area(s) you are looking at, then write it out to a csv file using write_csv(), and start your assignment by importing that csv instead. Do this in a separate script file that you don’t need to submit. In your narrative, describe your subset. I don’t need to see how you subsetted the data because it might cause performance issues when you render the document. Note: you will receive deductions for not using tidyverse syntax in this assignment. That includes the use of filter, mutate, and the up-to-date pipe operator |>.

The Grading Rubric is available at the end of this document.

This is your work area. Add as many code cells as you need.

I read the data in a separate script file and ran a count of the Origin airport and then filtered by the descending count to determine which airport had the most flights. I found that the Atlanta Airport (ATL) had the most data so I will be examining this data. My two questions are:

  1. Which airline has the most delays from the Atlanta Airport?

  2. What are the average delay from the Atlanta Airport for each airline?

library(tidyverse)
library(skimr)
library(gapminder)
library(gt)
library(gtExtras)
library(scales)

flights_ATL <- read_csv("flights_ATL.csv")
glimpse(flights_ATL)
Rows: 29,870
Columns: 21
$ FlightDate        <chr> "1/1/2016", "1/2/2016", "1/3/2016", "1/4/2016", "1/5…
$ Carrier           <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA"…
$ TailNum           <chr> "N029AA", "N005AA", "N014AA", "N018AA", "N030AA", "N…
$ FlightNum         <dbl> 139, 139, 139, 139, 139, 139, 139, 139, 139, 139, 13…
$ Origin            <chr> "ATL", "ATL", "ATL", "ATL", "ATL", "ATL", "ATL", "AT…
$ OriginCityName    <chr> "Atlanta, GA", "Atlanta, GA", "Atlanta, GA", "Atlant…
$ OriginState       <chr> "GA", "GA", "GA", "GA", "GA", "GA", "GA", "GA", "GA"…
$ Dest              <chr> "MIA", "MIA", "MIA", "MIA", "MIA", "MIA", "MIA", "MI…
$ DestCityName      <chr> "Miami, FL", "Miami, FL", "Miami, FL", "Miami, FL", …
$ DestState         <chr> "FL", "FL", "FL", "FL", "FL", "FL", "FL", "FL", "FL"…
$ CRSDepTime        <dbl> 810, 810, 810, 810, 810, 810, 810, 810, 810, 810, 81…
$ DepTime           <dbl> 808, 812, 807, 825, 805, 806, 806, 818, 823, 1033, 8…
$ WheelsOff         <dbl> 820, 827, 819, 838, 819, 828, 822, 834, 843, 1152, 8…
$ WheelsOn          <dbl> 944, 952, 1000, 1004, 952, 959, 949, 1005, 1017, 131…
$ CRSArrTime        <dbl> 1016, 1016, 1016, 1016, 1010, 1010, 1010, 1010, 1010…
$ ArrTime           <dbl> 1001, 1002, 1006, 1010, 1033, 1035, 1010, 1019, 1027…
$ Cancelled         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Diverted          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ CRSElapsedTime    <dbl> 126, 126, 126, 126, 120, 120, 120, 120, 120, 120, 12…
$ ActualElapsedTime <dbl> 113, 110, 119, 105, 148, 149, 124, 121, 124, 194, 12…
$ Distance          <dbl> 594, 594, 594, 594, 594, 594, 594, 594, 594, 594, 59…
flights_ATL |>
  skim()
Data summary
Name flights_ATL
Number of rows 29870
Number of columns 21
_______________________
Column type frequency:
character 9
numeric 12
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
FlightDate 0 1 8 9 0 31 0
Carrier 0 1 2 2 0 9 0
TailNum 48 1 5 6 0 2171 0
Origin 0 1 3 3 0 1 0
OriginCityName 0 1 11 11 0 1 0
OriginState 0 1 2 2 0 1 0
Dest 0 1 3 3 0 158 0
DestCityName 0 1 9 34 0 154 0
DestState 0 1 2 2 0 49 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
FlightNum 0 1.00 2212.06 1526.16 1 1178 1786 2529 6822 ▆▇▁▂▁
CRSDepTime 0 1.00 1471.23 465.45 520 1050 1456 1918 2335 ▃▇▆▇▅
DepTime 405 0.99 1475.42 472.09 1 1050 1459 1921 2359 ▁▅▇▇▇
WheelsOff 410 0.99 1500.46 474.15 1 1105 1513 1936 2359 ▁▃▇▇▇
WheelsOn 420 0.99 1563.48 495.92 1 1151 1554 2008 2400 ▁▂▇▇▇
CRSArrTime 0 1.00 1575.22 496.50 2 1159 1601 2014 2359 ▁▂▇▇▇
ArrTime 420 0.99 1562.06 504.14 1 1153 1555 2009 2400 ▁▂▇▇▇
Cancelled 0 1.00 0.01 0.12 0 0 0 0 1 ▇▁▁▁▁
Diverted 0 1.00 0.00 0.04 0 0 0 0 1 ▇▁▁▁▁
CRSElapsedTime 0 1.00 122.21 63.77 37 82 107 138 616 ▇▁▁▁▁
ActualElapsedTime 457 0.98 115.39 61.22 27 78 102 128 638 ▇▁▁▁▁
Distance 0 1.00 639.52 458.06 83 357 547 731 4502 ▇▁▁▁▁
flights_ATL |>
  filter(if_any(everything(), is.na))
# A tibble: 457 × 21
   FlightDate Carrier TailNum FlightNum Origin OriginCityName OriginState Dest 
   <chr>      <chr>   <chr>       <dbl> <chr>  <chr>          <chr>       <chr>
 1 1/3/2016   AA      N029AA       1127 ATL    Atlanta, GA    GA          MIA  
 2 1/23/2016  AA      N003AA       1127 ATL    Atlanta, GA    GA          MIA  
 3 1/23/2016  AA      N4YUAA       1574 ATL    Atlanta, GA    GA          ORD  
 4 1/23/2016  AA      N488AA       1635 ATL    Atlanta, GA    GA          DFW  
 5 1/9/2016   AA      <NA>          488 ATL    Atlanta, GA    GA          PHX  
 6 1/23/2016  AA      <NA>          821 ATL    Atlanta, GA    GA          CLT  
 7 1/23/2016  AA      <NA>         1808 ATL    Atlanta, GA    GA          CLT  
 8 1/23/2016  AA      <NA>         2062 ATL    Atlanta, GA    GA          CLT  
 9 1/23/2016  AA      <NA>         2085 ATL    Atlanta, GA    GA          PHL  
10 1/22/2016  AA      <NA>          451 ATL    Atlanta, GA    GA          CLT  
# ℹ 447 more rows
# ℹ 13 more variables: DestCityName <chr>, DestState <chr>, CRSDepTime <dbl>,
#   DepTime <dbl>, WheelsOff <dbl>, WheelsOn <dbl>, CRSArrTime <dbl>,
#   ArrTime <dbl>, Cancelled <dbl>, Diverted <dbl>, CRSElapsedTime <dbl>,
#   ActualElapsedTime <dbl>, Distance <dbl>

It appears the flight data with NA values are primarily cancelled flights, we will filter these out of the data and begin the reformatting and calculation process of determining which airline has the most delays. Because I am focusing on flight delays I will focus on those calculations.

flights_ATL <- flights_ATL |>
  filter(Cancelled == 0) |>
  mutate(FlightDate = as.Date(FlightDate, format = "%m/%d/%Y")) |>
  mutate(new_CRSDepTime = paste(FlightDate, sprintf("%04d", CRSDepTime)),
         new_CRSDepTime = as.POSIXct(new_CRSDepTime, format = "%Y-%m-%d %H%M")) |>
  mutate(new_DepTime = paste(FlightDate, sprintf("%04d", DepTime)),
         new_DepTime = as.POSIXct(new_DepTime, format="%Y-%m-%d %H%M"))

flights_ATL |>
  select(FlightDate, new_CRSDepTime, new_DepTime) |>
  head() |>
  gt()
FlightDate new_CRSDepTime new_DepTime
2016-01-01 2016-01-01 08:10:00 2016-01-01 08:08:00
2016-01-02 2016-01-02 08:10:00 2016-01-02 08:12:00
2016-01-03 2016-01-03 08:10:00 2016-01-03 08:07:00
2016-01-04 2016-01-04 08:10:00 2016-01-04 08:25:00
2016-01-05 2016-01-05 08:10:00 2016-01-05 08:05:00
2016-01-12 2016-01-12 08:10:00 2016-01-12 08:06:00
flights_ATL <- flights_ATL |>
  mutate(DepDelay = as.integer(difftime(new_DepTime, new_CRSDepTime, units = "mins"))) |>
  mutate(DepDelayMinutes = ifelse(DepDelay < 0, 0, DepDelay),
         DepDel15 = ifelse(DepDelay >= 15, 1, 0))

flights_ATL |>
  select(DepDelay, DepDelayMinutes, DepDel15) |>
  head() |>
  gt()
DepDelay DepDelayMinutes DepDel15
-2 0 0
2 2 0
-3 0 0
15 15 1
-5 0 0
-4 0 0
flights_ATL |>
  group_by(Carrier) |>
  mutate(delayed = ifelse(DepDelay > 0, 1, 0)) |>
  summarize(perc_delay = sum(delayed) / n()) |>
  ggplot(aes(x = fct_reorder(Carrier, perc_delay, .desc = TRUE), y = perc_delay, fill = Carrier)) +
  geom_col() + 
  labs(title = "Departure Delay Frequency by Airline",
       subtitle = "Atlanta Airport January 2016",
       x = "Carrier",
       y = "Percent Delayed Flights") +
  scale_y_continuous(labels = label_percent()) +
  theme_minimal()

flights_ATL |>
  group_by(Carrier) |>
  summarize(avg_dep_delay = mean(DepDelay, na.rm = TRUE)) |>
  ggplot(aes(x = fct_reorder(Carrier, avg_dep_delay, .desc = TRUE), y = avg_dep_delay, 
             fill = Carrier)) +
  geom_col() + 
  labs(title = "Average Departure Delay by Airline",
       subtitle = "Atlanta Airport in January 2016",
       x = "Carrier",
       y = "Average Delay (Minutes)") +
  theme_minimal()

Report Summary

The three airlines that had the most frequent delays in January 2016 were NK, WN, and UA or Spirit Airlines, Southwest Airlines, and United Airlines respectively. Spirit Airlines had by far the most delays with over 50% of flights being delayed, and Southwest airlines having about 45% of flights delayed.

Looking at the average delay, again Spirit Airlines came in at the top with an average delay of about 18 minutes. Interestingly, SkyWest had the second longest average delay time, of about 11 minutes and Southwest Airlines was third at about 8.5 minutes of average delay.

While United Airlines came in second for most frequently delayed, their average delay is only about 4 minutes. Delta Airlines was right in the middle in terms of delay frequency but they experienced the shortest average delay of about 1 minute. This could be because Atlanta is a primary hub for Delta, so they have more plane departures than other airlines, providing them with a higher delay probability.

1 Submission

To submit your assignment:

  • Change the author name to your name in the YAML portion at the top of this document
  • Render your document to html and publish it to RPubs.
  • Submit the link to your Rpubs document in the Brightspace comments section for this assignment.
  • Click on the “Add a File” button and upload your .qmd file for this assignment to Brightspace.

2 Grading Rubric

Item
(percent overall)
100% - flawless 67% - minor issues 33% - moderate issues 0% - major issues or not attempted
Question 1 query.
(22%)
Relevant question that is fully answered in the query or queries.
Question 1 visualization or table.
(15%)
Visually pleasant and relevant to the question.
Question 2 query.
(22%)
Relevant question that is fully answered in the query or queries.
Question 2 visualization or table.
(15%)
Visually pleasant and relevant to the question.
Data was subsetted separately from the assignment.
(10%)
You included the description of your subsetted data in your narrative. You subsetted the data but didn’t include the description in the narrative. NA You didn’t subset the data.
Messages and/or errors suppressed from rendered document and all code is shown.
(8%)
Submitted properly to Brightspace
(8%)
NA NA You must submit according to instructions to receive any credit for this portion.