Assignment 7

Go to the shared posit.cloud workspace for this class and open the assign07 project. Open the assign07.qmd file and complete the exercises.

This is somewhat open-ended assignment.

The file domestic_flights_jan_2016.csv is nearly the same as the me_flights.csv file we work with in the temporal data chapter except it has two additional columns for destination city and state and it is for all domestic flights reported for on-time performance in the US for January 2016. You’ll find it helpful to recreate all of the calculated fields we create in the unit.
You are to write a report uploaded to RPubs that compares what you consider interesting metrics for a select group of carriers, airports, or states. You may not parrot the queries from the text or the practice questions. Your report must contain at least two questions that you ask about the flights data. Your answers to those questions must also contain a visualization of the data or a table along with a specific answer in the narrative.


The csv file contains 445,827 observations. You’ll want to subset the data to the area(s) you are looking at, then write it out to a csv file using write_csv(), and start your assignment by importing that csv instead. Do this in a separate script file that you don’t need to submit. In your narrative, describe your subset. I don’t need to see how you subsetted the data because it might cause performance issues when you render the document. Note: you will receive deductions for not using tidyverse syntax in this assignment. That includes the use of filter, mutate, and the up-to-date pipe operator |>.

The Grading Rubric is available at the end of this document.

This is your work area. Add as many code cells as you need.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(skimr)
library(gt)
flights <- read_csv("https://jsuleiman.com/datasets/domestic_flights_jan_2016.csv")
Rows: 445827 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (9): FlightDate, Carrier, TailNum, Origin, OriginCityName, OriginState,...
dbl (12): FlightNum, CRSDepTime, DepTime, WheelsOff, WheelsOn, CRSArrTime, A...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(flights)
Rows: 445,827
Columns: 21
$ FlightDate        <chr> "1/6/2016", "1/7/2016", "1/8/2016", "1/9/2016", "1/1…
$ Carrier           <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA"…
$ TailNum           <chr> "N4YBAA", "N434AA", "N541AA", "N489AA", "N439AA", "N…
$ FlightNum         <dbl> 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, …
$ Origin            <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "DF…
$ OriginCityName    <chr> "Dallas/Fort Worth, TX", "Dallas/Fort Worth, TX", "D…
$ OriginState       <chr> "TX", "TX", "TX", "TX", "TX", "TX", "TX", "TX", "TX"…
$ Dest              <chr> "DTW", "DTW", "DTW", "DTW", "DTW", "DTW", "DTW", "DT…
$ DestCityName      <chr> "Detroit, MI", "Detroit, MI", "Detroit, MI", "Detroi…
$ DestState         <chr> "MI", "MI", "MI", "MI", "MI", "MI", "MI", "MI", "MI"…
$ CRSDepTime        <dbl> 1100, 1100, 1100, 1100, 1100, 1100, 1100, 1100, 1100…
$ DepTime           <dbl> 1057, 1056, 1055, 1102, 1240, 1107, 1059, 1055, 1058…
$ WheelsOff         <dbl> 1112, 1110, 1116, 1115, 1300, 1118, 1113, 1107, 1110…
$ WheelsOn          <dbl> 1424, 1416, 1431, 1424, 1617, 1426, 1429, 1419, 1420…
$ CRSArrTime        <dbl> 1438, 1438, 1438, 1438, 1438, 1438, 1438, 1438, 1438…
$ ArrTime           <dbl> 1432, 1426, 1445, 1433, 1631, 1435, 1438, 1431, 1428…
$ Cancelled         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Diverted          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ CRSElapsedTime    <dbl> 158, 158, 158, 158, 158, 158, 158, 158, 158, 158, 15…
$ ActualElapsedTime <dbl> 155, 150, 170, 151, 171, 148, 159, 156, 150, 158, 14…
$ Distance          <dbl> 986, 986, 986, 986, 986, 986, 986, 986, 986, 986, 98…
flights |> 
  skim()
Data summary
Name flights
Number of rows 445827
Number of columns 21
_______________________
Column type frequency:
character 9
numeric 12
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
FlightDate 0 1.00 8 9 0 31 0
Carrier 0 1.00 2 2 0 12 0
TailNum 4244 0.99 5 6 0 4238 0
Origin 0 1.00 3 3 0 294 0
OriginCityName 0 1.00 8 34 0 290 0
OriginState 0 1.00 2 2 0 52 0
Dest 0 1.00 3 3 0 294 0
DestCityName 0 1.00 8 34 0 290 0
DestState 0 1.00 2 2 0 52 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
FlightNum 0 1.00 2078.86 1757.27 1 702 1594 2763 7438 ▇▅▁▂▁
CRSDepTime 0 1.00 1330.38 482.81 1 920 1325 1730 2359 ▁▇▇▇▅
DepTime 11473 0.97 1334.24 492.96 1 924 1331 1737 2400 ▁▇▇▇▅
WheelsOff 11600 0.97 1356.55 493.87 1 939 1344 1750 2400 ▁▇▇▇▅
WheelsOn 11907 0.97 1483.30 514.35 1 1104 1519 1914 2400 ▁▅▇▇▇
CRSArrTime 0 1.00 1502.95 505.24 1 1118 1527 1920 2359 ▁▃▇▇▇
ArrTime 11907 0.97 1488.10 518.68 1 1108 1522 1919 2400 ▁▅▇▇▇
Cancelled 0 1.00 0.03 0.16 0 0 0 0 1 ▇▁▁▁▁
Diverted 0 1.00 0.00 0.04 0 0 0 0 1 ▇▁▁▁▁
CRSElapsedTime 0 1.00 146.50 76.61 21 90 128 180 705 ▇▃▁▁▁
ActualElapsedTime 12529 0.97 140.14 74.75 15 85 122 173 721 ▇▃▁▁▁
Distance 0 1.00 844.23 610.35 31 391 679 1086 4983 ▇▂▁▁▁
flights <- flights |>
  mutate(FlightDate = as.Date(FlightDate, format = "%m/%d/%Y"))
flights |> 
  select(FlightDate) |> 
  head()
# A tibble: 6 × 1
  FlightDate
  <date>    
1 2016-01-06
2 2016-01-07
3 2016-01-08
4 2016-01-09
5 2016-01-10
6 2016-01-11
flights |> 
  filter(Cancelled == 0) |> 
  summarize(maxcrsdep = max(CRSDepTime), maxdep = max(DepTime))
# A tibble: 1 × 2
  maxcrsdep maxdep
      <dbl>  <dbl>
1      2359   2400
flights <- flights |> 
  mutate(new_CRSDepTime = paste(FlightDate, sprintf("%04d", CRSDepTime))) |> 
  mutate(new_CRSDepTime = as.POSIXct(new_CRSDepTime, format = "%Y-%m-%d %H%M"))
flights |> 
  select(new_CRSDepTime, FlightDate, CRSDepTime) |>
  head()
# A tibble: 6 × 3
  new_CRSDepTime      FlightDate CRSDepTime
  <dttm>              <date>          <dbl>
1 2016-01-06 11:00:00 2016-01-06       1100
2 2016-01-07 11:00:00 2016-01-07       1100
3 2016-01-08 11:00:00 2016-01-08       1100
4 2016-01-09 11:00:00 2016-01-09       1100
5 2016-01-10 11:00:00 2016-01-10       1100
6 2016-01-11 11:00:00 2016-01-11       1100
flights <- flights |> 
  mutate(
    new_CRSArrTime = paste(FlightDate, sprintf("%04d", CRSArrTime)),
    new_CRSArrTime = as.POSIXct(new_CRSArrTime, format="%Y-%m-%d %H%M")
  )
flights <- flights |> 
  filter(Cancelled == 0) |> 
  mutate(
    new_DepTime = paste(FlightDate, sprintf("%04d", DepTime)), 
    new_WheelsOff = paste(FlightDate, sprintf("%04d", WheelsOff)),
    new_WheelsOn = paste(FlightDate, sprintf("%04d", WheelsOn)), 
    new_ArrTime = paste(FlightDate, sprintf("%04d", ArrTime))
    )  |> 
  mutate(
    new_DepTime = as.POSIXct(new_DepTime, format="%Y-%m-%d %H%M"),
    new_WheelsOff = as.POSIXct(new_WheelsOff, format="%Y-%m-%d %H%M"),
    new_WheelsOn = as.POSIXct(new_WheelsOn, format="%Y-%m-%d %H%M"),
    new_ArrTime = as.POSIXct(new_ArrTime, format="%Y-%m-%d %H%M")
  )
flights |> 
  select(new_CRSArrTime, new_DepTime, new_WheelsOff, new_WheelsOn, new_ArrTime) |> 
  head() |> 
  gt()
new_CRSArrTime new_DepTime new_WheelsOff new_WheelsOn new_ArrTime
2016-01-06 14:38:00 2016-01-06 10:57:00 2016-01-06 11:12:00 2016-01-06 14:24:00 2016-01-06 14:32:00
2016-01-07 14:38:00 2016-01-07 10:56:00 2016-01-07 11:10:00 2016-01-07 14:16:00 2016-01-07 14:26:00
2016-01-08 14:38:00 2016-01-08 10:55:00 2016-01-08 11:16:00 2016-01-08 14:31:00 2016-01-08 14:45:00
2016-01-09 14:38:00 2016-01-09 11:02:00 2016-01-09 11:15:00 2016-01-09 14:24:00 2016-01-09 14:33:00
2016-01-10 14:38:00 2016-01-10 12:40:00 2016-01-10 13:00:00 2016-01-10 16:17:00 2016-01-10 16:31:00
2016-01-11 14:38:00 2016-01-11 11:07:00 2016-01-11 11:18:00 2016-01-11 14:26:00 2016-01-11 14:35:00
flights <- flights |>
  mutate(DepDelay = as.integer(difftime(new_DepTime, new_CRSDepTime, units = "mins")))
flights |> 
  select(CRSDepTime, DepTime, DepDelay) |> 
  head()
# A tibble: 6 × 3
  CRSDepTime DepTime DepDelay
       <dbl>   <dbl>    <int>
1       1100    1057       -3
2       1100    1056       -4
3       1100    1055       -5
4       1100    1102        2
5       1100    1240      100
6       1100    1107        7
flights <- flights |> 
  filter(Cancelled == 0) |> 
  mutate(
    DepDelayMinutes = ifelse(DepDelay < 0, 0, DepDelay), 
    DepDel15 = ifelse(DepDelay >= 15, 1, 0)
    )
flights |> 
  select(DepDelay, DepDelayMinutes, DepDel15) |> 
  head()
# A tibble: 6 × 3
  DepDelay DepDelayMinutes DepDel15
     <int>           <dbl>    <dbl>
1       -3               0        0
2       -4               0        0
3       -5               0        0
4        2               2        0
5      100             100        1
6        7               7        0
flights <- flights |> 
  mutate(TaxiOut = as.integer(difftime(new_WheelsOff, new_DepTime, units = "mins")),
         TaxiIn = as.integer(difftime(new_ArrTime, new_WheelsOn, units = "mins")),
         ArrDelay = as.integer(difftime(new_ArrTime, new_CRSArrTime, units = "mins")),
         ArrDelayMinutes = ifelse(ArrDelay < 0, 0, ArrDelay), 
         ArrDel15 = ifelse(ArrDelay >= 15, 1, 0),
         FlightTimeBuffer = CRSElapsedTime - ActualElapsedTime,
         AirTime = ActualElapsedTime - TaxiOut - TaxiIn,
         AirSpeed = Distance / (AirTime / 60)
)
flights |> 
  head() |> 
  gt()
FlightDate Carrier TailNum FlightNum Origin OriginCityName OriginState Dest DestCityName DestState CRSDepTime DepTime WheelsOff WheelsOn CRSArrTime ArrTime Cancelled Diverted CRSElapsedTime ActualElapsedTime Distance new_CRSDepTime new_CRSArrTime new_DepTime new_WheelsOff new_WheelsOn new_ArrTime DepDelay DepDelayMinutes DepDel15 TaxiOut TaxiIn ArrDelay ArrDelayMinutes ArrDel15 FlightTimeBuffer AirTime AirSpeed
2016-01-06 AA N4YBAA 43 DFW Dallas/Fort Worth, TX TX DTW Detroit, MI MI 1100 1057 1112 1424 1438 1432 0 0 158 155 986 2016-01-06 11:00:00 2016-01-06 14:38:00 2016-01-06 10:57:00 2016-01-06 11:12:00 2016-01-06 14:24:00 2016-01-06 14:32:00 -3 0 0 15 8 -6 0 0 3 132 448.1818
2016-01-07 AA N434AA 43 DFW Dallas/Fort Worth, TX TX DTW Detroit, MI MI 1100 1056 1110 1416 1438 1426 0 0 158 150 986 2016-01-07 11:00:00 2016-01-07 14:38:00 2016-01-07 10:56:00 2016-01-07 11:10:00 2016-01-07 14:16:00 2016-01-07 14:26:00 -4 0 0 14 10 -12 0 0 8 126 469.5238
2016-01-08 AA N541AA 43 DFW Dallas/Fort Worth, TX TX DTW Detroit, MI MI 1100 1055 1116 1431 1438 1445 0 0 158 170 986 2016-01-08 11:00:00 2016-01-08 14:38:00 2016-01-08 10:55:00 2016-01-08 11:16:00 2016-01-08 14:31:00 2016-01-08 14:45:00 -5 0 0 21 14 7 7 0 -12 135 438.2222
2016-01-09 AA N489AA 43 DFW Dallas/Fort Worth, TX TX DTW Detroit, MI MI 1100 1102 1115 1424 1438 1433 0 0 158 151 986 2016-01-09 11:00:00 2016-01-09 14:38:00 2016-01-09 11:02:00 2016-01-09 11:15:00 2016-01-09 14:24:00 2016-01-09 14:33:00 2 2 0 13 9 -5 0 0 7 129 458.6047
2016-01-10 AA N439AA 43 DFW Dallas/Fort Worth, TX TX DTW Detroit, MI MI 1100 1240 1300 1617 1438 1631 0 0 158 171 986 2016-01-10 11:00:00 2016-01-10 14:38:00 2016-01-10 12:40:00 2016-01-10 13:00:00 2016-01-10 16:17:00 2016-01-10 16:31:00 100 100 1 20 14 113 113 1 -13 137 431.8248
2016-01-11 AA N468AA 43 DFW Dallas/Fort Worth, TX TX DTW Detroit, MI MI 1100 1107 1118 1426 1438 1435 0 0 158 148 986 2016-01-11 11:00:00 2016-01-11 14:38:00 2016-01-11 11:07:00 2016-01-11 11:18:00 2016-01-11 14:26:00 2016-01-11 14:35:00 7 7 0 11 9 -3 0 0 10 128 462.1875
flights |> 
  group_by(day = wday(FlightDate, label=TRUE)) |> 
  mutate(delayed = ifelse(DepDelay > 0, 1, 0)) |> 
  summarize(perc_delay = sum(delayed) / n()) |> 
  ggplot(aes(x = day, y = perc_delay)) + 
  geom_col() +
  theme_minimal()

flights |>  
  mutate(
    delayed = ifelse(DepDelay > 0, 1, 0), 
    time_part = ifelse(CRSDepTime > 1200, "Afternoon/Evening", "Morning")) |> 
  group_by(time_part) |> 
  summarize(perc_delay = sum(delayed) / n()) |> 
  gt() |> 
  fmt_percent(
    columns = perc_delay,
    decimals = 1
  )
time_part perc_delay
Afternoon/Evening 39.7%
Morning 26.2%
flights |> 
  mutate(day = wday(FlightDate, label=TRUE),
         delayed = ifelse(DepDelay > 0, 1, 0), ) |> 
  group_by(weekend = ifelse(day %in% c("Sat", "Sun"), 1, 0)) |> 
  summarize(perc_delay = sum(delayed) / n()) |> 
  gt() |> 
  fmt_percent(
    columns = perc_delay,
    decimals = 1
  )
weekend perc_delay
0 33.5%
1 35.3%
airport_delay <- flights %>%
  filter(Cancelled == 0) %>%
  group_by(Origin) %>%
  summarize(avg_dep_delay = mean(DepDelay, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(avg_dep_delay))

# Visualize the average departure delay by airport
ggplot(airport_delay, aes(x = reorder(Origin, -avg_dep_delay), y = avg_dep_delay)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(
    title = "Average Departure Delays by Airport (January 2016)",
    x = "Airport",
    y = "Average Departure Delay (minutes)"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Calculate average departure delays by airline carrier
carrier_delay <- flights %>%
  filter(Cancelled == 0) %>%
  group_by(Carrier) %>%
  summarize(
    avg_dep_delay = mean(DepDelay, na.rm = TRUE),
    avg_arr_delay = mean(ArrDelay, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_dep_delay))

# Visualize average departure and arrival delays by carrier
ggplot(carrier_delay, aes(x = reorder(Carrier, -avg_dep_delay), y = avg_dep_delay, fill = Carrier)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Average Departure Delays by Airline Carrier (January 2016)",
    x = "Airline Carrier",
    y = "Average Departure Delay (minutes)"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

From the first graph: Highest average departure delays. From the analysis, we find that airports in larger metropolitan areas tend to have longer departure delays. This is likely due to higher traffic volumes and congestion. Specifically, airports like Newark Liberty International (EWR) and JFK International (JFK) in New York, and Chicago O’Hare (ORD) showed the highest average departure delays.Submission

The second graph: Which carriers tend to have the longer delays. The analysis reveals that certain carriers tend to have longer delays than others. Airlines such as Envoy Air and Piedmont Airlines experienced longer average delays, likely due to operational challenges. Meanwhile, larger, more established carriers like American Airlines (AA) and Delta Air Lines (DL) had more consistent on-time performance.To submit your assignment:

1 Grading Rubric

Item
(percent overall)
100% - flawless 67% - minor issues 33% - moderate issues 0% - major issues or not attempted
Question 1 query.
(22%)
Relevant question that is fully answered in the query or queries.
Question 1 visualization or table.
(15%)
Visually pleasant and relevant to the question.
Question 2 query.
(22%)
Relevant question that is fully answered in the query or queries.
Question 2 visualization or table.
(15%)
Visually pleasant and relevant to the question.
Data was subsetted separately from the assignment.
(10%)
You included the description of your subsetted data in your narrative. You subsetted the data but didn’t include the description in the narrative. NA You didn’t subset the data.
Messages and/or errors suppressed from rendered document and all code is shown.
(8%)
Submitted properly to Brightspace
(8%)
NA NA You must submit according to instructions to receive any credit for this portion.