title: "Assignment 7"
author: "Constance Nahimana"
number-sections: true
format:
  html: default

Open the assign07.qmd file and complete the exercises.

This is a somewhat open-ended assignment.

The file domestic_flights_jan_2016.csv is nearly the same as the me_flights.csv file we work with in the temporal data chapter except it has two additional columns for destination city and state and it is for all domestic flights reported for on-time performance in the US for January 2016. You’ll find it helpful to recreate all of the calculated fields we create in the unit.
You are to write a report uploaded to RPubs that compares what you consider interesting metrics for a select group of carriers, airports, or states. You may not parrot the queries from the text or the practice questions. Your report must contain at least two questions that you ask about the flights data. Your answers to those questions must also contain a visualization of the data or a table along with a specific answer in the narrative.


The csv file contains 445,827 observations. You’ll want to subset the data to the area(s) you are looking at, then write it out to a csv file using write_csv(), and start your assignment by importing that csv instead. Do this in a separate script file that you don’t need to submit. In your narrative, describe your subset. I don’t need to see how you subsetted the data because it might cause performance issues when you render the document. Note: you will receive deductions for not using tidyverse syntax in this assignment. That includes the use of filter, mutate, and the up-to-date pipe operator |>.
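The separate subsetting script described above could look roughly like the following. This is a minimal sketch, not the script actually used: `subset_flights()` is a hypothetical helper, `OriginState` is an assumed column name that should be checked against the real file, and the tiny stand-in file exists only so the example runs on its own.

```r
library(dplyr)
library(readr)

# Hypothetical helper: keep flights that originate in the chosen states and
# save them as a smaller CSV. `OriginState` is an assumed column name.
subset_flights <- function(in_path, out_path, states) {
  read_csv(in_path, show_col_types = FALSE) |>
    filter(OriginState %in% states) |>
    write_csv(out_path)
}

# Tiny stand-in file so the sketch runs without the real data
demo_in  <- tempfile(fileext = ".csv")
demo_out <- tempfile(fileext = ".csv")
write_csv(
  tibble(Origin = c("ORD", "PWM", "MSP"), OriginState = c("IL", "ME", "MN")),
  demo_in
)

subset_flights(demo_in, demo_out, states = c("IL", "MN"))
demo_subset <- read_csv(demo_out, show_col_types = FALSE)
```

With the real data, the same call would take domestic_flights_jan_2016.csv as the input path and midwest_flights_jan_2016.csv as the output path, with whichever states define the subset.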

The Grading Rubric is available at the end of this document.

This is your work area. Add as many code cells as you need.

Question 1: Which of the three airlines (Delta, United, Southwest) had the most consistent departure performance in the Midwest in January 2016?

# Load the dataset

#| message: false
library(tidyverse)
library(gt)
 midwest_flights <- read_csv("midwest_flights_jan_2016.csv", show_col_types = FALSE)

#Clean data and calculate departure delays

#| message: false
#| warning: false
library(tidyverse)
library(gt)
library(skimr)
midwest_flights <- read_csv("midwest_flights_jan_2016.csv", show_col_types = FALSE)

midwest_clean <- midwest_flights |>
  filter(!is.na(CRSDepTime), !is.na(DepTime)) |>
  mutate(
    CRSDepTime = as.numeric(CRSDepTime),
    DepTime = as.numeric(DepTime),
    dep_sched_mins = floor(CRSDepTime / 100) * 60 + CRSDepTime %% 100,
    dep_actual_mins = floor(DepTime / 100) * 60 + DepTime %% 100,
    Dep_Delay = dep_actual_mins - dep_sched_mins,
    time_part = case_when(
      dep_sched_mins < 720 ~ "Morning",
      dep_sched_mins < 1020 ~ "Afternoon",
      TRUE ~ "Evening"))
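The conversion above turns HHMM clock values into minutes after midnight and subtracts them. One case it does not cover is a flight that is scheduled just before midnight and departs just after it, which would come out as a large negative delay. A minimal sketch of one possible adjustment follows. The helper names `hhmm_to_mins()` and `dep_delay_mins()` are hypothetical, and the 12-hour (720-minute) threshold is an assumption that would misclassify any real delay longer than 12 hours.

```r
# Convert an HHMM clock value (e.g. 1345) to minutes after midnight
hhmm_to_mins <- function(hhmm) {
  (hhmm %/% 100) * 60 + hhmm %% 100
}

# Departure delay in minutes, wrapping differences that cross midnight.
# Assumption: a gap of more than 720 minutes means the clock rolled over.
dep_delay_mins <- function(sched_hhmm, actual_hhmm) {
  raw <- hhmm_to_mins(actual_hhmm) - hhmm_to_mins(sched_hhmm)
  ifelse(raw < -720, raw + 1440,
         ifelse(raw > 720, raw - 1440, raw))
}

dep_delay_mins(2350, 10)   # scheduled 23:50, departed 00:10 -> 20
dep_delay_mins(600, 555)   # departed five minutes early -> -5
```

Both functions are vectorized, so they could be used inside `mutate()` in place of the inline arithmetic.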

#Define major Midwest airports:

#| message: false
#| warning: false
library(tidyverse)
library(gt)
selected_airports <- c("ORD", "MSP", "DTW", "STL") 

#Summarize Departure delay by time of day:

library(tidyverse)

dep_delay_by_airport_time_filtered <- midwest_clean |>
  filter(Origin %in% selected_airports) |>
  group_by(Origin, time_part) |>
  summarize(
    avg_dep_delay = mean(Dep_Delay, na.rm = TRUE),
    .groups = "drop")
colnames(dep_delay_by_airport_time_filtered)
[1] "Origin"        "time_part"     "avg_dep_delay"
head(dep_delay_by_airport_time_filtered)
# A tibble: 6 × 3
  Origin time_part avg_dep_delay
  <chr>  <chr>             <dbl>
1 DTW    Afternoon          9.20
2 DTW    Evening            3.80
3 DTW    Morning            7.29
4 MSP    Afternoon          4.65
5 MSP    Evening            4.05
6 MSP    Morning            4.11

#Formatted table:

library(gt)
library(scales)

wide_delay <- dep_delay_by_airport_time_filtered |>
  pivot_wider(names_from = time_part, values_from = avg_dep_delay)

wide_delay |>
  gt() |>
  fmt_number(columns = where(is.numeric), decimals = 1) |>
  cols_label(
    Origin = "Airport",
    Morning = "Morning",
    Afternoon = "Afternoon",
    Evening = "Evening") |>
  
  data_color(
    columns = c(Morning, Afternoon, Evening),
    fn = col_numeric(
      palette = c("white", "orange", "darkred"),
      domain = NULL)) |>
  tab_header(
    title = "Average Departure Delay by Time of Day",
    subtitle = "Selected Midwest Airports – January 2016"
  )
Average Departure Delay by Time of Day
Selected Midwest Airports – January 2016

Airport  Afternoon  Evening  Morning
DTW            9.2      3.8      7.3
MSP            4.6      4.1      4.1
ORD           14.6     11.9     11.9
STL            6.3     10.9      2.6

#Bar chart:

ggplot(dep_delay_by_airport_time_filtered, aes(x = time_part, y = avg_dep_delay, fill = Origin)) +
  geom_col(position = position_dodge(width = 0.8)) +
  labs(
    title = "Average Departure Delay by Time of Day and Airport",
    subtitle = "Selected Midwest Airports – January 2016",
    x = "Time of Day",
    y = "Average Departure Delay (minutes)",
    fill = "Airport"
  ) +
  theme_minimal()

To evaluate which airline had the most consistent departure performance in the Midwest during January 2016, I calculated each airline’s average, median, and standard deviation of departure delay using a filtered dataset (midwest_flights). The boxplot above visualizes the distribution of delays by carrier.

The key metric for consistency is the standard deviation of departure delay. Based on the summary table:

  • Southwest (WN) had the lowest standard deviation, indicating the most consistent performance.
  • Delta (DL) had slightly higher variability.
  • United (UA) showed the widest spread, including more extreme delays (evident from longer whiskers and possible outliers in the boxplot).

While all three carriers had both early and delayed flights, Southwest’s distribution was the tightest around the median, suggesting better scheduling consistency.

Therefore, Southwest Airlines had the most consistent departure performance among the three carriers during this period.
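The per-carrier comparison described above (average, median, and standard deviation of departure delay, plus a boxplot) could be computed along the following lines. This is a sketch under stated assumptions: `Carrier` is an assumed column name, and `toy_flights` holds invented values purely so the code runs, so its numbers are not results from the flights data.

```r
library(dplyr)
library(ggplot2)

# Invented toy values standing in for midwest_clean; `Carrier` is an
# assumed column name holding the two-letter airline code.
toy_flights <- tibble(
  Carrier   = c("DL", "DL", "DL", "UA", "UA", "UA", "WN", "WN", "WN"),
  Dep_Delay = c(-2, 5, 12, -5, 10, 40, 0, 3, 6)
)

# One row per carrier, sorted so the most consistent carrier comes first
carrier_consistency <- toy_flights |>
  filter(Carrier %in% c("DL", "UA", "WN")) |>
  group_by(Carrier) |>
  summarize(
    n_flights    = n(),
    avg_delay    = mean(Dep_Delay, na.rm = TRUE),
    median_delay = median(Dep_Delay, na.rm = TRUE),
    sd_delay     = sd(Dep_Delay, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(sd_delay)

# Boxplot of the delay distribution by carrier
carrier_boxplot <- ggplot(toy_flights, aes(x = Carrier, y = Dep_Delay)) +
  geom_boxplot() +
  labs(x = "Carrier", y = "Departure Delay (minutes)") +
  theme_minimal()
```

Replacing `toy_flights` with the cleaned flights data would give the summary table and boxplot that the narrative refers to, provided the carrier column is named as assumed.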

Question 2: How does the average departure delay vary by time of day and airport?

library(gt)
midwest_summary <- midwest_flights |>
  filter(!is.na(CRSDepTime), !is.na(DepTime)) |>
  mutate(
    CRSDepTime = as.numeric(CRSDepTime),
    DepTime = as.numeric(DepTime),
    dep_sched_mins = floor(CRSDepTime / 100) * 60 + CRSDepTime %% 100,
    dep_actual_mins = floor(DepTime / 100) * 60 + DepTime %% 100,
    Dep_Delay = dep_actual_mins - dep_sched_mins,
    delayed = ifelse(Dep_Delay > 0, 1, 0),
    time_part = case_when(
      CRSDepTime < 1200 ~ "Morning",
      CRSDepTime < 1700 ~ "Afternoon",
      TRUE ~ "Evening"
    )
  ) |>
  group_by(Origin, time_part) |>
  summarize(avg_dep_delay = mean(Dep_Delay, na.rm = TRUE), .groups = "drop")

# Now plot it
ggplot(midwest_summary, aes(x = time_part, y = avg_dep_delay, fill = Origin)) +
  geom_col(position = "dodge") +
  labs(
    title = "Average Departure Delay by Time of Day and Airport",
    x = "Time of Day",
    y = "Average Departure Delay (minutes)",
    fill = "Airport"
  ) +
  theme_minimal()
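The query above creates a `delayed` flag but summarizes only the average delay. One possible use of the flag is the share of flights that left late in each part of the day. The sketch below uses invented toy values (`toy_delays`) so it runs on its own; they are not taken from the flights data.

```r
library(dplyr)

# Invented toy values; in the report this would be the mutated flights data
toy_delays <- tibble(
  time_part = c("Morning", "Morning", "Afternoon", "Afternoon", "Evening", "Evening"),
  Dep_Delay = c(-3, 10, 0, 25, 5, 8)
)

pct_delayed_by_time <- toy_delays |>
  mutate(delayed = if_else(Dep_Delay > 0, 1, 0)) |>   # 1 = left late
  group_by(time_part) |>
  summarize(
    n_flights   = n(),
    pct_delayed = mean(delayed) * 100,                 # share of late flights
    .groups = "drop"
  )
```

A flight that leaves exactly on time (delay of 0) counts as not delayed here, matching the `Dep_Delay > 0` rule in the query.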

#Analysis for a few airports:

library(ggplot2)  
library(knitr)    

ggplot(dep_delay_by_airport_time_filtered, aes(x = time_part, y = avg_dep_delay, fill = Origin)) +
  geom_col(position = position_dodge(width = 0.9)) +
  geom_text(
    aes(label = round(avg_dep_delay, 1)),
    position = position_dodge(width = 0.9),
    vjust = -0.3,
    size = 3
  ) +
  labs(
    title = "Average Departure Delay by Time of Day and Airport",
    subtitle = "Selected Midwest Airports – January 2016",
    x = "Time of Day",
    y = "Average Departure Delay (minutes)",
    fill = "Airport"
  )

#Summary:

The analysis shows how average departure delays vary across Midwest airports by time of day: morning (before 12 pm), afternoon (12 pm to 5 pm), and evening (after 5 pm). Among the four selected airports (ORD, MSP, DTW, and STL), ORD had the highest average delay in every period. Average delays peaked in the afternoon at ORD, DTW, and MSP and in the evening at STL, so the time-of-day pattern is not the same at every airport. Airport operations can play a role alongside time-of-day effects.
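The peak period for each airport can also be read off programmatically. The sketch below retypes the rounded values from the formatted table for the four selected airports and picks the period with the highest average delay at each one. In the report itself the same steps could be applied directly to `dep_delay_by_airport_time_filtered`.

```r
library(dplyr)

# Rounded values copied from the formatted table (minutes)
table_values <- tibble(
  Origin        = rep(c("DTW", "MSP", "ORD", "STL"), each = 3),
  time_part     = rep(c("Afternoon", "Evening", "Morning"), times = 4),
  avg_dep_delay = c(9.2, 3.8, 7.3,
                    4.6, 4.1, 4.1,
                    14.6, 11.9, 11.9,
                    6.3, 10.9, 2.6)
)

# Keep the single highest-delay period per airport
peak_period <- table_values |>
  group_by(Origin) |>
  slice_max(avg_dep_delay, n = 1, with_ties = FALSE) |>
  ungroup()
```

`with_ties = FALSE` keeps exactly one row per airport even if two periods share the same average.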

Submission

To submit your assignment:

  • Change the author name to your name in the YAML portion at the top of this document.
  • Render your document to HTML and publish it to RPubs.
  • Submit the link to your RPubs document in the Brightspace comments section for this assignment.
  • Click on the “Add a File” button and upload your .qmd file for this assignment to Brightspace.

Grading Rubric

Each item is scored at one of four levels: 100% - flawless, 67% - minor issues, 33% - moderate issues, 0% - major issues or not attempted. The percentage in parentheses is the item's share of the overall grade.

  • Question 1 query (22%): Relevant question that is fully answered in the query or queries.
  • Question 1 visualization or table (15%): Visually pleasant and relevant to the question.
  • Question 2 query (22%): Relevant question that is fully answered in the query or queries.
  • Question 2 visualization or table (15%): Visually pleasant and relevant to the question.
  • Data was subsetted separately from the assignment (10%): 100%: You included the description of your subsetted data in your narrative. 67%: You subsetted the data but didn’t include the description in the narrative. 33%: NA. 0%: You didn’t subset the data.
  • Messages and/or errors suppressed from rendered document and all code is shown (8%).
  • Submitted properly to Brightspace (8%): 67%: NA. 33%: NA. 0%: You must submit according to instructions to receive any credit for this portion.