Assignment 7

Open the assign07.qmd file and complete the exercises.

This is a somewhat open-ended assignment.

The file domestic_flights_jan_2016.csv is nearly the same as the me_flights.csv file we work with in the temporal data chapter, except that it has two additional columns (destination city and state) and covers all domestic flights reported for on-time performance in the US in January 2016. You’ll find it helpful to recreate all of the calculated fields we create in the unit.
You are to write a report uploaded to RPubs that compares what you consider interesting metrics for a select group of carriers, airports, or states. You may not parrot the queries from the text or the practice questions. Your report must contain at least two questions that you ask about the flights data. Your answers to those questions must also contain a visualization of the data or a table along with a specific answer in the narrative.

The csv file contains 445,827 observations. You’ll want to subset the data to the area(s) you are looking at, then write it out to a csv file using write_csv(), and start your assignment by importing that csv instead. Do this in a separate script file that you don’t need to submit. In your narrative, describe your subset. I don’t need to see how you subsetted the data because it might cause performance issues when you render the document. Note: you will receive deductions for not using tidyverse syntax in this assignment. That includes the use of filter, mutate, and the up-to-date pipe operator |>.
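The subsetting step described above could look roughly like the following separate script. This is only a sketch: the toy tibble stands in for the full domestic_flights_jan_2016.csv, and the airport list and output path are hypothetical choices, not part of the assignment.

```r
library(tidyverse)

# Toy stand-in for the full file; in a real script this would be
# read_csv("domestic_flights_jan_2016.csv")
all_flights <- tibble(
  FlightDate = as.Date(c("2016-01-01", "2016-01-01", "2016-01-02")),
  Origin     = c("ATL", "PWM", "ORD"),
  Dest       = c("ORD", "ATL", "DFW")
)

# Hypothetical choice of airports to keep
keep_airports <- c("ATL", "ORD", "DFW")

flights_subset <- all_flights |>
  filter(Origin %in% keep_airports)

# Write the subset out so the .qmd only has to import the small file
# (tempdir() is used here only to keep the sketch self-contained)
out_path <- file.path(tempdir(), "flights_subset.csv")
write_csv(flights_subset, out_path)
```

The report itself would then start from read_csv() on the smaller file and describe the subset in the narrative.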

The Grading Rubric is available at the end of this document.

This is your work area. Add as many code cells as you need.

# Load necessary libraries
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)
library(ggplot2)
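The startup banner printed by library(tidyverse) is the kind of message the rubric asks to have suppressed in the rendered document. One possible way to do that, assuming a standard Quarto R code cell, is to set the cell options message and warning to false; wrapping the call in suppressPackageStartupMessages() is a second safeguard that also works outside Quarto.

```r
#| message: false
#| warning: false

# The "#|" lines are Quarto cell options and must sit at the top of the cell;
# when run as plain R they are ordinary comments.
suppressPackageStartupMessages(library(tidyverse))
```

Since tidyverse 2.0.0 already attaches lubridate and ggplot2 (as the banner shows), the two extra library() calls are not strictly needed.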

1 Intro

This report analyzes domestic flights in the US during January 2016.

2 Question 1: Which airport had the highest average departure delay, and how did delays vary by day of the week?

# Define column types explicitly for reading the data
col_types <- cols(
  FlightDate = col_date(),         # Date
  Carrier = col_character(),       # Character (string)
  TailNum = col_character(),       # Character (string)
  FlightNum = col_character(),     # Character (string)
  Origin = col_character(),        # Character (string)
  OriginCityName = col_character(),# Character (string)
  OriginState = col_character(),   # Character (string)
  Dest = col_character(),          # Character (string)
  DestCityName = col_character(),  # Character (string)
  DestState = col_character(),     # Character (string)
  CRSDepTime = col_double(),       # Numeric (double) for scheduled departure time
  DepTime = col_double(),          # Numeric (double) for actual departure time
  WheelsOff = col_double(),        # Numeric (double) for wheels off time
  WheelsOn = col_double(),         # Numeric (double) for wheels on time
  CRSArrTime = col_double(),       # Numeric (double) for scheduled arrival time
  ArrTime = col_double(),          # Numeric (double) for actual arrival time
  Cancelled = col_integer(),       # Integer for cancelled flights
  Diverted = col_integer(),        # Integer for diverted flights
  CRSElapsedTime = col_double(),   # Numeric (double) for scheduled elapsed time
  ActualElapsedTime = col_double(),# Numeric (double) for actual elapsed time
  Distance = col_double()          # Numeric (double) for distance
)

# Load the dataset with the defined column types
flights <- read_csv("flights_subset.csv", col_types = col_types)

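# Derive departure delay (in minutes) and day of week
flights <- flights |>
  mutate(
    # Assumption: CRSDepTime and DepTime are hhmm clock times (e.g. 1430 = 2:30 pm),
    # so convert each to minutes after midnight before subtracting
    sched_min = (CRSDepTime %/% 100) * 60 + CRSDepTime %% 100,
    dep_min = (DepTime %/% 100) * 60 + DepTime %% 100,
    # Wrap differences into [-720, 720) so a departure just past midnight
    # is not read as many hours early
    DepDelay = (dep_min - sched_min + 720) %% 1440 - 720,
    day = wday(FlightDate, label = TRUE)  # Day of the week
  ) |>
  select(-sched_min, -dep_min)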
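Because CRSDepTime and DepTime appear to be clock times stored as hhmm numbers (an assumption based on the usual on-time performance format), the delay in minutes depends on how the two values are subtracted. The sketch below uses made-up values and a hypothetical helper, hhmm_to_minutes(), to compare a direct subtraction with a conversion to minutes after midnight.

```r
library(tidyverse)

# Hypothetical helper: 1430 -> 14 * 60 + 30 = 870 minutes after midnight
hhmm_to_minutes <- function(x) (x %/% 100) * 60 + x %% 100

delay_example <- tibble(
  CRSDepTime = c(1255, 2350, 600),  # scheduled: 12:55, 23:50, 06:00
  DepTime    = c(1305,   10, 555)   # actual:    13:05, 00:10, 05:55
) |>
  mutate(
    # Direct subtraction of hhmm values is not a number of minutes
    naive_delay = DepTime - CRSDepTime,
    raw_diff = hhmm_to_minutes(DepTime) - hhmm_to_minutes(CRSDepTime),
    # Wrap into [-720, 720) so the 23:50 -> 00:10 flight counts as 20 minutes late
    DepDelay = (raw_diff + 720) %% 1440 - 720
  )

delay_example
```

In this toy data the direct subtraction gives 50, -2340, and -45, while the converted version gives 10, 20, and -5 minutes. The 12-hour wrap is a simplifying assumption; a flight delayed by more than 12 hours would be misread.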

# Step 1: Calculate average departure delay by airport and day of week
# (DepDelay and day were already added to flights above)
avg_dep_delay <- flights |>
  filter(!is.na(DepDelay)) |>  # Remove rows where DepDelay is NA
  group_by(Origin, day) |>
  summarise(
    avg_dep_delay = mean(DepDelay, na.rm = TRUE),  # Calculate average delay
    .groups = "drop"
  )

# Subset the data to three of the busiest airports
# (the column is named Origin; column names are case-sensitive)
flights_subset <- flights |>
  filter(Origin %in% c("ATL", "ORD", "DFW"))

# Step 2: Plot average departure delay by airport and day of week
ggplot(avg_dep_delay, aes(x = day, y = avg_dep_delay, fill = Origin)) +
  geom_col(position = "dodge") +
  labs(
    title = "Average Departure Delay by Day of Week",
    x = "Day of Week",
    y = "Average Departure Delay (minutes)",
    fill = "Airport"
  ) +
  theme_minimal()

# Step 3: Calculate cancellation percentage by airport
cancel_pct <- flights |> 
  group_by(Origin) |> 
  summarise(
    cancelled = sum(Cancelled, na.rm = TRUE),
    total = n(),
    pct_cancelled = (cancelled / total) * 100
  )

# Question 2: What percentage of flights were cancelled at each airport?

# Step 4: Display cancellation percentage table
cancel_pct |> 
  arrange(desc(pct_cancelled)) |> 
  knitr::kable(digits = 2, col.names = c("Airport", "Cancelled Flights", "Total Flights", "% Cancelled"))

Airport	Cancelled Flights	Total Flights	% Cancelled

# Question 1: Which airport had the highest average departure delay, and how did delays vary by the day of the week?
# Calculate the average departure delay by airport and day of the week
avg_dep_delay <- flights_subset |>
  group_by(Origin, day) |>
  summarise(avg_dep_delay = mean(DepDelay, na.rm = TRUE), .groups = "drop")

# Plot the average departure delay
ggplot(avg_dep_delay, aes(x = day, y = avg_dep_delay, fill = Origin)) +
  geom_col(position = "dodge") +
  labs(
    title = "Average Departure Delay by Day of Week",
    x = "Day of Week",
    y = "Average Departure Delay (minutes)",
    fill = "Airport"
  ) +
  theme_minimal()
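The chart shows how delays vary by day, but the first half of Question 1 (which airport had the highest average delay) also needs a single, specific answer for the narrative. One way to get it, sketched here on made-up delays rather than the real data, is to rank airports by their overall mean and pull the top row with slice_max(). The tibble toy_delays and its values are purely illustrative.

```r
library(tidyverse)

# Made-up delays in minutes, for illustration only
toy_delays <- tibble(
  Origin   = c("ATL", "ATL", "ORD", "ORD", "DFW"),
  DepDelay = c(5, 15, 30, 10, 0)
)

# Overall average delay per airport, highest first
airport_rank <- toy_delays |>
  group_by(Origin) |>
  summarise(
    avg_dep_delay = mean(DepDelay, na.rm = TRUE),
    flights = n()
  ) |>
  arrange(desc(avg_dep_delay))

# The single airport with the highest average delay
worst_airport <- airport_rank |>
  slice_max(avg_dep_delay, n = 1)

worst_airport
```

In the report, the same pipeline applied to flights_subset would give the airport name and average to quote in the written answer.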

2.1 Submission

To submit your assignment:

Change the author name to your name in the YAML portion at the top of this document.
Render your document to html and publish it to RPubs.
Submit the link to your RPubs document in the Brightspace comments section for this assignment.
Click on the “Add a File” button and upload your .qmd file for this assignment to Brightspace.

2.2 Grading Rubric

Item (percent overall)	100% - flawless	67% - minor issues	33% - moderate issues	0% - major issues or not attempted
Question 1 query. (22%)	Relevant question that is fully answered in the query or queries.
Question 1 visualization or table. (15%)	Visually pleasant and relevant to the question.
Question 2 query. (22%)	Relevant question that is fully answered in the query or queries.
Question 2 visualization or table. (15%)	Visually pleasant and relevant to the question.
Data was subsetted separately from the assignment. (10%)	You included the description of your subsetted data in your narrative.	You subsetted the data but didn’t include the description in the narrative.	NA	You didn’t subset the data.
Messages and/or errors suppressed from rendered document and all code is shown. (8%)
Submitted properly to Brightspace (8%)		NA	NA	You must submit according to instructions to receive any credit for this portion.