Assignment 7

Open the assign07.qmd file and complete the exercises.

This is a somewhat open-ended assignment.

The file domestic_flights_jan_2016.csv is nearly the same as the me_flights.csv file we work with in the temporal data chapter except it has two additional columns for destination city and state and it is for all domestic flights reported for on-time performance in the US for January 2016. You’ll find it helpful to recreate all of the calculated fields we create in the unit.
You are to write a report uploaded to RPubs that compares what you consider interesting metrics for a select group of carriers, airports, or states. You may not parrot the queries from the text or the practice questions. Your report must contain at least two questions that you ask about the flights data. Your answers to those questions must also contain a visualization of the data or a table along with a specific answer in the narrative.

The csv file contains 445,827 observations. You’ll want to subset the data to the area(s) you are looking at, then write it out to a csv file using write_csv(), and start your assignment by importing that csv instead. Do this in a separate script file that you don’t need to submit. In your narrative, describe your subset. I don’t need to see how you subsetted the data because it might cause performance issues when you render the document. Note: you will receive deductions for not using tidyverse syntax in this assignment. That includes the use of filter, mutate, and the up-to-date pipe operator |>.

The Grading Rubric is available at the end of this document.

This is your work area. Add as many code cells as you need.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)

subset_flights <- read_csv("subset_flights_CO_ME_UA_AA.csv")

Rows: 11831 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (9): FlightDate, Carrier, TailNum, Origin, OriginCityName, OriginState,...
dbl (12): FlightNum, CRSDepTime, DepTime, WheelsOff, WheelsOn, CRSArrTime, A...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Description of Subset: To simplify the data visualization, I came up with my two questions before subsetting the data. This gave me a clear idea of what to include in my data set and what to leave out. I focused on United Airlines and American Airlines, since these are the airlines I fly most often, and on Maine and Colorado, since they are the states I spend the most time in.

To subset the data, I filtered for flights where either the origin or destination state was Colorado (CO) or Maine (ME), and the carrier was either UA or AA. I kept all original variables to allow for flexibility in my analysis, in case I wanted to adjust my research questions later. However, I did transform a few key variables to support my analysis more effectively. For example, I converted the scheduled and actual arrival times into proper datetime format using the lubridate package, which allowed me to calculate accurate arrival delays in minutes.
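The filter described above could look something like the following sketch. It is not the script I actually ran: the small `toy_flights` tibble and the helper name `subset_states_carriers()` are made up for illustration, while the column names `OriginState`, `DestState`, and `Carrier` come from the data set.

```r
library(tidyverse)

# Hypothetical helper: keep flights that touch Colorado or Maine
# (as origin or destination) and are flown by United or American.
subset_states_carriers <- function(flights,
                                   states = c("CO", "ME"),
                                   carriers = c("UA", "AA")) {
  flights |>
    filter(
      OriginState %in% states | DestState %in% states,
      Carrier %in% carriers
    )
}

# Made-up rows, only to show which flights survive the filter.
toy_flights <- tibble(
  Carrier     = c("UA", "AA", "DL", "UA", "AA"),
  OriginState = c("CO", "NY", "ME", "TX", "IL"),
  DestState   = c("CA", "ME", "CO", "FL", "CO")
)

toy_subset <- subset_states_carriers(toy_flights)
toy_subset
```

In a separate script, the same helper would be applied to the result of `read_csv("domestic_flights_jan_2016.csv")`, and the output written with `write_csv()` to the file imported at the top of this report.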

Question 1: Between Colorado and Maine, which state had a higher rate of flight cancellations for incoming United and American Airlines flights?

To address this question, I grouped all flights by destination state (Colorado or Maine) and carrier (UA and AA), then calculated the percentage of flights that were canceled. This method allowed me to directly compare how often each airline canceled flights to each state.

The results showed that American Airlines had the higher combined cancellation rate at around 10%, while United Airlines was at around 2%. United Airlines had no cancellations on flights to Maine, but almost 8% of all American Airlines flights to Maine in January 2016 were canceled.

This may reflect weather, route volume, or airline-level scheduling decisions. By isolating only two states and two major carriers, this analysis offers a focused look at cancellation patterns that are personally relevant to me.

This question is relevant to me because I have flown both of these airlines into and out of both states, and several times I have had a flight canceled without being told why. While I have never flown to either state in January, these numbers are interesting because they may be higher than in other months, given the brutal winters that both states face.

The following graph visualizes these results.

# Cancellation rate (%) for each destination state and carrier.
# .groups = "drop" returns an ungrouped tibble and silences the
# summarise() regrouping message.
cancellation_summary <- subset_flights |>
  filter(!is.na(Cancelled)) |>
  group_by(DestState, Carrier) |>
  summarise(
    total_flights = n(),
    cancelled_flights = sum(Cancelled == 1),
    cancellation_rate = cancelled_flights / total_flights * 100,
    .groups = "drop"
  ) |>
  arrange(desc(cancellation_rate))

cancellation_summary |>
  filter(DestState %in% c("ME", "CO")) |>
  ggplot(aes(x = DestState, y = cancellation_rate, fill = Carrier)) +
  geom_col(position = "dodge") +
  labs(title = "Cancellation Rates for Flights Arriving in Maine and Colorado",
       x = "State",
       y = "Cancellation Rate (%)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0),
        plot.title = element_text(size = 14, face = "bold"))
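The summary above is broken out by state and carrier together, so it does not directly give a single rate per carrier or per state. A minimal sketch of one way to get those single-grouping rates follows. The helper name `cancellation_rate_by()` and the `toy_flights` values are hypothetical and are not results from the flights data; only the column names `DestState`, `Carrier`, and `Cancelled` come from the data set.

```r
library(tidyverse)

# Hypothetical helper: cancellation rate (%) for arrivals in the chosen
# states, grouped by whichever single column is passed in.
cancellation_rate_by <- function(flights, group_col, states = c("ME", "CO")) {
  flights |>
    filter(DestState %in% states, !is.na(Cancelled)) |>
    group_by({{ group_col }}) |>
    summarise(
      total_flights = n(),
      cancellation_rate = mean(Cancelled == 1) * 100,
      .groups = "drop"
    )
}

# Made-up rows for illustration only.
toy_flights <- tibble(
  DestState = c("ME", "ME", "CO", "CO", "CO", "TX"),
  Carrier   = c("AA", "UA", "AA", "AA", "UA", "AA"),
  Cancelled = c(1, 0, 0, 1, 0, 0)
)

carrier_rates <- cancellation_rate_by(toy_flights, Carrier)
state_rates   <- cancellation_rate_by(toy_flights, DestState)
```

Calling the helper with `DestState` answers the state-level question, and calling it with `Carrier` gives the per-carrier comparison.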

Question 2: Between United and American Airlines, which had longer average arrival delays for flights arriving in Colorado and Maine?

To address question 2, I used the difference between the actual and scheduled arrival times, converted to minutes using lubridate. I then calculated the average delay by carrier for flights arriving in Colorado and Maine.

The results show that American Airlines flights arriving in Colorado were on average about 0.25 minutes late, while AA flights arriving in Maine were about 0.25 minutes ahead of schedule. United Airlines flights were neither ahead of nor behind schedule when flying into Maine but were on average 0.35 minutes ahead of schedule when arriving in Colorado.

These differences are negligible when deciding which flights to buy, since none of the averages is more than 30 seconds late or early.

The following graph visualizes these results.

arrival_delay_summary <- subset_flights |>
  filter(!is.na(ArrTime) & !is.na(CRSArrTime)) |>
  mutate(arr_delay_mins = as.numeric(difftime(ArrTime, CRSArrTime, units = "mins"))) |>
  group_by(DestState, Carrier) |>
  summarise(
    average_arrival_delay = mean(arr_delay_mins, na.rm = TRUE),
    num_flights = n(),
    .groups = "drop"
  )

arrival_delay_summary |>
  filter(DestState %in% c("ME", "CO")) |>
  ggplot(aes(x = DestState, y = average_arrival_delay, fill = Carrier)) +
  geom_col(position = "dodge") +
  labs(title = "Average Arrival Delays for Maine and Colorado",
       x = "Destination State",
       y = "Average Arrival Delay (minutes)") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 0),
    plot.title = element_text(size = 16, face = "bold")
  )
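One caveat on the delay calculation: the import message lists `CRSArrTime` among the dbl columns, so the arrival times may still be stored as HHMM clock numbers (for example, 1530 for 3:30 pm) rather than date-times. If that is the case, `difftime()` in recent versions of R would treat those numbers as seconds, and the averages above would not be true minutes. The sketch below shows one alternative that uses plain arithmetic instead of lubridate. It assumes HHMM encoding and that no flight arrives more than 12 hours early or late; the helper names and the toy values are hypothetical.

```r
library(tidyverse)

# Convert an HHMM clock value (e.g. 1530) to minutes after midnight.
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + hhmm %% 100
}

# Arrival delay in minutes. Gaps larger than 12 hours are assumed to be
# flights that crossed midnight, so they are shifted by one day (1440 min).
arrival_delay_minutes <- function(actual_hhmm, scheduled_hhmm) {
  delay <- hhmm_to_minutes(actual_hhmm) - hhmm_to_minutes(scheduled_hhmm)
  case_when(
    delay < -720 ~ delay + 1440,
    delay > 720  ~ delay - 1440,
    TRUE         ~ delay
  )
}

# Made-up schedule and arrival times for illustration only.
toy_arrivals <- tibble(
  CRSArrTime = c(1500, 2350, 10, 905),
  ArrTime    = c(1530, 15, 2355, 900)
) |>
  mutate(arr_delay_mins = arrival_delay_minutes(ArrTime, CRSArrTime))

toy_arrivals
```

In the toy data, a flight scheduled for 23:50 that lands at 00:15 counts as 25 minutes late rather than almost a day early, which is the case the midnight adjustment is meant to handle.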

1 Submission

To submit your assignment:

Change the author name to your name in the YAML portion at the top of this document.
Render your document to HTML and publish it to RPubs.
Submit the link to your RPubs document in the Brightspace comments section for this assignment.
Click on the “Add a File” button and upload your .qmd file for this assignment to Brightspace.

2 Grading Rubric

Item (percent overall)	100% - flawless	67% - minor issues	33% - moderate issues	0% - major issues or not attempted
Question 1 query. (22%)	Relevant question that is fully answered in the query or queries.
Question 1 visualization or table. (15%)	Visually pleasant and relevant to the question.
Question 2 query. (22%)	Relevant question that is fully answered in the query or queries.
Question 2 visualization or table. (15%)	Visually pleasant and relevant to the question.
Data was subsetted separately from the assignment. (10%)	You included the description of your subsetted data in your narrative.	You subsetted the data but didn’t include the description in the narrative.	NA	You didn’t subset the data.
Messages and/or errors suppressed from rendered document and all code is shown. (8%)
Submitted properly to Brightspace (8%)		NA	NA	You must submit according to instructions to receive any credit for this portion.