Dataset Overview and Key Metrics

The dataset contains detailed flight records from 2021, capturing various aspects of airline operations. It includes information such as the date of the flight, the operating airline, and the departure and arrival airports. Key metrics like cancellation and diversion status indicate operational challenges, while departure and arrival delays measured in minutes provide insights into punctuality. With identifiers for both airlines and airports, this comprehensive dataset enables in-depth analysis of flight performance, cancellations, delays, and trends across different airlines and time periods.

This section provides a summary of key metrics related to flight performance, including departure delays, cancellations, arrival delays, and diversion statuses. Below are the summary statistics for the relevant columns in the dataset, along with counts of any missing values for each metric.

# Load necessary libraries for data manipulation and visualization 
library(data.table)
library(ggplot2)
library(dplyr)
library(plotly)
# Specify the file path for the dataset
filename <- "C://Users//wesma//OneDrive//Documents//DS-736//R_datafiles//Combined_Flights_2021.csv"
# Read the CSV file into a data frame, treating empty strings and NA as NA
df <- fread(filename, na.strings = c("NA", ""))
# Select relevant columns from the data frame for analysis
relevant_columns <- df %>%
  select(DepDelay, Cancelled, ArrDelay, Diverted, Airline, Month)
# Generate summary statistics for the selected relevant columns
summary(relevant_columns)
##     DepDelay       Cancelled          ArrDelay        Diverted      
##  Min.   :-105.00   Mode :logical   Min.   :-107.00   Mode :logical  
##  1st Qu.:  -6.00   FALSE:6200853   1st Qu.: -16.00   FALSE:6296889  
##  Median :  -2.00   TRUE :111018    Median :  -7.00   TRUE :14982    
##  Mean   :   9.47                   Mean   :   3.29                  
##  3rd Qu.:   6.00                   3rd Qu.:   6.00                  
##  Max.   :3095.00                   Max.   :3089.00                  
##  NA's   :108413                    NA's   :126001                   
##    Airline              Month      
##  Length:6311871     Min.   : 1.00  
##  Class :character   1st Qu.: 4.00  
##  Mode  :character   Median : 7.00  
##                     Mean   : 6.97  
##                     3rd Qu.:10.00  
##                     Max.   :12.00  
## 
# Calculate the number of missing values (NA) for each relevant column
colSums(is.na(relevant_columns))
##  DepDelay Cancelled  ArrDelay  Diverted   Airline     Month 
##    108413         0    126001         0         0         0

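The summary shows that a typical flight left slightly early (median departure delay of -2 minutes) and arrived early (median arrival delay of -7 minutes), but both distributions have long right tails: the mean departure delay is 9.47 minutes and the maximum is 3,095 minutes, more than two days. Of the 6,311,871 flights recorded, 111,018 (about 1.8%) were canceled and 14,982 (about 0.2%) were diverted. Departure delay is missing for 108,413 flights and arrival delay for 126,001; the latter is within one of the combined count of canceled and diverted flights (126,000), which suggests that most missing delay values belong to flights that did not complete their scheduled route. Overall, these findings show that most flights arrived early or on time, alongside a small number of extreme outliers.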
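
The relationship between missing delay values and flight status can be checked directly with a cross-tabulation. The sketch below is illustrative only: `toy` is a small hypothetical data frame with made-up values, not the real flight table, although its column names follow the ones used above. It uses base R only.

```r
# Toy stand-in for the flight table (hypothetical values, not the real data);
# the column names match those used in the analysis above.
toy <- data.frame(
  Cancelled = c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE),
  Diverted  = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE),
  DepDelay  = c(-3, 12, NA, 5, NA, 0),
  ArrDelay  = c(-8, 20, NA, NA, NA, -1)
)

# Label each flight by its operational status.
toy$Status <- ifelse(toy$Cancelled, "Cancelled",
                     ifelse(toy$Diverted, "Diverted", "Completed"))

# Cross-tabulate status against whether the arrival delay is missing.
missing_by_status <- table(Status = toy$Status,
                           ArrDelayMissing = is.na(toy$ArrDelay))
print(missing_by_status)

# Count missing arrival delays that are NOT explained by a
# cancellation or a diversion.
unexplained <- sum(is.na(toy$ArrDelay) & !toy$Cancelled & !toy$Diverted)
print(unexplained)
```

Running the same two steps on `df` instead of `toy` would show how many of the missing arrival delays fall outside the canceled and diverted flights.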

Visual Analysis of Flight Performance


Monthly Insights: Cancellations and Departure Delays

This graph shows monthly cancellations alongside the average departure delay. The blue bars represent the total number of canceled flights in each month (left axis), while the red line shows the average departure delay in minutes (right axis). Peaks in the bars mark the months with the most cancellations, which may be linked to weather or operational disruptions. Where cancellations and delays rise together, the pattern points to a shared underlying cause, although the chart alone does not establish one.

Initially, cancellations were relatively low, but they spiked during mid-year, reflecting possible seasonal challenges or operational issues. Delays varied alongside cancellations, with noticeable peaks during certain months, indicating that not only did more flights get canceled, but those that remained operational also experienced longer delays.

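# Per month: count cancelled flights and average the departure delay,
# ignoring flights with no recorded delay
monthly_summary <- df %>%
  group_by(Month) %>%
  summarize(Total_Cancelled = sum(Cancelled),
            Average_DepDelay = mean(DepDelay, na.rm = TRUE),
            .groups = "drop")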

monthly_summary$Month <- factor(monthly_summary$Month, levels = 1:12, labels = month.abb)

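# Rescale the delay series so it can be drawn against the cancellation axis;
# the secondary axis divides by the same factor to display minutes again
max_cancelled <- max(monthly_summary$Total_Cancelled)
max_delay <- max(monthly_summary$Average_DepDelay)
scaling_factor <- max_cancelled / max_delay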


p1 <- ggplot(monthly_summary, aes(x = Month)) +
  geom_bar(aes(y = Total_Cancelled), stat = "identity", color="darkblue",fill = "lightblue") +
  # linewidth replaces the deprecated size argument for lines (ggplot2 >= 3.4.0)
  geom_line(aes(y = Average_DepDelay * scaling_factor), 
            color = "red", linewidth = 1, group = 1) +
  scale_y_continuous(sec.axis = sec_axis(~./scaling_factor,name = "Average Departure Delay (minutes)")) +
  labs(title = "Monthly Cancellations and Average Departure Delay",x = "Month", y = "Number of Flights Cancelled") +
  theme(plot.title = element_text(hjust = 0.5), axis.text.y = element_text(color = "darkblue"),axis.title.y = element_text(color = "darkblue"),
      axis.text.y.right = element_text(color = "red"),axis.title.y.right = element_text(color = "red"))
p1
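
The relationship between monthly cancellations and average departure delay discussed above can also be expressed as a number rather than read off the chart. The sketch below is a minimal illustration, not part of the original analysis: `toy_summary` is a hypothetical table with invented values that mimics the columns of `monthly_summary`. With only twelve monthly points in the real data, any coefficient should be read as a rough indication.

```r
# Hypothetical monthly summary with the same columns as monthly_summary
# (the values are invented for illustration).
toy_summary <- data.frame(
  Month            = factor(month.abb[1:6], levels = month.abb[1:6]),
  Total_Cancelled  = c(100, 250, 80, 60, 120, 300),
  Average_DepDelay = c(5, 9, 4, 3, 8, 14)
)

# Pearson correlation measures the linear association.
r_pearson <- cor(toy_summary$Total_Cancelled,
                 toy_summary$Average_DepDelay)

# Spearman correlation uses ranks, so it is less sensitive to one
# extreme month dominating the result.
r_spearman <- cor(toy_summary$Total_Cancelled,
                  toy_summary$Average_DepDelay,
                  method = "spearman")

print(c(pearson = r_pearson, spearman = r_spearman))
```

Replacing `toy_summary` with `monthly_summary` would give the corresponding coefficients for the 2021 data.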

Top 10 Airlines: Monthly Average Arrival Delay Comparison

The line graph shows the average arrival delay by month for the ten airlines with the most flights in 2021. Each line represents a different airline, making it easy to see how its performance changes over time. This graph helps viewers compare airlines to see which ones are more punctual or improving. Knowing these trends can help travelers choose reliable airlines.

# Ten airlines with the most flights (slice_max supersedes top_n in dplyr >= 1.0.0)
airlines_df <- df %>%
  count(Airline) %>%
  slice_max(n, n = 10)

top_Airlines_df <- df %>%
  filter(Airline %in% airlines_df$Airline) %>%
  filter(!is.na(ArrDelay))

monthly_airline_delay <- top_Airlines_df %>%
  group_by(Month, Airline) %>%
  summarize(Average_ArrDelay = mean(ArrDelay), .groups = 'drop')

monthly_airline_delay$Month <- factor(monthly_airline_delay$Month, levels = 1:12, labels = month.abb)

p2 <- ggplot(monthly_airline_delay, aes(x = Month, y = Average_ArrDelay, color = Airline, group=Airline)) +
  geom_line(linewidth = 1) +
  labs(title = "Average Arrival Delay by Top 10 Airlines Over Months",x = "Month", y = "Average Arrival Delay (minutes)") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5))+
  geom_point(shape=21, size=2, color="black", fill="white")
p2

Monthly Flight Statistics: Total Flights vs. Average Departure Delays

This nested donut chart shows the monthly distribution of flights alongside average departure delays. The outer ring shows each month's share of total flights, while the inner ring shows each month's share of the summed monthly average delays. Comparing the two rings indicates whether months with more flights also account for a larger share of delay, although the chart shows association rather than cause. Airlines may find this comparison useful when planning for peak travel periods.

# Per month: number of flights and average departure delay
flights_month <- df %>%
  group_by(Month) %>%
  summarise(TotalFlights = n(), AverageDelay = mean(DepDelay, na.rm = TRUE)) %>%
  mutate(Month = factor(Month, levels = 1:12, labels = month.abb))

p3 <- plot_ly(hole = 0.7) %>%
  layout(title = "", showlegend = TRUE,
         legend = list(title = list(text = "Months"), traceorder = "normal"),
         annotations = list(text = "Monthly Flight Distribution<br> and Average Departure Delays",
                            showarrow = FALSE)) %>%
  # Outer ring: each month's share of total flights
  add_trace(data = flights_month, labels = ~Month, values = ~TotalFlights,
            textinfo = "label+percent", type = "pie", textposition = "inside", name = "Total Flights") %>%
  # Inner ring: each month's share of the summed monthly average delays
  add_trace(data = flights_month, labels = ~Month, values = ~AverageDelay,
            textinfo = "label+percent", type = "pie", textposition = "inside", name = "Average Delay",
            domain = list(
              x = c(0.16, 0.84),
              y = c(0.16, 0.84)))
p3
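
The percentages printed on each ring are shares of a column total: a pie trace divides each value by the sum of all values in that trace. The sketch below reproduces that arithmetic on a hypothetical three-month table (`toy_month`, with invented numbers) to show how the two rings can tell different stories for the same month.

```r
# Hypothetical three-month table with the same columns as flights_month
# (the numbers are invented for illustration).
toy_month <- data.frame(
  Month        = c("Jan", "Feb", "Mar"),
  TotalFlights = c(100, 200, 700),
  AverageDelay = c(4, 6, 10)
)

# Outer ring: each month's share of all flights.
flight_share <- round(100 * toy_month$TotalFlights / sum(toy_month$TotalFlights), 1)

# Inner ring: each month's share of the SUM of the monthly averages.
# This is not a flight-weighted quantity; it only compares the averages
# with one another.
delay_share <- round(100 * toy_month$AverageDelay / sum(toy_month$AverageDelay), 1)

print(data.frame(Month = toy_month$Month, flight_share, delay_share))
```

In this toy table, March carries 70% of the flights but 50% of the summed averages, so the inner ring is best read as a relative ranking of months rather than as a share of total delay minutes.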

Average Departure Delays by State and Month

The heatmap displays average departure delays by state and month, with color indicating how long the delays are. This design highlights differences across regions and seasons, showing which states tend to have more significant delays. This information can help airlines identify areas for improvement, while travelers can understand where they might face challenges with on-time departures.

# Average departure delay for each origin state and month
average_delay_state_month <- df %>%
  group_by(OriginState, Month) %>%
  summarise(StateAverageDelay = mean(DepDelay, na.rm = TRUE), .groups = "drop")

average_delay_state_month$Month <- factor(average_delay_state_month$Month,levels = 1:12,labels = month.abb)
  
p4 <- ggplot(average_delay_state_month, aes(x = Month, y=OriginState, fill=StateAverageDelay)) +
  geom_tile(color="black") +
  geom_text(aes(label=round(StateAverageDelay, 1)), color="black", size=3)+
  labs(title = "Average Departure Delay by State", x = "Month", y = "State") +
  scale_fill_gradient(low="white", high="red", name="Average Delay (min)")+
  theme_minimal()+
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_discrete(breaks=month.abb)
p4

Flight Status Distribution for Top 10 Airlines

The set of pie charts illustrates the flight status for the top ten airlines, displaying the proportions of normal, cancelled, and diverted flights. This visualization facilitates easy comparisons between airlines, emphasizing their reliability. Travelers can use this information to make informed choices, while airline management can identify areas for improvement in operational efficiency.

# Names of the ten airlines with the most flights
top_airlines <- df %>%
  count(Airline, name = "TotalFlights") %>%
  slice_max(TotalFlights, n = 10) %>%
  pull(Airline)

cancel_df <- df %>%
  filter(Airline %in% top_airlines) %>%
  select(Airline, Cancelled, Diverted) %>%
  mutate(status = ifelse(Cancelled==1, "Cancelled", ifelse(Diverted==1, "Diverted", "Normal"))) %>%
  group_by(Airline, status)%>%
  summarise(n=length(status), .groups='keep') %>%
  group_by(Airline)%>%
  mutate(percent_of_total = round(100*n/sum(n),1)) %>%
  ungroup() %>%
  data.frame()

    
cancel_df$status = factor(cancel_df$status, levels=c("Normal","Cancelled","Diverted"))
p5 <- ggplot(data=cancel_df, aes(x="", y=n, fill=status)) +
  geom_bar(stat="identity", position="fill") +
  coord_polar(theta="y", start=0) +
  labs(fill="Flight Status", x=NULL, y=NULL, title="Flight Status for Top 10 Airlines",
       caption="Slices under 0.5% are not labeled") +
  theme_light() +
  theme(plot.title=element_text(hjust=0.5), 
        axis.text=element_blank(), 
        axis.ticks = element_blank(),
        panel.grid=element_blank()) +
  facet_wrap(~Airline, nrow = 3, ncol = 4) +
  scale_fill_brewer(palette = "Reds")+
  geom_text(aes(x=1.7, label=ifelse(percent_of_total>0.4,paste0(percent_of_total, "%"),"")), 
            size=4,
            position=position_fill(vjust=0.5))
p5

Summary

This analysis focuses on flight performance in 2021 by examining monthly cancellations, delays by airline, and service variations across regions. By visualizing monthly cancellations alongside average departure delays, we can easily identify trends and patterns throughout the year. This helps us see when cancellations were high and how they related to delays, providing a clearer picture of airline performance over time.

The pie charts for the top airlines break down their flight statuses (normal, canceled, and diverted), offering insight into their reliability. If an airline has a high share of canceled or diverted flights, it raises concerns about its service quality.

The heatmap displays average departure delays by state and month, highlighting where delays are most significant. This visualization helps identify which states experience more frequent delays, making it useful for travelers considering their options and for airlines looking to improve their operations.

Together, these insights enhance our understanding of the aviation industry in 2021, pinpointing key issues and opportunities for improvement. This analysis serves as a valuable resource for travelers, airline management, and anyone interested in air travel trends.