Introduction

The objective of this assignment is to read in a csv file that has the flight delays for two airlines in a wide format. Then to use tidyr and dplyr to tidy and transform the data. Lastly, to perform an analysis to compare the arrival delays for the two airlines.

Tidying and Transforming Data

First, load required packages.

library(tidyverse)
library(kableExtra)

Next, read in the csv file and take a look.

# df <- read.csv(choose.files())
df <-read.csv("https://raw.githubusercontent.com/klgriffen96/spring23_data607_hw5/main/flights.csv")
kable(head(df))  |>
  kable_styling("striped")
X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
ALASKA on time 497 221 212 503 1,841
delayed 62 12 20 102 305
NA NA NA
AMWEST on time 694 4,840 383 320 201
delayed 117 415 65 129 61

Rename headers.

df <- df |> rename(Airline = colnames(df)[1], Status = colnames(df)[2])

Now, remove unnecessary blanks, fill in airlines, and ensure all are numeric.

# Take only the rows which have values
df <- subset(df, nchar(Status) > 0)

# Fill in airline, even when it isn't there
for (i in 1:dim(df)[1]){
  temp <- nchar(df$Airline[i])
  if (temp > 0) {
    airline <- df$Airline[i]
  } else {
    df$Airline[i] <- airline
  }
}

# Make sure all values are numeric
n <- colnames(df)
for (i in 3:length(n)){
  df[[n[i]]] <- as.numeric(str_remove(df[[n[i]]], ","))
}
kable(head(df))  |>
  kable_styling("striped")
Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
1 ALASKA on time 497 221 212 503 1841
2 ALASKA delayed 62 12 20 102 305
4 AMWEST on time 694 4840 383 320 201
5 AMWEST delayed 117 415 65 129 61

Perform the pivot to go from wide format to long format so that each Airline, Airport combination is a row, then back to a wide format so that the on time and delay are their own columns.

df_long <-df |>
   pivot_longer(cols = Los.Angeles:Seattle,
               names_to = "Airport", 
               values_to = "count")

df_wide <- df_long |>
   pivot_wider(names_from = Status,
               values_from = count
   )

kable(head(df_wide))  |>
  kable_styling("striped")
Airline Airport on time delayed
ALASKA Los.Angeles 497 62
ALASKA Phoenix 221 12
ALASKA San.Diego 212 20
ALASKA San.Francisco 503 102
ALASKA Seattle 1841 305
AMWEST Los.Angeles 694 117

Analysis of Arrival Delays

To analyze the arrival delays, do a basic comparison of the overall percent of delayed flights were for each airline, then each airline by each location.

percent_delayed <- df_wide |>
     group_by(Airline, Airport) |> 
     summarize(
         Percent_Delayed = round(`delayed`*100/(`delayed` + `on time`))
         , .groups = "drop") |>
    arrange(desc(Percent_Delayed))

overall_delayed <- df_wide |>
    group_by(Airline) |>
    summarise(
        Overall_Percent_Delayed = round(sum(`delayed`)*100/sum(`delayed` + `on time`))
    )

# Display table - head 
kable(head(percent_delayed))  |>
  kable_styling("striped")
Airline Airport Percent_Delayed
AMWEST San.Francisco 29
AMWEST Seattle 23
ALASKA San.Francisco 17
AMWEST San.Diego 15
ALASKA Seattle 14
AMWEST Los.Angeles 14
# Display overall percent 
kable(overall_delayed) |>
  kable_styling("striped")
Airline Overall_Percent_Delayed
ALASKA 13
AMWEST 11

Alaska airline overall delayed about 13% of it’s flights and AM West overall delayed about 11% of its flights. The highest percent of flights delayed by Alaska were to San Francisco at 17% and for AM West it was also San Francisco at 29%.

Create a bar chart showing the ratios of on-time and delayed for each location.

ggplot(percent_delayed, aes(x=Airport, y=Percent_Delayed, fill= Airline)) + 
    geom_bar(stat="identity" , position=position_dodge())

The bar chart makes it clear that across all destinations, AM West had a higher percent of delayed flights than Alaska did. This is an interesting observation because overall - Alaska had a greater percentage of delays than AM West - but when you look at each destination individually you can see that AM West had a greater percentage of delays by each destination.

Conclusion

In conclusion, I read in a csv file that had the flight delays for two airlines in a wide format. Then I used tidyr and dplyr to tidy and transform the data. Lastly, I performed an analysis to compare the arrival delays for the two airlines.