Initial Loading

After loading in the necessary packages, I proceed to read the csv file within Rstudios into the delays_table vector.

delays_table <- read_csv("C:/Users/aldai/Documents/DATA 607/DATA607 Week 4/607Week4.csv")
delays_table
## # A tibble: 5 × 7
##   ...1    ...2    `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 ALASKA  on time           497     221         212             503    1841
## 2 <NA>    delayed            62      12          20             102     305
## 3 <NA>    <NA>               NA      NA          NA              NA      NA
## 4 AM WEST on time           694    4840         383             320     201
## 5 <NA>    delayed           117     415          65             129      61

Cleaning the Data in R

Using pipes and an alternate way of for-looping, the various missing data values, along with the obscurity of the column/row names or lack of is what makes this data table messy. The first subsection/vector assigns some needed columns to the data such as “Destination” for the places of the flights as well as “Flights” as the total count of flights that have either made it “on time” or “delayed”.

The original excel data assumes the names of the airlines and status columns as ‘…1’ and ‘…2’ so that gets renamed to more appropriately distinguish. The “fill” vector will simply assign the following airlines to their respective NA values as there are delayed rows without the airline name.

Lastly, I filter out any remaining NA values assuming its only missing both the flights count and the status, meaning that data was left behind after the cleanup.

flights_clean <- delays_table %>%
  pivot_longer(
    cols = c("Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle"),
    names_to = "Destination",
    values_to = "Flights"
  ) %>%
  rename(Airline = `...1`, Status = `...2`) %>%
  fill(Airline, .direction = "down") %>%
  filter(!is.na(Status), !is.na(Flights))
flights_clean
## # A tibble: 20 × 4
##    Airline Status  Destination   Flights
##    <chr>   <chr>   <chr>           <dbl>
##  1 ALASKA  on time Los Angeles       497
##  2 ALASKA  on time Phoenix           221
##  3 ALASKA  on time San Diego         212
##  4 ALASKA  on time San Francisco     503
##  5 ALASKA  on time Seattle          1841
##  6 ALASKA  delayed Los Angeles        62
##  7 ALASKA  delayed Phoenix            12
##  8 ALASKA  delayed San Diego          20
##  9 ALASKA  delayed San Francisco     102
## 10 ALASKA  delayed Seattle           305
## 11 AM WEST on time Los Angeles       694
## 12 AM WEST on time Phoenix          4840
## 13 AM WEST on time San Diego         383
## 14 AM WEST on time San Francisco     320
## 15 AM WEST on time Seattle           201
## 16 AM WEST delayed Los Angeles       117
## 17 AM WEST delayed Phoenix           415
## 18 AM WEST delayed San Diego          65
## 19 AM WEST delayed San Francisco     129
## 20 AM WEST delayed Seattle            61

Analysis and Visualization

The following code is used to summerize the table in terms of the delayed flights followed by the total flights of each airline. Additionally, I provide visualization using a bar graph to show the differences in the ratio of delays.

Based on the summary, Alaska had airlines had a total of 501 delayed flights while America West Airlines had 787 delayed flights, which is more. However, if we look into the ratios of the total delayed flights, Alaska actually has 13.27% delayed ratio while America West Airline has 10.89%. The visualization of this data in the form of a bar graph also proves this point, complimenting the differences between the two airlines.

flights_clean %>%
  group_by(Airline) %>%
  summarise(
    delayed_flights = sum(Flights[Status == "delayed"]),
    total_flights   = sum(Flights),
    ratio_delayed   = delayed_flights / total_flights
  )
## # A tibble: 2 × 4
##   Airline delayed_flights total_flights ratio_delayed
##   <chr>             <dbl>         <dbl>         <dbl>
## 1 ALASKA              501          3775         0.133
## 2 AM WEST             787          7225         0.109
delay_summary <- flights_clean %>%
  group_by(Airline) %>%
  summarise(
    delayed_flights = sum(Flights[Status == "delayed"]),
    total_flights   = sum(Flights),
    ratio_delayed   = delayed_flights / total_flights
  )
ggplot(delay_summary, aes(x = Airline, y = ratio_delayed, fill = Airline)) +
  geom_col()

Conclusion

After changing the format to better display the data and to paint a clearer picture for my self, I use the following data to compare the total flights, delays, and on times. This exchange lead me to calcuate the ratios of the flights to determine which airline is more likely to be delayed. Based on my calculations and summary, America West Airlines is less likely to delay flights with flights being 10.89% delayed, in comparison to Alaska Airlines’ flights being 13.27% delayed.