Approach

My approach for Assignment 5A Air delays is to create the data frame based on the table in the pdf attachment as a CSV file. Upload that csv file then code then display it and perform count analyst. Checking for which airline and in which cities experienced the most delays and the most on times. After exploring more of the data I may also bring up any other interesting counts the I notice. Then transform the data from wide to long. The end I will show the percentage of on time and arrivals for each city split based on airlines.

Challenges

I don’t expected the recreation of the file from the table will be too challenging but I will know when I attempted to recreate the same results. Transforming the data from wide to long is new to me but I do not expect that to be too difficult either. From what I’ve read I just need to use the pivot_longer function in tidyr to transform it. The last thing I expect may be tricky is to accurate calculate and interpreted the percentage ratios of the airline flight arrival and delay time. I worry of not accurately speaking correctly about what the data says and incorrectly calculating what I am trying to prove. Trying to avoid this it is best to take time and go over what I have written and getting a seconded opinion.

Transform and clean

airline_data <- read.csv("C:/Users/typem/Documents/GitHub/Data607_Assignment5A/airline_data.csv")
airline_data
##   Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4 AM WEST delayed         117     415        65           129      61
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.4.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)
## Warning: package 'stringr' was built under R version 4.4.3
flight_wide <-na.omit(airline_data) %>% 
rename(AirLine = Airline, Arrival = Status) %>%
  mutate(AirLine = na_if(AirLine, '')) %>%
  fill(AirLine, .direction = 'down') %>%
  gather('Location', 'Flight.Cnt', 3:7) %>%
  mutate(Flight.Cnt = as.integer(str_replace(Flight.Cnt, ',', '')),
         Arrival = str_replace(Arrival, 'on time', 'On.Time'),
         Arrival = str_replace(Arrival, 'delayed', 'Delayed')) %>%
  spread(Arrival, Flight.Cnt)
 
flight_wide
##    AirLine      Location Delayed On.Time
## 1   ALASKA   Los.Angeles      62     497
## 2   ALASKA       Phoenix      12     221
## 3   ALASKA     San.Diego      20     212
## 4   ALASKA San.Francisco     102     503
## 5   ALASKA       Seattle     305    1841
## 6  AM WEST   Los.Angeles     117     694
## 7  AM WEST       Phoenix     415    4840
## 8  AM WEST     San.Diego      65     383
## 9  AM WEST San.Francisco     129     320
## 10 AM WEST       Seattle      61     201

Analysis

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.3
library(scales)
## Warning: package 'scales' was built under R version 4.4.3

Summary & Comparison by Airline

AirLine_summary <- flight_wide %>%
  group_by(AirLine) %>%
  summarise(On.Time = sum(On.Time), Delayed = sum(Delayed)) %>%
  mutate(Total = On.Time + Delayed,
         Delay.Rate = Delayed / Total,
         On.Time.Rate = On.Time / Total)
AirLine_summary
## # A tibble: 2 × 6
##   AirLine On.Time Delayed Total Delay.Rate On.Time.Rate
##   <chr>     <int>   <int> <int>      <dbl>        <dbl>
## 1 ALASKA     3274     501  3775      0.133        0.867
## 2 AM WEST    6438     787  7225      0.109        0.891
countOT1 <- ggplot(data = AirLine_summary, aes(x = AirLine, y = On.Time, fill = AirLine)) +
  geom_bar(stat = 'identity') +
  geom_text(aes(label = On.Time), vjust = -0.5) +
  ggtitle('On Time by Airline') +
  xlab('Airline') + ylab('On Time')

countDel1 <- ggplot(data = AirLine_summary, aes(x = AirLine, y = Delayed, fill = AirLine)) +
  geom_bar(stat = 'identity') +
  geom_text(aes(label = Delayed), vjust = -0.5) +
  ggtitle('Delay by Airline') +
  xlab('Airline') + ylab('Delayed')

countOT1

countDel1

The counts of On time flights is significantly higher than delayed by a couple of thousand. AM West seems to have more delays but also more on times since AM West has more total flight over all than Alaska we need to calculate and run the percentage on arrivals and delays.

RateOT <- ggplot(data = AirLine_summary, aes(x = AirLine, y = On.Time.Rate, fill = AirLine)) +
  geom_bar(stat = 'identity') +
  geom_text(aes(label = scales::percent(On.Time.Rate, accuracy = 0.01)), vjust = -0.5) +
  ggtitle('On Time by Airline') +
  xlab('Airline') + ylab('On Time')

RateDel <- ggplot(data = AirLine_summary, aes(x = AirLine, y = Delay.Rate, fill = AirLine)) +
  geom_bar(stat = 'identity') +
  geom_text(aes(label = scales::percent(Delay.Rate, accuracy = 0.01)), vjust = -0.5) +
  ggtitle('Delay by Airline') +
  xlab('Airline') + ylab('Delayed')

RateOT 

RateDel

Now with this percentage graph we can see the rates of both delayed and on time. Since these rates can tell us who is more delay or on time on average the totals should not skew our thoughts on which Airline is more reliable. We see that AM West the airline with the most delays actually performs better than Alaska because it is at a rate of about 11 percent while Alaska is at about 13 percent. Meaning that AM West has a lower rate of delays than Alaska and is more reliable to avoid delays.

Summary & Comparison by City

city_summary <- flight_wide %>%
  group_by(AirLine, Location) %>%
  summarise(
    On.Time = sum(On.Time),
    Delayed = sum(Delayed),
    Total = On.Time + Delayed,
    Delay.Rate = round(Delayed / Total, 4),
    On.Time.Rate = round(On.Time / Total, 4)
  ) %>%
  arrange(Location, AirLine)
## `summarise()` has grouped output by 'AirLine'. You can override using the
## `.groups` argument.
city_summary
## # A tibble: 10 × 7
## # Groups:   AirLine [2]
##    AirLine Location      On.Time Delayed Total Delay.Rate On.Time.Rate
##    <chr>   <chr>           <int>   <int> <int>      <dbl>        <dbl>
##  1 ALASKA  Los.Angeles       497      62   559     0.111         0.889
##  2 AM WEST Los.Angeles       694     117   811     0.144         0.856
##  3 ALASKA  Phoenix           221      12   233     0.0515        0.948
##  4 AM WEST Phoenix          4840     415  5255     0.079         0.921
##  5 ALASKA  San.Diego         212      20   232     0.0862        0.914
##  6 AM WEST San.Diego         383      65   448     0.145         0.855
##  7 ALASKA  San.Francisco     503     102   605     0.169         0.831
##  8 AM WEST San.Francisco     320     129   449     0.287         0.713
##  9 ALASKA  Seattle          1841     305  2146     0.142         0.858
## 10 AM WEST Seattle           201      61   262     0.233         0.767
countOT2 <- ggplot(data = city_summary, aes(x = AirLine, y = On.Time, fill = AirLine))+
  geom_bar(stat = 'identity')+
  facet_grid(~Location)+
  ggtitle('Total number of On Time by city')+
  geom_text(aes(label = On.Time), vjust = -0.5)+
    xlab('Airline') + ylab('On Time')

CountDel2 <- ggplot(data = city_summary, aes (x = AirLine, y = Delayed, fill = AirLine))+
  geom_bar(stat = 'identity')+
  facet_grid(~Location)+
  ggtitle('Total number of Delayed by city')+
  geom_text(aes(label = Delayed), vjust = -0.5)+
  xlab('AirLine') + ylab('On Time')

countOT2

CountDel2

AM West has the higher total number of on times compared to Alaska across 3 cites. AM West also has the most number of delays in four cities. These numbers are totals so they do not reflect rate but where these airlines are the most active in and how often they are delayed on on time. For example it is hard to compare their performance in Phoenix because Alaska has a total of 233 flights compared to AM West about 5,000 flights. We need rates.

RateOT2 <- ggplot(data = city_summary, aes(x = AirLine, y = On.Time.Rate, fill = AirLine)) +
  geom_bar(stat = 'identity') +
  facet_grid(~Location)+
  geom_text(aes(label = scales::percent(On.Time.Rate, accuracy = 0.01)), vjust = -0.5) +
  ggtitle('On Time by Airline') +
  xlab('Airline') + ylab('On Time')

RateDel2 <- ggplot(data = city_summary, aes(x = AirLine, y = Delay.Rate, fill = AirLine)) +
  geom_bar(stat = 'identity') +
  facet_grid(~Location)+
  geom_text(aes(label = scales::percent(Delay.Rate, accuracy = 0.01)), vjust = -0.5) +
  ggtitle('Delay by Airline') +
  xlab('Airline') + ylab('Delayed')

RateOT2

RateDel2

I am surprised to see the AM West does have a higher rate of delays across all cities compared to Alaska when the average across all cities was the opposite. Something notable that is a positive for AM West is that the location where they have the most flights, Phoenix has the lowest delay rate and could explain why AM West has a low delay rate on average. Nearly all cities where AM West has high delay rates it has a small amount of flights compared to Phoenix and Los Angeles. Alaska does actually have its higher delay rates in the cities that have the most flights and that is why its has a high flight delay overall.