Read Data

The table is stored in a csv on my github accounts. Using “read_csv” to read data into a dataframe.

Narrow Dataset

It will be difficult to extract the flight counts when their stored in different columns. Here I use the gather function to narrow the data set and store all of the flight counts in one column.

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
flights_narrow <- flights %>%
  gather(City, Flights, 3:7)

head(flights_narrow)
##   Airline  Status        City Flights
## 1  ALASKA on time Los.Angeles     497
## 2  ALASKA delayed Los.Angeles      62
## 3 AM WEST on time Los.Angeles     694
## 4 AM WEST delayed Los.Angeles     117
## 5  ALASKA on time     Pheonix     221
## 6  ALASKA delayed     Pheonix      12

Summarise

To discover which airline is more punctional we use a couple functions from the dplyr package. Namely mutate, to single out on time flights, and summarize to calculate the on time rate.

flights_narrow %>%
  mutate(on_time = ifelse(Status=='on time', Flights, 0)) %>%
  group_by(Airline) %>%
  summarise(
    on_time_rate = sum(on_time) / sum(Flights)
  )
## # A tibble: 2 x 2
##   Airline on_time_rate
## * <chr>          <dbl>
## 1 ALASKA         0.867
## 2 AM WEST        0.891

Here we can see AM WEST is slightly better than Alaska w/ being on time. Let’s investigate further and see if there are any stand out cities.

flights_narrow %>%
  mutate(on_time = ifelse(Status=='on time', Flights, 0)) %>%
  group_by(City) %>%
  summarise(
    on_time_rate = sum(on_time) / sum(Flights)
  )
## # A tibble: 5 x 2
##   City          on_time_rate
## * <chr>                <dbl>
## 1 Los.Angeles          0.869
## 2 Pheonix              0.922
## 3 San.Deigo            0.875
## 4 San.Francisco        0.781
## 5 Seattle              0.848

Interesting. San Francisco performs the worst in terms of on time flights.

flights_narrow %>%
  mutate(sf_flights = ifelse(City=='San.Francisco', Flights, 0)) %>%
  group_by(Airline) %>%
  summarise(
    sf_flights = sum(sf_flights) / sum(Flights)
  )
## # A tibble: 2 x 2
##   Airline sf_flights
## * <chr>        <dbl>
## 1 ALASKA      0.160 
## 2 AM WEST     0.0621

A larger share of Alaska’s flights are flying into SF. This could be the reason it has a lower proportion of flights being delayed. Not because it’s an inferior Airline.