Overview

The purpose of the project is the following:
1. Read in flight data
2. Tidy the data
3. Compare flight delays across the airlines: AM WEST and ALASKA

As stated by Hadley Wickham on https://r4ds.had.co.nz/tidy-data.html, we’re going to make this data tidy by enforcing the following rules:

There are three interrelated rules which make a dataset tidy:

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

Necessary libraries:

Retrieve Data

We’re retrieivng the data in the form of a csv from GitHub. The data is in an untidy format with several empty lines, columns missing names, rows with multiple observations, etc.

flights <- read_csv("https://raw.githubusercontent.com/devinteran/Data607-Assignment2/master/Flights.csv")
## Warning: Missing column names filled in: 'X1' [1], 'X2' [2]
## Parsed with column specification:
## cols(
##   X1 = col_character(),
##   X2 = col_character(),
##   `Los Angeles` = col_double(),
##   Phoenix = col_double(),
##   `San Diego` = col_double(),
##   `San Francisco` = col_double(),
##   Seattle = col_double()
## )
kable(flights) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
X1 X2 Los Angeles Phoenix San Diego San Francisco Seattle
NA NA NA NA NA NA NA
ALASKA on time 497 221 212 503 1841
NA delayed 62 12 20 102 305
NA NA NA NA NA NA NA
AM WEST on time 694 4840 383 320 201
NA delayed 117 415 65 129 61

Cleaning Data

First I will remove empty rows of NA and add airline names where they are implied.

#Remove empty row 2 and 5
flights <- flights[-c(1,4),]

#Add data values which are currently missing but are implied
flights[2,1] <- "ALASKA"
flights[4,1] <- "AM WEST"

#Rename columns
colnames(flights) <- c("Airline","Status","Los Angeles","Phoenix","San Diego","San Francisco","Seattle")

Reshape the Data

Wide to Long

Right now the data is very wide. We’re going to perform operations to make each city into an individual row.
Airline Status Los Angeles Phoenix San Diego San Francisco Seattle
ALASKA on time 497 221 212 503 1841
ALASKA delayed 62 12 20 102 305
AM WEST on time 694 4840 383 320 201
AM WEST delayed 117 415 65 129 61
flights <- flights %>% pivot_longer(c("Los Angeles","Phoenix","San Diego","San Francisco","Seattle"),names_to = "City",values_to="Count")

Reducing Length

As you can see below, each city has two rows per airline. One for the on time status and another for the delayed status. It will be easier if on time and delayed counts had their own column.
Airline Status City Count
ALASKA on time Los Angeles 497
ALASKA delayed Los Angeles 62
ALASKA on time Phoenix 221
ALASKA delayed Phoenix 12
ALASKA on time San Diego 212
ALASKA delayed San Diego 20

To do this will be adding two columns using the pivot_wider() function.

flights <- flights %>% pivot_wider(names_from="Status",values_from="Count")

Final Clean Data View

Here is our data in a clean format. Now we’re ready for analysis!

flights_per_city_per_airline <- flights %>% mutate(`Total Flights` = `on time` + delayed)

kable(flights_per_city_per_airline) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Airline City on time delayed Total Flights
ALASKA Los Angeles 497 62 559
ALASKA Phoenix 221 12 233
ALASKA San Diego 212 20 232
ALASKA San Francisco 503 102 605
ALASKA Seattle 1841 305 2146
AM WEST Los Angeles 694 117 811
AM WEST Phoenix 4840 415 5255
AM WEST San Diego 383 65 448
AM WEST San Francisco 320 129 449
AM WEST Seattle 201 61 262

Let’s look at total flights per airline per route

The most popular route for AM West is Phoenix and the most popular route for ALASKA is Seattle. In general, AM WEST looks to have more flights in most cities compared to ALASKA.

ggplot(flights_per_city_per_airline,aes(fill=Airline,x=reorder(City,`Total Flights`),y=`Total Flights`)) +
  geom_bar(position="dodge", stat="identity") +
  coord_flip()

What percent of flights per airline per route are delayed?

Intersting findings include:
* AM WEST has a lot of routes and it also has more delays compared to ALASKA.
* San Francisco has the most delays for both airlines.
* Phoenix has the most routes for AM WEST but the lowest % delays

flights_perct <- flights_per_city_per_airline %>% 
    mutate(`Percent Flights Delayed` = delayed/`Total Flights`,
           `Percent Flights On Time` = `on time` / `Total Flights`)

ggplot(flights_perct,aes(fill=Airline,x=reorder(City,-`Percent Flights Delayed`),y=`Percent Flights Delayed`)) + 
  geom_bar(position="dodge", stat="identity") +
  coord_flip()

Do flight routes with more flights have more or fewer delays?

ggplot(flights_perct,aes(fill=City,x=`Total Flights`,label=City,y=`Percent Flights Delayed`,shape=City,color=Airline,size=4)) + 
  geom_point() +
  guides(size=FALSE)

Conclusion

It is clear that although the airline, AM WEST, has more flights, they are delayed more often than the airline ALASKA. The most common route for AM WEST to Phoenix has the lowest % delays. With more data from more airlines it would be interesting to see if the correlation of increased number of flights and fewer delays holds. I would recommend flying ALASKA if possible to avoid delays. It would be interesting to be able to bring cost into this analysis.