Data607-Assignment2

Overview

The purpose of the project is the following:
1. Read in flight data
2. Tidy the data
3. Compare flight delays across the airlines: AM WEST and ALASKA

As stated by Hadley Wickham on https://r4ds.had.co.nz/tidy-data.html, we’re going to make this data tidy by enforcing the following rules:

There are three interrelated rules which make a dataset tidy:

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

Necessary libraries:

tidyverse
dplyr
ggplot2
kableExtra

Retrieve Data

We’re retrieivng the data in the form of a csv from GitHub. The data is in an untidy format with several empty lines, columns missing names, rows with multiple observations, etc.

flights <- read_csv("https://raw.githubusercontent.com/devinteran/Data607-Assignment2/master/Flights.csv")

## Warning: Missing column names filled in: 'X1' [1], 'X2' [2]

## Parsed with column specification:
## cols(
##   X1 = col_character(),
##   X2 = col_character(),
##   `Los Angeles` = col_double(),
##   Phoenix = col_double(),
##   `San Diego` = col_double(),
##   `San Francisco` = col_double(),
##   Seattle = col_double()
## )

kable(flights) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

X1	X2	Los Angeles	Phoenix	San Diego	San Francisco	Seattle
NA	NA	NA	NA	NA	NA	NA
ALASKA	on time	497	221	212	503	1841
NA	delayed	62	12	20	102	305
NA	NA	NA	NA	NA	NA	NA
AM WEST	on time	694	4840	383	320	201
NA	delayed	117	415	65	129	61

Cleaning Data

First I will remove empty rows of NA and add airline names where they are implied.

#Remove empty row 2 and 5
flights <- flights[-c(1,4),]

#Add data values which are currently missing but are implied
flights[2,1] <- "ALASKA"
flights[4,1] <- "AM WEST"

#Rename columns
colnames(flights) <- c("Airline","Status","Los Angeles","Phoenix","San Diego","San Francisco","Seattle")

Reshape the Data

Wide to Long

Right now the data is very wide. We’re going to perform operations to make each city into an individual row.

Airline	Status	Los Angeles	Phoenix	San Diego	San Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
ALASKA	delayed	62	12	20	102	305
AM WEST	on time	694	4840	383	320	201
AM WEST	delayed	117	415	65	129	61

flights <- flights %>% pivot_longer(c("Los Angeles","Phoenix","San Diego","San Francisco","Seattle"),names_to = "City",values_to="Count")

Reducing Length

As you can see below, each city has two rows per airline. One for the on time status and another for the delayed status. It will be easier if on time and delayed counts had their own column.

Airline	Status	City	Count
ALASKA	on time	Los Angeles	497
ALASKA	delayed	Los Angeles	62
ALASKA	on time	Phoenix	221
ALASKA	delayed	Phoenix	12
ALASKA	on time	San Diego	212
ALASKA	delayed	San Diego	20

To do this will be adding two columns using the pivot_wider() function.

flights <- flights %>% pivot_wider(names_from="Status",values_from="Count")

Final Clean Data View

Here is our data in a clean format. Now we’re ready for analysis!

flights_per_city_per_airline <- flights %>% mutate(`Total Flights` = `on time` + delayed)

kable(flights_per_city_per_airline) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Airline	City	on time	delayed	Total Flights
ALASKA	Los Angeles	497	62	559
ALASKA	Phoenix	221	12	233
ALASKA	San Diego	212	20	232
ALASKA	San Francisco	503	102	605
ALASKA	Seattle	1841	305	2146
AM WEST	Los Angeles	694	117	811
AM WEST	Phoenix	4840	415	5255
AM WEST	San Diego	383	65	448
AM WEST	San Francisco	320	129	449
AM WEST	Seattle	201	61	262

Let’s look at total flights per airline per route

The most popular route for AM West is Phoenix and the most popular route for ALASKA is Seattle. In general, AM WEST looks to have more flights in most cities compared to ALASKA.

ggplot(flights_per_city_per_airline,aes(fill=Airline,x=reorder(City,`Total Flights`),y=`Total Flights`)) +
  geom_bar(position="dodge", stat="identity") +
  coord_flip()

What percent of flights per airline per route are delayed?

Intersting findings include:
* AM WEST has a lot of routes and it also has more delays compared to ALASKA.
* San Francisco has the most delays for both airlines.
* Phoenix has the most routes for AM WEST but the lowest % delays

flights_perct <- flights_per_city_per_airline %>% 
    mutate(`Percent Flights Delayed` = delayed/`Total Flights`,
           `Percent Flights On Time` = `on time` / `Total Flights`)

ggplot(flights_perct,aes(fill=Airline,x=reorder(City,-`Percent Flights Delayed`),y=`Percent Flights Delayed`)) + 
  geom_bar(position="dodge", stat="identity") +
  coord_flip()

Do flight routes with more flights have more or fewer delays?

The most popular route for AM West, Phoenix, has the lowest percent delays
AM West has more delays than ALASKA when comparing city to city routes

ggplot(flights_perct,aes(fill=City,x=`Total Flights`,label=City,y=`Percent Flights Delayed`,shape=City,color=Airline,size=4)) + 
  geom_point() +
  guides(size=FALSE)

Conclusion

It is clear that although the airline, AM WEST, has more flights, they are delayed more often than the airline ALASKA. The most common route for AM WEST to Phoenix has the lowest % delays. With more data from more airlines it would be interesting to see if the correlation of increased number of flights and fewer delays holds. I would recommend flying ALASKA if possible to avoid delays. It would be interesting to be able to bring cost into this analysis.