This purpose of this assignment was to mimic the “wide” dataset provided by creating a .CSV file, reading it into R, clean up the data then perform analysis.

library(tidyr)
library(dplyr)
library(reshape2)
library(ggplot2)
library(knitr)
library(kableExtra)

Untidy Data

#read file into R
fileloc = "https://raw.githubusercontent.com/javernw/JWCUNYAssignments/master/flights.csv"
flights_df <- read.csv(fileloc, sep = ",", stringsAsFactors = F)
flights_df %>%
  kable() %>%
  kable_styling()
ï.. X Los.Angeles Phoenix San.Diego San.Francisco Seattle
ALASKA on time 497 221 212 503 1841
ALASKA delayed 62 12 20 102 305
NA NA NA NA NA
AMWEST on time 694 4840 383 320 201
AMWEST delayed 117 415 65 129 61
#rename first 2 columns
flights_df <- rename(flights_df, "Airline" = `ï..`, "Status" = `X`)
flights_df%>%
  kable() %>%
  kable_styling()
Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
ALASKA on time 497 221 212 503 1841
ALASKA delayed 62 12 20 102 305
NA NA NA NA NA
AMWEST on time 694 4840 383 320 201
AMWEST delayed 117 415 65 129 61

Tidy Data

tidyflights_df <- flights_df %>% 
  gather("City", "Count", 3:7, na.rm = T) %>% #pushes data in column into rows
  spread(Status, Count) %>%  
  rename("Delayed" = `delayed`, "OnTime" = `on time`)  

tidyflights_df %>%
  kable() %>%
  kable_styling()
Airline City Delayed OnTime
ALASKA Los.Angeles 62 497
ALASKA Phoenix 12 221
ALASKA San.Diego 20 212
ALASKA San.Francisco 102 503
ALASKA Seattle 305 1841
AMWEST Los.Angeles 117 694
AMWEST Phoenix 415 4840
AMWEST San.Diego 65 383
AMWEST San.Francisco 129 320
AMWEST Seattle 61 201

Analysis

tidyflights_df <- tidyflights_df %>% mutate(Ratio = round((`OnTime` / `Delayed`),2), "%Success" = round(`OnTime` / (`Delayed` + `OnTime`), 2))
## Warning: package 'bindrcpp' was built under R version 3.5.2
tidyflights_df %>%
  kable() %>%
  kable_styling()
Airline City Delayed OnTime Ratio %Success
ALASKA Los.Angeles 62 497 8.02 0.89
ALASKA Phoenix 12 221 18.42 0.95
ALASKA San.Diego 20 212 10.60 0.91
ALASKA San.Francisco 102 503 4.93 0.83
ALASKA Seattle 305 1841 6.04 0.86
AMWEST Los.Angeles 117 694 5.93 0.86
AMWEST Phoenix 415 4840 11.66 0.92
AMWEST San.Diego 65 383 5.89 0.85
AMWEST San.Francisco 129 320 2.48 0.71
AMWEST Seattle 61 201 3.30 0.77

Ratio -> Ontime:Delayed. Let’s look at ALASKA Airline in Phoenix. For every 18 flights that is on time, there is 1 that’s delayed which is very good. Based on the success rate, Alaska airline has a 95% record of being on time in Phoenix that is.

# Delays in each city by the two airlines
tidyflights_df %>% 
  arrange(City) %>% 
  select(Airline, City, Delayed) %>% 
  group_by(Airline) %>% 
  spread(Airline, Delayed) %>%
  kable() %>%
  kable_styling() %>% add_header_above(c(" ", "Delays"=2))
Delays
City ALASKA AMWEST
Los.Angeles 62 117
Phoenix 12 415
San.Diego 20 65
San.Francisco 102 129
Seattle 305 61
# % Delay for Alaska and Amwest 
tidyflights_df %>% 
  group_by(Airline) %>% 
  summarise(DelayRate = sum(`Delayed`) / (sum(`Delayed`) + sum(`OnTime`))) %>%
  spread(Airline, DelayRate) %>%
  kable() %>%
  kable_styling()
ALASKA AMWEST
0.1327152 0.1089273

AMWEST has higher count in delays on a whole but lower delay rate.

alaska <- filter(tidyflights_df, Airline == "ALASKA") #alaska df
summary(alaska$`%Success`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.830   0.860   0.890   0.888   0.910   0.950
amwest <- filter(tidyflights_df, Airline == "AMWEST") #amwest df
summary(amwest$`%Success`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.710   0.770   0.850   0.822   0.860   0.920

Summary of success (on time) for each flight. Alaska shows better progress.

# line graph showing the delay ratios between alaska and amwest
ggplot(tidyflights_df, aes(x=City, y=Ratio, group = Airline)) + geom_line(aes(color=Airline)) +geom_point(aes(color=Airline))


Most delays occured in San Francisco with both airlines. A few factors can contribute to this such as weather or aviation system delay at that city’s airport.

Conclusion

Which Airline is better?

In the analysis, ALASKA has better on time arrival rates with an average of 88%, while AMWEST shows a lower delay rate than ALASKA; 11% and 13% respectively. AMWEST did more flights than ALASKA which means more chances of delays but as we know, when the population size increase, there is little difference in the outcome. However, more flights does not mean it’s better but can be that it has cheaper flights or more accomodating than the competition. It’s hard to tell which flight is better but with other factors remaining constant, AMWEST is better just because they have a lower delay rate and this assignment is focused on the analysis of the arrival delays between the two flights.