This purpose of this assignment was to mimic the “wide” dataset provided by creating a .CSV file, reading it into R, clean up the data then perform analysis.
library(tidyr)
library(dplyr)
library(reshape2)
library(ggplot2)
library(knitr)
library(kableExtra)
#read file into R
fileloc = "https://raw.githubusercontent.com/javernw/JWCUNYAssignments/master/flights.csv"
flights_df <- read.csv(fileloc, sep = ",", stringsAsFactors = F)
flights_df %>%
kable() %>%
kable_styling()
ï.. | X | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle |
---|---|---|---|---|---|---|
ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
ALASKA | delayed | 62 | 12 | 20 | 102 | 305 |
NA | NA | NA | NA | NA | ||
AMWEST | on time | 694 | 4840 | 383 | 320 | 201 |
AMWEST | delayed | 117 | 415 | 65 | 129 | 61 |
#rename first 2 columns
flights_df <- rename(flights_df, "Airline" = `ï..`, "Status" = `X`)
flights_df%>%
kable() %>%
kable_styling()
Airline | Status | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle |
---|---|---|---|---|---|---|
ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
ALASKA | delayed | 62 | 12 | 20 | 102 | 305 |
NA | NA | NA | NA | NA | ||
AMWEST | on time | 694 | 4840 | 383 | 320 | 201 |
AMWEST | delayed | 117 | 415 | 65 | 129 | 61 |
tidyflights_df <- flights_df %>%
gather("City", "Count", 3:7, na.rm = T) %>% #pushes data in column into rows
spread(Status, Count) %>%
rename("Delayed" = `delayed`, "OnTime" = `on time`)
tidyflights_df %>%
kable() %>%
kable_styling()
Airline | City | Delayed | OnTime |
---|---|---|---|
ALASKA | Los.Angeles | 62 | 497 |
ALASKA | Phoenix | 12 | 221 |
ALASKA | San.Diego | 20 | 212 |
ALASKA | San.Francisco | 102 | 503 |
ALASKA | Seattle | 305 | 1841 |
AMWEST | Los.Angeles | 117 | 694 |
AMWEST | Phoenix | 415 | 4840 |
AMWEST | San.Diego | 65 | 383 |
AMWEST | San.Francisco | 129 | 320 |
AMWEST | Seattle | 61 | 201 |
tidyflights_df <- tidyflights_df %>% mutate(Ratio = round((`OnTime` / `Delayed`),2), "%Success" = round(`OnTime` / (`Delayed` + `OnTime`), 2))
## Warning: package 'bindrcpp' was built under R version 3.5.2
tidyflights_df %>%
kable() %>%
kable_styling()
Airline | City | Delayed | OnTime | Ratio | %Success |
---|---|---|---|---|---|
ALASKA | Los.Angeles | 62 | 497 | 8.02 | 0.89 |
ALASKA | Phoenix | 12 | 221 | 18.42 | 0.95 |
ALASKA | San.Diego | 20 | 212 | 10.60 | 0.91 |
ALASKA | San.Francisco | 102 | 503 | 4.93 | 0.83 |
ALASKA | Seattle | 305 | 1841 | 6.04 | 0.86 |
AMWEST | Los.Angeles | 117 | 694 | 5.93 | 0.86 |
AMWEST | Phoenix | 415 | 4840 | 11.66 | 0.92 |
AMWEST | San.Diego | 65 | 383 | 5.89 | 0.85 |
AMWEST | San.Francisco | 129 | 320 | 2.48 | 0.71 |
AMWEST | Seattle | 61 | 201 | 3.30 | 0.77 |
Ratio -> Ontime:Delayed. Let’s look at ALASKA Airline in Phoenix. For every 18
flights that is on time, there is 1
that’s delayed which is very good. Based on the success rate, Alaska airline has a 95%
record of being on time in Phoenix that is.
# Delays in each city by the two airlines
tidyflights_df %>%
arrange(City) %>%
select(Airline, City, Delayed) %>%
group_by(Airline) %>%
spread(Airline, Delayed) %>%
kable() %>%
kable_styling() %>% add_header_above(c(" ", "Delays"=2))
City | ALASKA | AMWEST |
---|---|---|
Los.Angeles | 62 | 117 |
Phoenix | 12 | 415 |
San.Diego | 20 | 65 |
San.Francisco | 102 | 129 |
Seattle | 305 | 61 |
# % Delay for Alaska and Amwest
tidyflights_df %>%
group_by(Airline) %>%
summarise(DelayRate = sum(`Delayed`) / (sum(`Delayed`) + sum(`OnTime`))) %>%
spread(Airline, DelayRate) %>%
kable() %>%
kable_styling()
ALASKA | AMWEST |
---|---|
0.1327152 | 0.1089273 |
AMWEST has higher count in delays on a whole but lower delay rate.
alaska <- filter(tidyflights_df, Airline == "ALASKA") #alaska df
summary(alaska$`%Success`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.830 0.860 0.890 0.888 0.910 0.950
amwest <- filter(tidyflights_df, Airline == "AMWEST") #amwest df
summary(amwest$`%Success`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.710 0.770 0.850 0.822 0.860 0.920
Summary of success (on time) for each flight. Alaska shows better progress.
# line graph showing the delay ratios between alaska and amwest
ggplot(tidyflights_df, aes(x=City, y=Ratio, group = Airline)) + geom_line(aes(color=Airline)) +geom_point(aes(color=Airline))
Most delays occured in San Francisco with both airlines. A few factors can contribute to this such as weather or aviation system delay at that city’s airport.
Which Airline is better?
In the analysis, ALASKA
has better on time arrival rates with an average of 88%
, while AMWEST
shows a lower delay rate than ALASKA
; 11%
and 13%
respectively. AMWEST
did more flights than ALASKA
which means more chances of delays but as we know, when the population size increase, there is little difference in the outcome. However, more flights does not mean it’s better but can be that it has cheaper flights or more accomodating than the competition. It’s hard to tell which flight is better but with other factors remaining constant, AMWEST
is better just because they have a lower delay rate and this assignment is focused on the analysis of the arrival delays between the two flights.