The chart above describes arrival delays for two airlines across five destinations. Your task is to:
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform analysis to compare the arrival delays for the two airlines.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions
#reading .csv file
airlines = read.csv('airlines.csv', header = TRUE, sep = ",")
#eliminating empty row
airlines=airlines[!(rowSums(is.na(airlines))),]
#Populating empty airline cells
airlines[2,1] = "ALASKA"
airlines[4,1] = "AM WEST"
#tidying things up
airlines1 = gather(airlines, city, freq, Los.Angeles:Seattle)
airlines2 = spread(airlines1, X.1, freq)
names(airlines2)[1] = 'airlines'
tbl_df(airlines2)
## # A tibble: 10 x 4
## airlines city delayed on time
## * <fctr> <chr> <int> <int>
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San.Diego 65 383
## 9 AM WEST San.Francisco 129 320
## 10 AM WEST Seattle 61 201
#Calculating and adding a column with delay proportions per destination
finaldata = mutate(airlines2, delayratio = round(delayed/(delayed+`on time`),digits=2))
tbl_df(finaldata)
## # A tibble: 10 x 5
## airlines city delayed on time delayratio
## <fctr> <chr> <int> <int> <dbl>
## 1 ALASKA Los.Angeles 62 497 0.11
## 2 ALASKA Phoenix 12 221 0.05
## 3 ALASKA San.Diego 20 212 0.09
## 4 ALASKA San.Francisco 102 503 0.17
## 5 ALASKA Seattle 305 1841 0.14
## 6 AM WEST Los.Angeles 117 694 0.14
## 7 AM WEST Phoenix 415 4840 0.08
## 8 AM WEST San.Diego 65 383 0.15
## 9 AM WEST San.Francisco 129 320 0.29
## 10 AM WEST Seattle 61 201 0.23
#Calculating total delay proportions per airline
finaldata %>% group_by(airlines) %>% summarise(Totaldelayed = sum(delayed), Totalontime=sum(`on time`)) %>% mutate(TdelayRatio = round(Totaldelayed/(Totaldelayed+Totalontime),digits=2))
## # A tibble: 2 x 4
## airlines Totaldelayed Totalontime TdelayRatio
## <fctr> <int> <int> <dbl>
## 1 ALASKA 501 3274 0.13
## 2 AM WEST 787 6438 0.11
p=ggplot(finaldata,aes(x=city,y=delayratio,fill=airlines))+geom_bar(stat="identity",position="dodge")
ggplotly(p)
If we compare the ratio of the total number of delayed flights, we can see that AM WEST has 11% of delays vs 13% by ALASKA airline. But if we take a look at a plot of individual locations, we see that ALASKA actually has a lower delay for each of the destinations. It is puzzling at first but if we take a look at Pheonix destination, we see that it actually makes up for over 75% (4840/6438) of all their flights and has a ratio of 0.08 which is below their average ratio of 0.11. In fact other destinations have much higher delay ratios then then the average 0.11. So, it appears that the Phoenex location skews the average and pushes overal number below the average delays of ALASKA airlines. The wikipida page (https://en.wikipedia.org/wiki/America_West_Airlines) mentions that Phoenex is the main hub for AM West and that potentially explains the relitevely low delay rate and the relatively high number of overal flights.
So, in conclusion to make a decision based on the data we have, we can conclude that ALASKA airlines is a better choice for traveling to the above mentioned destinations since ALASKA airlines has lower delay ratios for every single destination.