Assignment 5

Assignment - Tidying and Transforming Data

The chart above describes arrival delays for two airlines across five destinations. Your task is to:

Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform analysis to compare the arrival delays for the two airlines.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions

#reading .csv file
airlines = read.csv('airlines.csv', header = TRUE, sep = ",")

#eliminating empty row
airlines=airlines[!(rowSums(is.na(airlines))),]

#Populating empty airline cells
airlines[2,1] = "ALASKA"
airlines[4,1] = "AM WEST"

#tidying things up
airlines1 = gather(airlines, city, freq, Los.Angeles:Seattle)
airlines2 = spread(airlines1, X.1, freq)
names(airlines2)[1] = 'airlines'
tbl_df(airlines2)

## # A tibble: 10 x 4
##    airlines          city delayed on time
## *    <fctr>         <chr>   <int>   <int>
## 1    ALASKA   Los.Angeles      62     497
## 2    ALASKA       Phoenix      12     221
## 3    ALASKA     San.Diego      20     212
## 4    ALASKA San.Francisco     102     503
## 5    ALASKA       Seattle     305    1841
## 6   AM WEST   Los.Angeles     117     694
## 7   AM WEST       Phoenix     415    4840
## 8   AM WEST     San.Diego      65     383
## 9   AM WEST San.Francisco     129     320
## 10  AM WEST       Seattle      61     201

Calculating the ratio of delayed flights

#Calculating and adding a column with delay proportions per destination
finaldata = mutate(airlines2, delayratio = round(delayed/(delayed+`on time`),digits=2))
tbl_df(finaldata)

## # A tibble: 10 x 5
##    airlines          city delayed on time delayratio
##      <fctr>         <chr>   <int>   <int>      <dbl>
## 1    ALASKA   Los.Angeles      62     497       0.11
## 2    ALASKA       Phoenix      12     221       0.05
## 3    ALASKA     San.Diego      20     212       0.09
## 4    ALASKA San.Francisco     102     503       0.17
## 5    ALASKA       Seattle     305    1841       0.14
## 6   AM WEST   Los.Angeles     117     694       0.14
## 7   AM WEST       Phoenix     415    4840       0.08
## 8   AM WEST     San.Diego      65     383       0.15
## 9   AM WEST San.Francisco     129     320       0.29
## 10  AM WEST       Seattle      61     201       0.23

#Calculating total delay proportions per airline
finaldata %>% group_by(airlines) %>% summarise(Totaldelayed = sum(delayed), Totalontime=sum(`on time`)) %>% mutate(TdelayRatio = round(Totaldelayed/(Totaldelayed+Totalontime),digits=2))

## # A tibble: 2 x 4
##   airlines Totaldelayed Totalontime TdelayRatio
##     <fctr>        <int>       <int>       <dbl>
## 1   ALASKA          501        3274        0.13
## 2  AM WEST          787        6438        0.11

Plotting the delay ratios for both airlines

p=ggplot(finaldata,aes(x=city,y=delayratio,fill=airlines))+geom_bar(stat="identity",position="dodge")
ggplotly(p)

Conclusion

If we compare the ratio of the total number of delayed flights, we can see that AM WEST has 11% of delays vs 13% by ALASKA airline. But if we take a look at a plot of individual locations, we see that ALASKA actually has a lower delay for each of the destinations. It is puzzling at first but if we take a look at Pheonix destination, we see that it actually makes up for over 75% (4840/6438) of all their flights and has a ratio of 0.08 which is below their average ratio of 0.11. In fact other destinations have much higher delay ratios then then the average 0.11. So, it appears that the Phoenex location skews the average and pushes overal number below the average delays of ALASKA airlines. The wikipida page (https://en.wikipedia.org/wiki/America_West_Airlines) mentions that Phoenex is the main hub for AM West and that potentially explains the relitevely low delay rate and the relatively high number of overal flights.

So, in conclusion to make a decision based on the data we have, we can conclude that ALASKA airlines is a better choice for traveling to the above mentioned destinations since ALASKA airlines has lower delay ratios for every single destination.

Assignment 5

Joseph Elikishvili

October 2, 2016

Assignment - Tidying and Transforming Data

Calculating the ratio of delayed flights

Plotting the delay ratios for both airlines

Conclusion