The chart above describes arrival delays for two airlines across five destinations.
Your task is to: (1) Create a .CSV file that includes all of the information above.
You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
(2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
(3) Perform analysis to compare the arrival delays for the two airlines. (4) Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
Please include in your homework submission:
The URL to the .Rmd file in your GitHub repository. and The URL for your rpubs.com web page
Load required packages
library(tidyr)
library(dplyr)
library(ggplot2)
airlinesUntidyDf<-read.csv("https://raw.githubusercontent.com/rathish-ps/Data607-Assignment/main/airlines.csv")
## Warning in read.table(file = file, header = header, sep = sep, quote
## = quote, : incomplete final line found by readTableHeader on 'https://
## raw.githubusercontent.com/rathish-ps/Data607-Assignment/main/airlines.csv'
airlinesUntidyDf
## Airline Arrival.Status Los.Angeles Phoenix San.Diego San.Fransisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
Transform the data from wide format to long format
#Move the destination columns (3 to 7) from the untidy data frame into single destination column
#wide data into longer format and then widen the Arrival Status into separate columns
airlinesDf <- gather(airlinesUntidyDf,Destination, delays, 3:7) %>%spread(2,4)
airlinesDf
## Airline Destination delayed on time
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Fransisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San.Diego 65 383
## 9 AM WEST San.Fransisco 129 320
## 10 AM WEST Seattle 61 201
Add additional columns for total , percentages for On time and delays
#Add additional columns for total flights , on time and delayed percentages
airlinesTfDf <- airlinesDf %>%
mutate(total = delayed + `on time`,prcntOnTime = round(100 * `on time` / total, 2), prcntDelayed = round(delayed / total * 100, 2))
airlinesTfDf
## Airline Destination delayed on time total prcntOnTime prcntDelayed
## 1 ALASKA Los.Angeles 62 497 559 88.91 11.09
## 2 ALASKA Phoenix 12 221 233 94.85 5.15
## 3 ALASKA San.Diego 20 212 232 91.38 8.62
## 4 ALASKA San.Fransisco 102 503 605 83.14 16.86
## 5 ALASKA Seattle 305 1841 2146 85.79 14.21
## 6 AM WEST Los.Angeles 117 694 811 85.57 14.43
## 7 AM WEST Phoenix 415 4840 5255 92.10 7.90
## 8 AM WEST San.Diego 65 383 448 85.49 14.51
## 9 AM WEST San.Fransisco 129 320 449 71.27 28.73
## 10 AM WEST Seattle 61 201 262 76.72 23.28
#summary(airlinesTfDf)
#Create a summary record for each airline
airlineSummaryDf <- airlinesTfDf %>%
group_by(Airline) %>%
summarise(totalDelayed = sum(delayed), totalOnTime = sum(`on time`), totalFlights = sum(total))
# Add additional columns for percentage - on-time and delayed flights
airlineSummaryDf<- airlineSummaryDf %>% mutate(percentOntime = round(totalOnTime/totalFlights * 100, 2),percentDelayed = round(totalDelayed/totalFlights * 100, 2))
airlineSummaryDf
## # A tibble: 2 x 6
## Airline totalDelayed totalOnTime totalFlights percentOntime percentDelayed
## <chr> <int> <int> <int> <dbl> <dbl>
## 1 ALASKA 501 3274 3775 86.7 13.3
## 2 AM WEST 787 6438 7225 89.1 10.9
Bar plot for each destination by Airline
#bar plot for ontime flights
plt <- ggplot(airlinesTfDf, aes(y=prcntOnTime , x = Airline, color = Airline, fill = Airline)) +
geom_bar( stat = "identity") +
geom_text(aes(label = prcntOnTime ), vjust = -.35) +
ylim(0, 95) +
facet_wrap(~Destination) +
ylab("Percentage - On time Flights") +
ggtitle("Percentage of On time Flights / Airline / Destination")
plt
# bar plot for delayed flights
plt1 <- ggplot(airlinesTfDf, aes(y=prcntDelayed, x = Airline, color = Airline, fill = Airline)) +
geom_bar( stat = "identity") +
geom_text(aes(label = prcntDelayed), vjust = -.35) +
ylim(0, 30) +
facet_wrap(~Destination) +
ylab("Percentage - Delayed Flights") +
ggtitle("Percentage of Delayed Flights / Airline / Destination")
plt1
It is clear from the summary data that ‘ALASKA’ airline has more percentage delay than ‘AM WEST’ airline ,even though ‘AM WEST’ has more percentage delay for each destination. This is mostly because of the number of flights operated by each airline to different destinations are different.