Rathish Parayil Sasidharan

Overview

The chart above describes arrival delays for two airlines across five destinations.

Your task is to: (1) Create a .CSV file that includes all of the information above.
You’re encouraged to use a “wide” structure similar to how the information appears above,   so that you can practice tidying and transformations as described below.
(2) Read the information from your .CSV file into R, and use tidyr and dplyr as   needed to tidy and transform your data. 
(3) Perform analysis to compare the arrival delays for the two airlines.  (4) Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data  cleanup work, analysis, and conclusions.
Please include in your homework submission:
The URL to the .Rmd file in your GitHub repository. and The URL for your rpubs.com web page

Prerequisites

Load required packages

library(tidyr)
library(dplyr)
library(ggplot2)

Read the airline data

airlinesUntidyDf<-read.csv("https://raw.githubusercontent.com/rathish-ps/Data607-Assignment/main/airlines.csv")
## Warning in read.table(file = file, header = header, sep = sep, quote
## = quote, : incomplete final line found by readTableHeader on 'https://
## raw.githubusercontent.com/rathish-ps/Data607-Assignment/main/airlines.csv'
airlinesUntidyDf
##   Airline Arrival.Status Los.Angeles Phoenix San.Diego San.Fransisco Seattle
## 1  ALASKA        on time         497     221       212           503    1841
## 2  ALASKA        delayed          62      12        20           102     305
## 3 AM WEST        on time         694    4840       383           320     201
## 4 AM WEST        delayed         117     415        65           129      61

Clean up the data

Transform the data from wide format to long format

#Move the destination columns (3 to 7) from the untidy data frame  into single destination column
#wide data into longer format and then widen the Arrival Status into separate columns

airlinesDf <- gather(airlinesUntidyDf,Destination, delays, 3:7) %>%spread(2,4)
airlinesDf
##    Airline   Destination delayed on time
## 1   ALASKA   Los.Angeles      62     497
## 2   ALASKA       Phoenix      12     221
## 3   ALASKA     San.Diego      20     212
## 4   ALASKA San.Fransisco     102     503
## 5   ALASKA       Seattle     305    1841
## 6  AM WEST   Los.Angeles     117     694
## 7  AM WEST       Phoenix     415    4840
## 8  AM WEST     San.Diego      65     383
## 9  AM WEST San.Fransisco     129     320
## 10 AM WEST       Seattle      61     201

Data transformation

Add additional columns for total , percentages for On time and delays

#Add additional columns for total flights , on time and delayed percentages 
airlinesTfDf <- airlinesDf %>% 
        mutate(total = delayed + `on time`,prcntOnTime = round(100 * `on time` / total, 2), prcntDelayed = round(delayed / total * 100, 2)) 


airlinesTfDf
##    Airline   Destination delayed on time total prcntOnTime prcntDelayed
## 1   ALASKA   Los.Angeles      62     497   559       88.91        11.09
## 2   ALASKA       Phoenix      12     221   233       94.85         5.15
## 3   ALASKA     San.Diego      20     212   232       91.38         8.62
## 4   ALASKA San.Fransisco     102     503   605       83.14        16.86
## 5   ALASKA       Seattle     305    1841  2146       85.79        14.21
## 6  AM WEST   Los.Angeles     117     694   811       85.57        14.43
## 7  AM WEST       Phoenix     415    4840  5255       92.10         7.90
## 8  AM WEST     San.Diego      65     383   448       85.49        14.51
## 9  AM WEST San.Fransisco     129     320   449       71.27        28.73
## 10 AM WEST       Seattle      61     201   262       76.72        23.28
#summary(airlinesTfDf)

#Create a summary record for each airline
 airlineSummaryDf <- airlinesTfDf %>% 
  group_by(Airline) %>% 
  summarise(totalDelayed = sum(delayed), totalOnTime = sum(`on time`), totalFlights = sum(total)) 
 
# Add additional columns for percentage - on-time and delayed flights 
 airlineSummaryDf<- airlineSummaryDf %>%  mutate(percentOntime = round(totalOnTime/totalFlights * 100, 2),percentDelayed = round(totalDelayed/totalFlights * 100, 2))
airlineSummaryDf
## # A tibble: 2 x 6
##   Airline totalDelayed totalOnTime totalFlights percentOntime percentDelayed
##   <chr>          <int>       <int>        <int>         <dbl>          <dbl>
## 1 ALASKA           501        3274         3775          86.7           13.3
## 2 AM WEST          787        6438         7225          89.1           10.9

Data visualisation

Bar plot for each destination by Airline

#bar plot for ontime flights
plt <- ggplot(airlinesTfDf, aes(y=prcntOnTime , x = Airline, color = Airline, fill = Airline)) + 
    geom_bar( stat = "identity") +
    geom_text(aes(label = prcntOnTime ), vjust = -.35) +
     ylim(0, 95) +
    facet_wrap(~Destination) +
    ylab("Percentage - On time Flights") +
    ggtitle("Percentage of On time Flights / Airline / Destination")
    
plt

# bar plot for delayed flights
plt1 <- ggplot(airlinesTfDf, aes(y=prcntDelayed, x = Airline, color = Airline, fill = Airline)) + 
    geom_bar( stat = "identity") +
    geom_text(aes(label = prcntDelayed), vjust = -.35) +
    ylim(0, 30) +
    facet_wrap(~Destination) +
    ylab("Percentage - Delayed Flights") +
    ggtitle("Percentage of Delayed Flights / Airline / Destination")
    
plt1

Summary

It is clear from the summary data that ‘ALASKA’ airline has more percentage delay than ‘AM WEST’ airline ,even though ‘AM WEST’ has more percentage delay for each destination. This is mostly because of the number of flights operated by each airline to different destinations are different.