Introduction

   A sample flight arrival status data given in raw format, which needs to be cleaned, transformed for analysis and data insights. The following R libraries are used for data tidying and transformation

  • RCurl - Fetch the data from github repository
  • dplyr - Grammar of data manipulation, set of tools for efficiently manipulating datasets in R
  • tidyr - To tidy messy data, it provides three main functions gather(), separate() and spread().
  • ggplot2 - Visualization package for creating graphs and charts
  • Input File & Tidying

  • The input raw *.csv file read from github repository using RCurl library
  • rename function - dplyr package (used to rename the Column Name of X as Airline and X.1 as ArrivalStatus)
  • fill function - tidyr package (Used to fill the missing value, with additional argument ‘as.direction’ to mention up or down)
  • na.exclude - Generic function useful for dealing with ‘NA’ values
  • All the listed above functions are used to clean up the data in to regular table format data usable for further data analysis

  • library(RCurl)
    library(dplyr)
    library(tidyr)
    
    arrival_status<-read.csv("https://raw.githubusercontent.com/thasleem1/DATA607/master/arrival_data.csv",na.strings ="") %>% 
      rename(Airline = X, ArrivalStatus = X.1) %>% 
        fill(Airline, .direction = "down") %>%
         na.exclude()
    arrival_status

    Data Transformation

       The following tidyr package function used
  • gather - collects a set of column names and places them into a single “key” column (City)
  • spread - reverse of gather, rows to column (ArrivalStatus)

  •    The following dplyr package function used
  • mutate - add new column (finding the ratio of delay time and on time flight status)
  • join - inner join to join the mutate (new columns) in to single data frame

  • arrival_status_gather <- arrival_status %>% gather("City","ArrivalCount", 3:7)
    arrival_status_spread <- spread(arrival_status_gather,ArrivalStatus,ArrivalCount) 
    
    arrival_status_mutate_delay <- mutate(arrival_status_spread,delayratio = delayed/(`on time`+delayed)*100) 
    arrival_status_mutate_arrival <- mutate(arrival_status_spread,ontimeratio = `on time`/(`on time`+delayed)*100) 
    
    filght_status <- inner_join(arrival_status_mutate_delay, arrival_status_mutate_arrival, by = c('Airline','City','delayed','on time'))
    filght_status

    Comparison Analysis

       Graphical representation of flight delay and on-arrival time ratio

    library(ggplot2)
    
    ggplot(data=filght_status, aes(x=City, y=delayratio, fill=Airline)) +
      geom_bar(stat="identity", position=position_dodge())+
      geom_text(aes(label=format(round(delayratio, 2), nsmall = 2)), vjust=1.6, color="white",
                position = position_dodge(0.9), size=3.5)+
      scale_fill_brewer(palette="Set1")  + ggtitle("Delay Ratio") + ylab("delay") +
      theme(plot.title = element_text(hjust = 0.5))

    ggplot(data=filght_status, aes(x=City, y=ontimeratio, fill=Airline)) +
      geom_bar(stat="identity", position=position_dodge())+
      geom_text(aes(label=format(round(ontimeratio, 2), nsmall = 2)), vjust=1.6, color="white",
                position = position_dodge(0.9), size=3.5)+
      scale_fill_brewer(palette="Dark2")  + ggtitle("On Time Ratio") + ylab("on time") +
      theme(plot.title = element_text(hjust = 0.5))

    Export File

       Export file of final data

    #Please change the directory accordingly to your file system 
    setwd("C:/Users/aisha/Dropbox/CUNY/Semester1/DATA607_Data_Acquisition_and_Management/Week5")
    write.csv(filght_status,"flight_arrival_data.csv")

    Conclusion

       Based on the data analysis, some observation are given below considering airlines and cities

    Alaska

  • On general, the delay time is less compared to AM west airlines
  • Alaska is pretty good on on_time reaching on 3 cities (Los Angeles, Phoenix and San Diego), they have more than 88.91% on arrival time
  • Alaska still have to improve on 2 cities San Francisco and Seattle, their delay time is more than 14%
  • AM West

  • Delay time is always higher then Alaska, which is not good
  • San Francisco and Seattle are badly affected by delay time, reaching more than 23%
  • Phoenix is the only city where AM West is close to Alaska
  • Cities

    Cities below ordered by best on_time arrival (1-best and 5-poor)
       1.Pheonix
       2.Los Angeles
       3.San Diego
       4.Seattle
       5.San Francisco