The chart above describes arrival delays for two airlines across five destinations. Your task is to:

  1. Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
# Create the "flight_status" dataframe.
flight_status <- rbind(c(NA, NA, 'Los Angeles', 'Phoenix', 'San Diego', 'San Francisco', 'Seattle'),
                       c('ALASKA', 'on time', 497, 221, 212, 503, 1841),
                       c(NA, 'delayed', 62, 12, 20, 102, 305),
                       c('AM WEST', 'on time', 694, 4840, 383, 320, 201),
                       c(NA, 'delayed', 117, 415, 65, 129, 61))
                       
# Write the dataframe to a CSV file. 
write.table(flight_status, '/Users/stephenhaslett/Desktop/flight_status.csv', sep=",", col.names=FALSE, row.names=FALSE)
  1. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data:
Step 1

Read the data from the “flight_status.csv” into R.

# Import the data from the CSV file into R.
csv_file <- '/Users/stephenhaslett/Desktop/flight_status.csv'
flight_status_data_wide <- read.csv(csv_file, stringsAsFactors = FALSE)
flight_status_data_wide
##       NA.   NA..1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2    <NA> delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4    <NA> delayed         117     415        65           129      61
Step 2

The original table data is in a wide structured format. Use the “tidyr” and “dplyr” libraries to first clean up the data (removing blank values etc.), and then convert the format to a long structured format.

library(tidyr)
library(dplyr)

csv_file <- '/Users/stephenhaslett/Desktop/flight_status.csv'
flight_status_data_wide <- read.csv(csv_file, stringsAsFactors = FALSE)

# Rename the NA., and NA..1 column names from the original table so that they are meaningful.
flight_status_data_wide <- flight_status_data_wide %>% rename(Airline = NA., Status = NA..1)

# Replace the <NA> values in the Airline column for rows 2 and 4 with the airline name.
flight_status_data_wide[2,1] <- flight_status_data_wide[1,1]
flight_status_data_wide[4,1] <- flight_status_data_wide[3,1]

# Use the "gather" function to remove the 5 city columns in the orginal table and combine these into 1 column named "Destination".
flight_status_data_long <- gather(flight_status_data_wide, Destination, Flights, Los.Angeles:Seattle) %>%
                           # Replace the periods between city names with spaces.
                           mutate(Destination = gsub( "\\.", " ", Destination)) %>%
                           # Sort the data by airline name.
                           arrange(Airline)
flight_status_data_long
##    Airline  Status   Destination Flights
## 1   ALASKA on time   Los Angeles     497
## 2   ALASKA delayed   Los Angeles      62
## 3   ALASKA on time       Phoenix     221
## 4   ALASKA delayed       Phoenix      12
## 5   ALASKA on time     San Diego     212
## 6   ALASKA delayed     San Diego      20
## 7   ALASKA on time San Francisco     503
## 8   ALASKA delayed San Francisco     102
## 9   ALASKA on time       Seattle    1841
## 10  ALASKA delayed       Seattle     305
## 11 AM WEST on time   Los Angeles     694
## 12 AM WEST delayed   Los Angeles     117
## 13 AM WEST on time       Phoenix    4840
## 14 AM WEST delayed       Phoenix     415
## 15 AM WEST on time     San Diego     383
## 16 AM WEST delayed     San Diego      65
## 17 AM WEST on time San Francisco     320
## 18 AM WEST delayed San Francisco     129
## 19 AM WEST on time       Seattle     201
## 20 AM WEST delayed       Seattle      61
Step 3

Prepare the data for analysis using the “dplyr” library so that we can compare the arrival delays for the 2 airlines.

library(dplyr)

flights <- flight_status_data_long %>% spread(Status, Flights)

flight_data <- flights %>% mutate(TotalFlights = flights$'on time' + flights$delayed) %>%
           mutate('OnTimePercentage' = round(flights$'on time'/TotalFlights * 100, 1)) %>%
           mutate('DelayedPercentage' = round(flights$delayed/TotalFlights * 100, 1))

flight_data
##    Airline   Destination delayed on time TotalFlights OnTimePercentage
## 1   ALASKA   Los Angeles      62     497          559             88.9
## 2   ALASKA       Phoenix      12     221          233             94.8
## 3   ALASKA     San Diego      20     212          232             91.4
## 4   ALASKA San Francisco     102     503          605             83.1
## 5   ALASKA       Seattle     305    1841         2146             85.8
## 6  AM WEST   Los Angeles     117     694          811             85.6
## 7  AM WEST       Phoenix     415    4840         5255             92.1
## 8  AM WEST     San Diego      65     383          448             85.5
## 9  AM WEST San Francisco     129     320          449             71.3
## 10 AM WEST       Seattle      61     201          262             76.7
##    DelayedPercentage
## 1               11.1
## 2                5.2
## 3                8.6
## 4               16.9
## 5               14.2
## 6               14.4
## 7                7.9
## 8               14.5
## 9               28.7
## 10              23.3
  1. Perform analysis to compare the arrival delays for the two airlines.
Step 1

Output the arrival delays data as a series of bar graphs that compare the percentage of delayed flights by airline and destination.

library("ggplot2")

delayed_flight_charts <- ggplot(flight_data, aes(y=DelayedPercentage, x = Airline, color = Airline, fill = Airline)) + 
    geom_bar(stat = "identity") +
    geom_text(aes(label = DelayedPercentage), vjust = -0.3) +
    facet_wrap(~Destination) +
    ylim(0, 40) +
    ylab("Percentage of Delays") +
    ggtitle("Percentage of Delays by Airline and Destination")
    
delayed_flight_charts

Step 2

Output the total number of flights offered per airline, per destination.

total_flights_charts <- ggplot(flight_data, aes(y = TotalFlights, x = Airline, color = Airline, fill = Airline)) + 
    geom_bar( stat = "identity", position = "dodge") +
    geom_text(aes(label = TotalFlights), position=position_dodge(width=0.9), vjust=-0.5) +
    facet_wrap(~Destination) +
    ylim(0, 6000) +
    ylab("Total Flights") +
    ggtitle("Total Number of Flights Per Airline and Destination")

total_flights_charts

Conclusion

From the above analysis, it is clear that Alaska Airlines has a smaller percentage of delayed arrivals per destination than does American West Airlines. With the exceptions of Seattle and San Francisco, American West Airlines offer more flights per destination than Alaska Airlines. However, even in the case of these exceptions, American West Airlines still has a higher percentage of delayed arrivals.

Based on this information, I believe we can conclude that Alaska Airlines would be the better choice of airline if your objective is to arrive at your destination on time.