The chart above describes arrival delays for two airlines across five destinations. Your task is to:
# Create the "flight_status" dataframe.
flight_status <- rbind(c(NA, NA, 'Los Angeles', 'Phoenix', 'San Diego', 'San Francisco', 'Seattle'),
c('ALASKA', 'on time', 497, 221, 212, 503, 1841),
c(NA, 'delayed', 62, 12, 20, 102, 305),
c('AM WEST', 'on time', 694, 4840, 383, 320, 201),
c(NA, 'delayed', 117, 415, 65, 129, 61))
# Write the dataframe to a CSV file.
write.table(flight_status, '/Users/stephenhaslett/Desktop/flight_status.csv', sep=",", col.names=FALSE, row.names=FALSE)
Read the data from the “flight_status.csv” into R.
# Import the data from the CSV file into R.
csv_file <- '/Users/stephenhaslett/Desktop/flight_status.csv'
flight_status_data_wide <- read.csv(csv_file, stringsAsFactors = FALSE)
flight_status_data_wide
## NA. NA..1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 <NA> delayed 117 415 65 129 61
The original table data is in a wide structured format. Use the “tidyr” and “dplyr” libraries to first clean up the data (removing blank values etc.), and then convert the format to a long structured format.
library(tidyr)
library(dplyr)
csv_file <- '/Users/stephenhaslett/Desktop/flight_status.csv'
flight_status_data_wide <- read.csv(csv_file, stringsAsFactors = FALSE)
# Rename the NA., and NA..1 column names from the original table so that they are meaningful.
flight_status_data_wide <- flight_status_data_wide %>% rename(Airline = NA., Status = NA..1)
# Replace the <NA> values in the Airline column for rows 2 and 4 with the airline name.
flight_status_data_wide[2,1] <- flight_status_data_wide[1,1]
flight_status_data_wide[4,1] <- flight_status_data_wide[3,1]
# Use the "gather" function to remove the 5 city columns in the orginal table and combine these into 1 column named "Destination".
flight_status_data_long <- gather(flight_status_data_wide, Destination, Flights, Los.Angeles:Seattle) %>%
# Replace the periods between city names with spaces.
mutate(Destination = gsub( "\\.", " ", Destination)) %>%
# Sort the data by airline name.
arrange(Airline)
flight_status_data_long
## Airline Status Destination Flights
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA delayed Los Angeles 62
## 3 ALASKA on time Phoenix 221
## 4 ALASKA delayed Phoenix 12
## 5 ALASKA on time San Diego 212
## 6 ALASKA delayed San Diego 20
## 7 ALASKA on time San Francisco 503
## 8 ALASKA delayed San Francisco 102
## 9 ALASKA on time Seattle 1841
## 10 ALASKA delayed Seattle 305
## 11 AM WEST on time Los Angeles 694
## 12 AM WEST delayed Los Angeles 117
## 13 AM WEST on time Phoenix 4840
## 14 AM WEST delayed Phoenix 415
## 15 AM WEST on time San Diego 383
## 16 AM WEST delayed San Diego 65
## 17 AM WEST on time San Francisco 320
## 18 AM WEST delayed San Francisco 129
## 19 AM WEST on time Seattle 201
## 20 AM WEST delayed Seattle 61
Prepare the data for analysis using the “dplyr” library so that we can compare the arrival delays for the 2 airlines.
library(dplyr)
flights <- flight_status_data_long %>% spread(Status, Flights)
flight_data <- flights %>% mutate(TotalFlights = flights$'on time' + flights$delayed) %>%
mutate('OnTimePercentage' = round(flights$'on time'/TotalFlights * 100, 1)) %>%
mutate('DelayedPercentage' = round(flights$delayed/TotalFlights * 100, 1))
flight_data
## Airline Destination delayed on time TotalFlights OnTimePercentage
## 1 ALASKA Los Angeles 62 497 559 88.9
## 2 ALASKA Phoenix 12 221 233 94.8
## 3 ALASKA San Diego 20 212 232 91.4
## 4 ALASKA San Francisco 102 503 605 83.1
## 5 ALASKA Seattle 305 1841 2146 85.8
## 6 AM WEST Los Angeles 117 694 811 85.6
## 7 AM WEST Phoenix 415 4840 5255 92.1
## 8 AM WEST San Diego 65 383 448 85.5
## 9 AM WEST San Francisco 129 320 449 71.3
## 10 AM WEST Seattle 61 201 262 76.7
## DelayedPercentage
## 1 11.1
## 2 5.2
## 3 8.6
## 4 16.9
## 5 14.2
## 6 14.4
## 7 7.9
## 8 14.5
## 9 28.7
## 10 23.3
Output the arrival delays data as a series of bar graphs that compare the percentage of delayed flights by airline and destination.
library("ggplot2")
delayed_flight_charts <- ggplot(flight_data, aes(y=DelayedPercentage, x = Airline, color = Airline, fill = Airline)) +
geom_bar(stat = "identity") +
geom_text(aes(label = DelayedPercentage), vjust = -0.3) +
facet_wrap(~Destination) +
ylim(0, 40) +
ylab("Percentage of Delays") +
ggtitle("Percentage of Delays by Airline and Destination")
delayed_flight_charts
Output the total number of flights offered per airline, per destination.
total_flights_charts <- ggplot(flight_data, aes(y = TotalFlights, x = Airline, color = Airline, fill = Airline)) +
geom_bar( stat = "identity", position = "dodge") +
geom_text(aes(label = TotalFlights), position=position_dodge(width=0.9), vjust=-0.5) +
facet_wrap(~Destination) +
ylim(0, 6000) +
ylab("Total Flights") +
ggtitle("Total Number of Flights Per Airline and Destination")
total_flights_charts
From the above analysis, it is clear that Alaska Airlines has a smaller percentage of delayed arrivals per destination than does American West Airlines. With the exceptions of Seattle and San Francisco, American West Airlines offer more flights per destination than Alaska Airlines. However, even in the case of these exceptions, American West Airlines still has a higher percentage of delayed arrivals.
Based on this information, I believe we can conclude that Alaska Airlines would be the better choice of airline if your objective is to arrive at your destination on time.