The data contained in the table above are replicated in a .csv file.
The data is read into a data frame, using the check.names = FALSE option to avoid inserting unnecessary periods in cities with spaces:
airlines <- read.table('./airlines.csv', sep = ",", check.names = FALSE, header = TRUE, stringsAsFactors = FALSE)
The first two column names are filled in, and the blank entries in the airline column are filled in:
names(airlines) <- c('Airline', 'Status', names(airlines)[3:7])
for (i in 1:nrow(airlines)) {
if (airlines[i, 1] == "") {
airlines[i, 1] = airlines[i - 1, 1]
}
}
To tidy the data, the frequencies are gathered by destination using tidyr. The blank row is filtered out using dplyr:
airlines <- airlines %>%
gather(Destination, Frequency, -Airline, -Status) %>%
filter(Frequency != "")
Flight statuses are renamed for consistency, and then spread out in order to allow for analysis:
airlines$Status[airlines$Status == "on time"] <- "OnTime"
airlines$Status[airlines$Status == "delayed"] <- "Delayed"
airlines <- airlines %>%
spread(Status, Frequency)
kable(airlines)
| Airline | Destination | Delayed | OnTime |
|---|---|---|---|
| ALASKA | Los Angeles | 62 | 497 |
| ALASKA | Phoenix | 12 | 221 |
| ALASKA | San Diego | 20 | 212 |
| ALASKA | San Francisco | 102 | 503 |
| ALASKA | Seattle | 305 | 1841 |
| AM WEST | Los Angeles | 117 | 694 |
| AM WEST | Phoenix | 415 | 4840 |
| AM WEST | San Diego | 65 | 383 |
| AM WEST | San Francisco | 129 | 320 |
| AM WEST | Seattle | 61 | 201 |
In order to analyze the two airlines’ performance, the mutate function is used to calculate the total number of flights and use that to calculate the percentage of flights that arrive on time:
airlines <- airlines %>%
mutate(
Total = Delayed + OnTime,
OnTimeRate = OnTime / Total) %>%
select(-Delayed, -OnTime)
kable(airlines)
| Airline | Destination | Total | OnTimeRate |
|---|---|---|---|
| ALASKA | Los Angeles | 559 | 0.8890877 |
| ALASKA | Phoenix | 233 | 0.9484979 |
| ALASKA | San Diego | 232 | 0.9137931 |
| ALASKA | San Francisco | 605 | 0.8314050 |
| ALASKA | Seattle | 2146 | 0.8578751 |
| AM WEST | Los Angeles | 811 | 0.8557337 |
| AM WEST | Phoenix | 5255 | 0.9210276 |
| AM WEST | San Diego | 448 | 0.8549107 |
| AM WEST | San Francisco | 449 | 0.7126949 |
| AM WEST | Seattle | 262 | 0.7671756 |
The airlines’ on-time performance at each of the five airports is visualized using ggplot2:
From this chart, it is clear that Alaska has better on-time performance at each of the five destinations. To investigate if there is a relationship between on-time performance and volume of flights to a destination:
From this plot, there does not seem to be much relationship between flight volume and on-time performance — the overall correlation between the two variables is only 0.302.