The purpose of the assignment is to read in a file with on time and delayed number of flights for 2 airlines and perform an analysis compaing arrival delays of the two airlines
library(tidyr)
library(dplyr)
library(knitr)
airline_status <- read.csv("airline_status.csv", stringsAsFactors = FALSE)
kable(airline_status)
| Airline | Status | Los.Angeles | Phoenix | San.Diego | San.Fransisco | Seattle |
|---|---|---|---|---|---|---|
| Alaska | on-time | 497 | 221 | 212 | 503 | 1841 |
| Alaska | delayed | 62 | 12 | 20 | 102 | 305 |
| AM West | on-time | 694 | 4840 | 383 | 320 | 201 |
| AM West | delayed | 117 | 415 | 65 | 129 | 61 |
Our goal is to create a dataset that looks as follows:
| Airline | City | Delayed | On_Time |
|---|---|---|---|
| Delta | Atlanta | 1,900 | 3,000 |
| Delta | New York | 2,000 | 1,900 |
| AA | Chicago | 1,000 | 2,000 |
Since each observation is an ariline at a particular city.
First, we need to unpivot the column headings and place each column heading into a row.
airline_status$Status[airline_status$Status=="on-time"] <- "on_time" #mutate does not like '-' in a column header
airline_status <-
airline_status %>%
gather("City", "n", 3:7)
kable(airline_status)
| Airline | Status | City | n |
|---|---|---|---|
| Alaska | on_time | Los.Angeles | 497 |
| Alaska | delayed | Los.Angeles | 62 |
| AM West | on_time | Los.Angeles | 694 |
| AM West | delayed | Los.Angeles | 117 |
| Alaska | on_time | Phoenix | 221 |
| Alaska | delayed | Phoenix | 12 |
| AM West | on_time | Phoenix | 4840 |
| AM West | delayed | Phoenix | 415 |
| Alaska | on_time | San.Diego | 212 |
| Alaska | delayed | San.Diego | 20 |
| AM West | on_time | San.Diego | 383 |
| AM West | delayed | San.Diego | 65 |
| Alaska | on_time | San.Fransisco | 503 |
| Alaska | delayed | San.Fransisco | 102 |
| AM West | on_time | San.Fransisco | 320 |
| AM West | delayed | San.Fransisco | 129 |
| Alaska | on_time | Seattle | 1841 |
| Alaska | delayed | Seattle | 305 |
| AM West | on_time | Seattle | 201 |
| AM West | delayed | Seattle | 61 |
Now we need to pivot the on-time and delayed count to the column headings.
airline_status <-
airline_status %>%
spread("Status", "n")
kable(airline_status)
| Airline | City | delayed | on_time |
|---|---|---|---|
| Alaska | Los.Angeles | 62 | 497 |
| Alaska | Phoenix | 12 | 221 |
| Alaska | San.Diego | 20 | 212 |
| Alaska | San.Fransisco | 102 | 503 |
| Alaska | Seattle | 305 | 1841 |
| AM West | Los.Angeles | 117 | 694 |
| AM West | Phoenix | 415 | 4840 |
| AM West | San.Diego | 65 | 383 |
| AM West | San.Fransisco | 129 | 320 |
| AM West | Seattle | 61 | 201 |
To help with the analysis I am adding the ratio of on-time and delayed flights to the data frame.
airline_status <-
airline_status %>%
mutate(ratio_on_time = on_time/(delayed + on_time), ratio_delayed = 1 - ratio_on_time )
kable(airline_status)
| Airline | City | delayed | on_time | ratio_on_time | ratio_delayed |
|---|---|---|---|---|---|
| Alaska | Los.Angeles | 62 | 497 | 0.8890877 | 0.1109123 |
| Alaska | Phoenix | 12 | 221 | 0.9484979 | 0.0515021 |
| Alaska | San.Diego | 20 | 212 | 0.9137931 | 0.0862069 |
| Alaska | San.Fransisco | 102 | 503 | 0.8314050 | 0.1685950 |
| Alaska | Seattle | 305 | 1841 | 0.8578751 | 0.1421249 |
| AM West | Los.Angeles | 117 | 694 | 0.8557337 | 0.1442663 |
| AM West | Phoenix | 415 | 4840 | 0.9210276 | 0.0789724 |
| AM West | San.Diego | 65 | 383 | 0.8549107 | 0.1450893 |
| AM West | San.Fransisco | 129 | 320 | 0.7126949 | 0.2873051 |
| AM West | Seattle | 61 | 201 | 0.7671756 | 0.2328244 |
My first analysis will be to see what airline has the best overall on-time rate. I will do this by grouping the data frame by airline and summing the on-time and delayed flights for that airline and then calculate the on-time ratio at the higher level of analysis.
total_percent_on_time <-
airline_status %>%
group_by(Airline) %>%
summarise (total_delayed = sum(delayed), total_on_time = sum(on_time)) %>%
mutate (total_ratio_on_time = total_on_time / ( total_on_time + total_delayed ))
kable(total_percent_on_time)
| Airline | total_delayed | total_on_time | total_ratio_on_time |
|---|---|---|---|
| Alaska | 501 | 3274 | 0.8672848 |
| AM West | 787 | 6438 | 0.8910727 |
most_on_time_airline <-
total_percent_on_time %>%
filter (total_ratio_on_time == max(total_ratio_on_time))
The second analysis I will perform is to see what airline has the most variance in its on-time performance. To do this I will calculate the straight average and standard deviation of the airlines on-time ratio, and then see what airline has the lowest standard deviation.
summary_stats_by_airline <-
airline_status %>%
group_by(Airline) %>%
summarise(mean_ratio_ontime = mean(ratio_on_time), sd_ratio_on_time = sd(ratio_on_time))
kable(summary_stats_by_airline)
| Airline | mean_ratio_ontime | sd_ratio_on_time |
|---|---|---|
| Alaska | 0.8881317 | 0.0459262 |
| AM West | 0.8223085 | 0.0821285 |
most_consistent_airline <-
summary_stats_by_airline %>%
filter (sd_ratio_on_time == min(sd_ratio_on_time))
The airline with the best on-time percentage is AM West with on on-time percentage of 89.11%
The airline with the most consistent performance is Alaska with a standard deviation of the percentage on-time of 4.59%