Task:
- Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above.You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
- Read the information from your .CSV file into R, and use tidyr and dplyr as to tidy and transform your data.
- Perform analysis to compare the arrival delays for the two airlines.
Load data into R
airlines <- read.csv("https://cdn.rawgit.com/nschettini/CUNY-MSDS-DATA-607/78383698/flights.csv", stringsAsFactors = F)
Library
library(tidyr)
library(dplyr)
library(tidyverse)
library(kableExtra)
library(knitr)
airlines
## Carrier Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska on_time 497 221 212 503 1841
## 2 Alaska delayed 62 12 20 102 305
## 3 am west on_time 694 4840 383 320 201
## 4 am west delayed 117 415 65 129 61
Manipulating the data. Added columns: total_flights, ontime_percent, and delayed_percent
airline_data <- airlines
airline_df <- airline_data %>%
gather("location", "flights", 3:7) %>%
spread(Status, "flights") %>%
mutate(total_flights = delayed + on_time) %>%
mutate(ontime_percent = on_time / total_flights) %>%
mutate(delayed_percent = delayed / total_flights)
airline_df
## Carrier location delayed on_time total_flights ontime_percent
## 1 Alaska Los.Angeles 62 497 559 0.8890877
## 2 Alaska Phoenix 12 221 233 0.9484979
## 3 Alaska San.Diego 20 212 232 0.9137931
## 4 Alaska San.Francisco 102 503 605 0.8314050
## 5 Alaska Seattle 305 1841 2146 0.8578751
## 6 am west Los.Angeles 117 694 811 0.8557337
## 7 am west Phoenix 415 4840 5255 0.9210276
## 8 am west San.Diego 65 383 448 0.8549107
## 9 am west San.Francisco 129 320 449 0.7126949
## 10 am west Seattle 61 201 262 0.7671756
## delayed_percent
## 1 0.11091234
## 2 0.05150215
## 3 0.08620690
## 4 0.16859504
## 5 0.14212488
## 6 0.14426634
## 7 0.07897241
## 8 0.14508929
## 9 0.28730512
## 10 0.23282443
Data manipulation
How many total flights does Alaska and AM WEST airlines have in our dataset?
airline_df %>%
group_by(Carrier) %>%
summarize(totalflights = sum(total_flights))
## # A tibble: 2 x 2
## Carrier totalflights
## <chr> <int>
## 1 Alaska 3775
## 2 am west 7225
Acording to the data, AM WEST has almost double the total # of flights in this dataset.
Which airline and city has the highest on-time percentage?
kable(airline_df %>%
filter(ontime_percent == max(ontime_percent)), "html", escape = F) %>%
kable_styling("striped", full_width = F, font_size = 15)
Carrier | location | delayed | on_time | total_flights | ontime_percent | delayed_percent |
---|---|---|---|---|---|---|
Alaska | Phoenix | 12 | 221 | 233 | 0.9484979 | 0.0515021 |
Airline with the highest on-time percentage from the entire dataset: Alaska airlines from city phoenix has the highest percent of 0.95. It seems that the airport in Phoenix for Alaska airlines is the most efficient with regards to being on time.
Which airline and city has the worst on-time percentage?
kable(airline_df %>%
filter(delayed_percent == max(delayed_percent)), "html", escape = F) %>%
kable_styling("striped", full_width = F, font_size = 15)
Carrier | location | delayed | on_time | total_flights | ontime_percent | delayed_percent |
---|---|---|---|---|---|---|
am west | San.Francisco | 129 | 320 | 449 | 0.7126949 | 0.2873051 |
Airline with most delayed flights is Am west from city San Francisco with a delay percentage of .28. They’re only on time about 71% of the time.
Looking at just this - it looks as if Alaska has a huge lead on on-time percentage.
We do know that the total # of flights for each airline isn’t 1:1. Alaska has 3775 total flights, while AM WEST has 7225 total flights.
airline_alaska <- airline_df %>% filter(airline_df$Carrier == "Alaska")
kable(airline_alaska, "html", escape = F) %>%
kable_styling("striped", full_width = F, font_size = 15)
Carrier | location | delayed | on_time | total_flights | ontime_percent | delayed_percent |
---|---|---|---|---|---|---|
Alaska | Los.Angeles | 62 | 497 | 559 | 0.8890877 | 0.1109123 |
Alaska | Phoenix | 12 | 221 | 233 | 0.9484979 | 0.0515021 |
Alaska | San.Diego | 20 | 212 | 232 | 0.9137931 | 0.0862069 |
Alaska | San.Francisco | 102 | 503 | 605 | 0.8314050 | 0.1685950 |
Alaska | Seattle | 305 | 1841 | 2146 | 0.8578751 | 0.1421249 |
airline_amwest <- airline_df %>% filter(airline_df$Carrier == "am west")
kable(airline_amwest, "html", escape = F) %>%
kable_styling("striped", full_width = F, font_size = 15)
Carrier | location | delayed | on_time | total_flights | ontime_percent | delayed_percent |
---|---|---|---|---|---|---|
am west | Los.Angeles | 117 | 694 | 811 | 0.8557337 | 0.1442663 |
am west | Phoenix | 415 | 4840 | 5255 | 0.9210276 | 0.0789724 |
am west | San.Diego | 65 | 383 | 448 | 0.8549107 | 0.1450893 |
am west | San.Francisco | 129 | 320 | 449 | 0.7126949 | 0.2873051 |
am west | Seattle | 61 | 201 | 262 | 0.7671756 | 0.2328244 |
How do these two airlines really compare with on-time flights?
sum(airline_alaska$on_time)
## [1] 3274
sum(airline_amwest$on_time)
## [1] 6438
Looking at the data, it seems that amwest still has more total flights: ontime than alaska does … but we know that the total number of on_time and delayed aren’t equal/1:1
If we take the proportion of on-time flights vs. the total amount of flights they have …
alaska_ontime <- sum(airline_alaska$on_time)/3775*100
amwest_ontime <- sum(airline_amwest$on_time)/7225*100
alaska_ontime
## [1] 86.72848
amwest_ontime
## [1] 89.10727
sum(airline_alaska$delayed)/3775*100
## [1] 13.27152
sum(airline_amwest$delayed)/7225*100
## [1] 10.89273
It seems that overall alaska and amwest are close in regards to their on-time and delayed percentage.
Visualization
ggplot(airline_df, aes(airline_df$on_time, airline_df$delayed)) +
geom_point(aes(color = airline_df$Carrier)) +
ggtitle("Number of flights on-time vs. delayed for both Airlines") +
xlab("On-time flights") +
ylab("Delayed flights") +
theme_dark()
ggplot(airline_alaska, aes(airline_alaska$total_flights)) +
geom_histogram(aes(fill=airline_alaska$location)) +
ggtitle("Alaska Airlines - total flights vs. location") +
xlab("Total flights") +
ylab("Count") +
theme_dark()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(airline_amwest, aes(airline_amwest$total_flights)) +
geom_histogram(aes(fill=airline_amwest$location)) +
ggtitle("AMWEST Airlines - total flights vs. location") +
xlab("Total flights") +
ylab("Count") +
theme_dark()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Conclusions:
It seems that in looking at this data, both airlines have a similar on-time and dlayed percentage based on their total number of flights. AM WEST, even though with a slightly lower on-time percentage, almost has double the amount of total flights than Alaska Airlines.