CUNY MSDS DATA 607 - Tidying and Transforming Data

Nick Schettini

2018-03-02

Task:

  1. Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above.You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
  2. Read the information from your .CSV file into R, and use tidyr and dplyr as to tidy and transform your data.
  3. Perform analysis to compare the arrival delays for the two airlines.

Load data into R

airlines <- read.csv("https://cdn.rawgit.com/nschettini/CUNY-MSDS-DATA-607/78383698/flights.csv", stringsAsFactors = F)

Library

library(tidyr)
library(dplyr)
library(tidyverse)
library(kableExtra)
library(knitr)
airlines
##   Carrier  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  Alaska on_time         497     221       212           503    1841
## 2  Alaska delayed          62      12        20           102     305
## 3 am west on_time         694    4840       383           320     201
## 4 am west delayed         117     415        65           129      61

Manipulating the data. Added columns: total_flights, ontime_percent, and delayed_percent

airline_data <- airlines
airline_df <- airline_data %>%
  gather("location", "flights", 3:7) %>%
  spread(Status, "flights") %>%
  mutate(total_flights = delayed + on_time) %>%
  mutate(ontime_percent = on_time / total_flights) %>%
  mutate(delayed_percent = delayed / total_flights)
airline_df
##    Carrier      location delayed on_time total_flights ontime_percent
## 1   Alaska   Los.Angeles      62     497           559      0.8890877
## 2   Alaska       Phoenix      12     221           233      0.9484979
## 3   Alaska     San.Diego      20     212           232      0.9137931
## 4   Alaska San.Francisco     102     503           605      0.8314050
## 5   Alaska       Seattle     305    1841          2146      0.8578751
## 6  am west   Los.Angeles     117     694           811      0.8557337
## 7  am west       Phoenix     415    4840          5255      0.9210276
## 8  am west     San.Diego      65     383           448      0.8549107
## 9  am west San.Francisco     129     320           449      0.7126949
## 10 am west       Seattle      61     201           262      0.7671756
##    delayed_percent
## 1       0.11091234
## 2       0.05150215
## 3       0.08620690
## 4       0.16859504
## 5       0.14212488
## 6       0.14426634
## 7       0.07897241
## 8       0.14508929
## 9       0.28730512
## 10      0.23282443

Data manipulation

How many total flights does Alaska and AM WEST airlines have in our dataset?

airline_df %>%
  group_by(Carrier) %>%
  summarize(totalflights = sum(total_flights))
## # A tibble: 2 x 2
##   Carrier totalflights
##   <chr>          <int>
## 1 Alaska          3775
## 2 am west         7225

Acording to the data, AM WEST has almost double the total # of flights in this dataset.

Which airline and city has the highest on-time percentage?

kable(airline_df %>%
  filter(ontime_percent == max(ontime_percent)), "html", escape = F) %>%
  kable_styling("striped", full_width = F, font_size = 15)
Carrier location delayed on_time total_flights ontime_percent delayed_percent
Alaska Phoenix 12 221 233 0.9484979 0.0515021

Airline with the highest on-time percentage from the entire dataset: Alaska airlines from city phoenix has the highest percent of 0.95. It seems that the airport in Phoenix for Alaska airlines is the most efficient with regards to being on time.

Which airline and city has the worst on-time percentage?

kable(airline_df %>%
  filter(delayed_percent == max(delayed_percent)), "html", escape = F) %>%
  kable_styling("striped", full_width = F, font_size = 15)
Carrier location delayed on_time total_flights ontime_percent delayed_percent
am west San.Francisco 129 320 449 0.7126949 0.2873051

Airline with most delayed flights is Am west from city San Francisco with a delay percentage of .28. They’re only on time about 71% of the time.

Looking at just this - it looks as if Alaska has a huge lead on on-time percentage.

We do know that the total # of flights for each airline isn’t 1:1. Alaska has 3775 total flights, while AM WEST has 7225 total flights.

airline_alaska <- airline_df %>% filter(airline_df$Carrier == "Alaska")
kable(airline_alaska, "html", escape = F) %>%
  kable_styling("striped", full_width = F, font_size = 15)
Carrier location delayed on_time total_flights ontime_percent delayed_percent
Alaska Los.Angeles 62 497 559 0.8890877 0.1109123
Alaska Phoenix 12 221 233 0.9484979 0.0515021
Alaska San.Diego 20 212 232 0.9137931 0.0862069
Alaska San.Francisco 102 503 605 0.8314050 0.1685950
Alaska Seattle 305 1841 2146 0.8578751 0.1421249
airline_amwest <- airline_df %>% filter(airline_df$Carrier == "am west")
kable(airline_amwest, "html", escape = F) %>%
  kable_styling("striped", full_width = F, font_size = 15)
Carrier location delayed on_time total_flights ontime_percent delayed_percent
am west Los.Angeles 117 694 811 0.8557337 0.1442663
am west Phoenix 415 4840 5255 0.9210276 0.0789724
am west San.Diego 65 383 448 0.8549107 0.1450893
am west San.Francisco 129 320 449 0.7126949 0.2873051
am west Seattle 61 201 262 0.7671756 0.2328244

How do these two airlines really compare with on-time flights?

sum(airline_alaska$on_time)
## [1] 3274
sum(airline_amwest$on_time)
## [1] 6438

Looking at the data, it seems that amwest still has more total flights: ontime than alaska does … but we know that the total number of on_time and delayed aren’t equal/1:1

If we take the proportion of on-time flights vs. the total amount of flights they have …

alaska_ontime <- sum(airline_alaska$on_time)/3775*100
amwest_ontime <- sum(airline_amwest$on_time)/7225*100
alaska_ontime
## [1] 86.72848
amwest_ontime
## [1] 89.10727
sum(airline_alaska$delayed)/3775*100
## [1] 13.27152
sum(airline_amwest$delayed)/7225*100
## [1] 10.89273

It seems that overall alaska and amwest are close in regards to their on-time and delayed percentage.

Visualization

ggplot(airline_df, aes(airline_df$on_time, airline_df$delayed)) +
  geom_point(aes(color = airline_df$Carrier)) + 
  ggtitle("Number of flights on-time vs. delayed for both Airlines") +
  xlab("On-time flights") +
  ylab("Delayed flights") +
  theme_dark()

ggplot(airline_alaska, aes(airline_alaska$total_flights)) + 
  geom_histogram(aes(fill=airline_alaska$location)) +
  ggtitle("Alaska Airlines - total flights vs. location") +
  xlab("Total flights") +
  ylab("Count") +
  theme_dark()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(airline_amwest, aes(airline_amwest$total_flights)) + 
  geom_histogram(aes(fill=airline_amwest$location)) +
  ggtitle("AMWEST Airlines - total flights vs. location") +
  xlab("Total flights") +
  ylab("Count") +
  theme_dark()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Conclusions:

It seems that in looking at this data, both airlines have a similar on-time and dlayed percentage based on their total number of flights. AM WEST, even though with a slightly lower on-time percentage, almost has double the amount of total flights than Alaska Airlines.