Kelly Shaffer

Week 5 Homework for DATA607

The data describes arrival delays for four airlines across five destinations. Deliverables include:

1) Creating a .csv file that includes the flight information in the table using a wide structure.

2) Reading the .csv file into R, and using tidyr and dplyr to tidy and transform the data.

3) Performing analysis to compare the arrival delays for the four airlines.

4) Saving to an r markdown file, posting to rpubs, posting to github.

#install.packages("rio")
#install.packages("RCurl")
#install.packages("bitops")
#install.packages("ggplot2")
#install.packages("tidyr")
#install.packages("dplyr")
library(rio)
library(RCurl)
## Loading required package: bitops
library(bitops)
library(ggplot2)
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
## 
##     complete
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#Load the data
x <- getURL("https://raw.githubusercontent.com/excelsiordata/DATA607/master/FlightDelays.csv")
FlightDelays <- read.csv(text = x, head=TRUE, sep=",", stringsAsFactors=FALSE, col.names = c("Airline", "Status", "Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle"))

#Take a look at the data and make sure everything loaded in properly
head(FlightDelays)
##     Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1    ALASKA on time         497     221       212           503    1841
## 2    ALASKA delayed          62      12        20           102     305
## 3   AM WEST on time         694    4840       383           320     201
## 4   AM WEST delayed         117     415        65           129      61
## 5 Allegiant on time         658     425       352           645     648
## 6 Allegiant delayed         100     150       167            20      40
#Looks good!

#Let's get the separate city columns into rows
TidyFlightDelays1 <- data.frame(gather(FlightDelays, "City", "n", 3:7))

(TidyFlightDelays1)
##      Airline  Status          City    n
## 1     ALASKA on time   Los.Angeles  497
## 2     ALASKA delayed   Los.Angeles   62
## 3    AM WEST on time   Los.Angeles  694
## 4    AM WEST delayed   Los.Angeles  117
## 5  Allegiant on time   Los.Angeles  658
## 6  Allegiant delayed   Los.Angeles  100
## 7  Southwest on time   Los.Angeles  700
## 8  Southwest delayed   Los.Angeles   86
## 9     ALASKA on time       Phoenix  221
## 10    ALASKA delayed       Phoenix   12
## 11   AM WEST on time       Phoenix 4840
## 12   AM WEST delayed       Phoenix  415
## 13 Allegiant on time       Phoenix  425
## 14 Allegiant delayed       Phoenix  150
## 15 Southwest on time       Phoenix  640
## 16 Southwest delayed       Phoenix   77
## 17    ALASKA on time     San.Diego  212
## 18    ALASKA delayed     San.Diego   20
## 19   AM WEST on time     San.Diego  383
## 20   AM WEST delayed     San.Diego   65
## 21 Allegiant on time     San.Diego  352
## 22 Allegiant delayed     San.Diego  167
## 23 Southwest on time     San.Diego  395
## 24 Southwest delayed     San.Diego  115
## 25    ALASKA on time San.Francisco  503
## 26    ALASKA delayed San.Francisco  102
## 27   AM WEST on time San.Francisco  320
## 28   AM WEST delayed San.Francisco  129
## 29 Allegiant on time San.Francisco  645
## 30 Allegiant delayed San.Francisco   20
## 31 Southwest on time San.Francisco  452
## 32 Southwest delayed San.Francisco  103
## 33    ALASKA on time       Seattle 1841
## 34    ALASKA delayed       Seattle  305
## 35   AM WEST on time       Seattle  201
## 36   AM WEST delayed       Seattle   61
## 37 Allegiant on time       Seattle  648
## 38 Allegiant delayed       Seattle   40
## 39 Southwest on time       Seattle  521
## 40 Southwest delayed       Seattle   89
tbl_df(TidyFlightDelays1)
## # A tibble: 40 × 4
##      Airline  Status        City     n
##        <chr>   <chr>       <chr> <int>
## 1     ALASKA on time Los.Angeles   497
## 2     ALASKA delayed Los.Angeles    62
## 3    AM WEST on time Los.Angeles   694
## 4    AM WEST delayed Los.Angeles   117
## 5  Allegiant on time Los.Angeles   658
## 6  Allegiant delayed Los.Angeles   100
## 7  Southwest on time Los.Angeles   700
## 8  Southwest delayed Los.Angeles    86
## 9     ALASKA on time     Phoenix   221
## 10    ALASKA delayed     Phoenix    12
## # ... with 30 more rows
glimpse(TidyFlightDelays1)
## Observations: 40
## Variables: 4
## $ Airline <chr> "ALASKA", "ALASKA", "AM WEST", "AM WEST", "Allegiant",...
## $ Status  <chr> "on time", "delayed", "on time", "delayed", "on time",...
## $ City    <chr> "Los.Angeles", "Los.Angeles", "Los.Angeles", "Los.Ange...
## $ n       <int> 497, 62, 694, 117, 658, 100, 700, 86, 221, 12, 4840, 4...
View(TidyFlightDelays1)

#Let's split on time/delayed out into their own columns
TidyFlightDelays2 <- data.frame(spread(TidyFlightDelays1, "Status", "n", 2:2))

(TidyFlightDelays2)
##      Airline          City delayed on.time
## 1     ALASKA   Los.Angeles      62     497
## 2     ALASKA       Phoenix      12     221
## 3     ALASKA     San.Diego      20     212
## 4     ALASKA San.Francisco     102     503
## 5     ALASKA       Seattle     305    1841
## 6  Allegiant   Los.Angeles     100     658
## 7  Allegiant       Phoenix     150     425
## 8  Allegiant     San.Diego     167     352
## 9  Allegiant San.Francisco      20     645
## 10 Allegiant       Seattle      40     648
## 11   AM WEST   Los.Angeles     117     694
## 12   AM WEST       Phoenix     415    4840
## 13   AM WEST     San.Diego      65     383
## 14   AM WEST San.Francisco     129     320
## 15   AM WEST       Seattle      61     201
## 16 Southwest   Los.Angeles      86     700
## 17 Southwest       Phoenix      77     640
## 18 Southwest     San.Diego     115     395
## 19 Southwest San.Francisco     103     452
## 20 Southwest       Seattle      89     521
tbl_df(TidyFlightDelays2)
## # A tibble: 20 × 4
##      Airline          City delayed on.time
## *      <chr>         <chr>   <int>   <int>
## 1     ALASKA   Los.Angeles      62     497
## 2     ALASKA       Phoenix      12     221
## 3     ALASKA     San.Diego      20     212
## 4     ALASKA San.Francisco     102     503
## 5     ALASKA       Seattle     305    1841
## 6  Allegiant   Los.Angeles     100     658
## 7  Allegiant       Phoenix     150     425
## 8  Allegiant     San.Diego     167     352
## 9  Allegiant San.Francisco      20     645
## 10 Allegiant       Seattle      40     648
## 11   AM WEST   Los.Angeles     117     694
## 12   AM WEST       Phoenix     415    4840
## 13   AM WEST     San.Diego      65     383
## 14   AM WEST San.Francisco     129     320
## 15   AM WEST       Seattle      61     201
## 16 Southwest   Los.Angeles      86     700
## 17 Southwest       Phoenix      77     640
## 18 Southwest     San.Diego     115     395
## 19 Southwest San.Francisco     103     452
## 20 Southwest       Seattle      89     521
glimpse(TidyFlightDelays2)
## Observations: 20
## Variables: 4
## $ Airline <chr> "ALASKA", "ALASKA", "ALASKA", "ALASKA", "ALASKA", "All...
## $ City    <chr> "Los.Angeles", "Phoenix", "San.Diego", "San.Francisco"...
## $ delayed <int> 62, 12, 20, 102, 305, 100, 150, 167, 20, 40, 117, 415,...
## $ on.time <int> 497, 221, 212, 503, 1841, 658, 425, 352, 645, 648, 694...
View(TidyFlightDelays2)

Now that we’ve got our data frame into tidy format, we can begin to analyze it. TidyFlightDelays1 is not tidy, but TidyFlightDelays2 is.

TidyFlightDelays1 %>%
  group_by(Airline, Status) %>%
  summarise(Mean = mean(n), Total.Flights = sum(sum(n))) %>%
  arrange(Mean)
## Source: local data frame [8 x 4]
## Groups: Airline [4]
## 
##     Airline  Status   Mean Total.Flights
##       <chr>   <chr>  <dbl>         <int>
## 1 Southwest delayed   94.0           470
## 2 Allegiant delayed   95.4           477
## 3    ALASKA delayed  100.2           501
## 4   AM WEST delayed  157.4           787
## 5 Southwest on time  541.6          2708
## 6 Allegiant on time  545.6          2728
## 7    ALASKA on time  654.8          3274
## 8   AM WEST on time 1287.6          6438

As you can see here, this isn’t a very useful view of the data because the volume of flights varies per airline, so mean doesn’t mean anything. This data also isn’t tidy, since on time and delayed have separate rows in the same column. Overall, we can see that this data is pretty difficult to work with.

Let’s work with the tidy data frame going forward, TidyFlightDelays2.

TidyFlightDelays2 %>%
  group_by(Airline) %>% 
  mutate(Ratio = delayed/on.time) %>% 
  arrange(desc(Ratio))
## Source: local data frame [20 x 5]
## Groups: Airline [4]
## 
##      Airline          City delayed on.time      Ratio
##        <chr>         <chr>   <int>   <int>      <dbl>
## 1  Allegiant     San.Diego     167     352 0.47443182
## 2    AM WEST San.Francisco     129     320 0.40312500
## 3  Allegiant       Phoenix     150     425 0.35294118
## 4    AM WEST       Seattle      61     201 0.30348259
## 5  Southwest     San.Diego     115     395 0.29113924
## 6  Southwest San.Francisco     103     452 0.22787611
## 7     ALASKA San.Francisco     102     503 0.20278330
## 8  Southwest       Seattle      89     521 0.17082534
## 9    AM WEST     San.Diego      65     383 0.16971279
## 10   AM WEST   Los.Angeles     117     694 0.16858790
## 11    ALASKA       Seattle     305    1841 0.16567083
## 12 Allegiant   Los.Angeles     100     658 0.15197568
## 13    ALASKA   Los.Angeles      62     497 0.12474849
## 14 Southwest   Los.Angeles      86     700 0.12285714
## 15 Southwest       Phoenix      77     640 0.12031250
## 16    ALASKA     San.Diego      20     212 0.09433962
## 17   AM WEST       Phoenix     415    4840 0.08574380
## 18 Allegiant       Seattle      40     648 0.06172840
## 19    ALASKA       Phoenix      12     221 0.05429864
## 20 Allegiant San.Francisco      20     645 0.03100775

Count of delayed and on time flights isn’t a very useful view of the data because the volume of flights varies per airline, as we saw before. I thought a useful way to compare the data in an “apples to apples” fashion would be to compute the ratio of delayed to on time flights. This allows us to truly compare the airlines. Taking the city out of the picture will also help reach a final number per airline we can use for comparison. Let’s do that now.

TidyFlightDelays2 %>%
  group_by(Airline) %>% 
  summarise(Delayed=sum(delayed), On.Time=sum(on.time)) %>% 
  mutate(Ratio = Delayed/On.Time) %>% 
  arrange(desc(Ratio))
## # A tibble: 4 × 4
##     Airline Delayed On.Time     Ratio
##       <chr>   <int>   <int>     <dbl>
## 1 Allegiant     477    2728 0.1748534
## 2 Southwest     470    2708 0.1735598
## 3    ALASKA     501    3274 0.1530238
## 4   AM WEST     787    6438 0.1222429

With city factored in and not factored in, Allegiant still has the highest ratio of delayed flights to on time flights.

Flying out of San Diego on Allegiant is where you are most likely to encounter a flight delay. Interestingly, flying Allegiant out of San Francisco will be your best bet in terms of avoiding these delays.

Flying out of AM West is your best bet overall, regardless of city.

FlightDelays <- ggplot(TidyFlightDelays2, aes(x=TidyFlightDelays2$delayed, y=TidyFlightDelays2$on.time)) +
  geom_point()

FlightDelays <- FlightDelays + labs(list(
  title = "Delayed Flights vs. On Time Flights",
  x="Delayed Flights",
  y="On Time Flights"))
print(FlightDelays)