Question: On which day of the week (by name, not date) in May 2011 did American Airlines fly the fastest? And how did May compare to the rest of the year?

For our analysis of the hflights dataset, we’ll need to install the hflights package and look at the first few rows to see which columns are available.

install.packages("hflights")
require(hflights)
## Loading required package: hflights
head(hflights)
##      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011     1          1         6    1400    1500            AA
## 5425 2011     1          2         7    1401    1501            AA
## 5426 2011     1          3         1    1352    1502            AA
## 5427 2011     1          4         2    1403    1513            AA
## 5428 2011     1          5         3    1405    1507            AA
## 5429 2011     1          6         4    1359    1503            AA
##      FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424       428  N576AA                60      40      -10        0    IAH
## 5425       428  N557AA                60      45       -9        1    IAH
## 5426       428  N541AA                70      48       -8       -8    IAH
## 5427       428  N403AA                70      39        3        3    IAH
## 5428       428  N492AA                62      44       -3        5    IAH
## 5429       428  N262AA                64      45       -7       -1    IAH
##      Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424  DFW      224      7      13         0                         0
## 5425  DFW      224      6       9         0                         0
## 5426  DFW      224      5      17         0                         0
## 5427  DFW      224      9      22         0                         0
## 5428  DFW      224      9       9         0                         0
## 5429  DFW      224      6      13         0                         0

To answer our question, we’ll only need a few columns, so we’ll create a new data frame with just Year, Month, DayofMonth, DayOfWeek, UniqueCarrier and Distance.

# Columns 1-4 are the date fields, 7 is UniqueCarrier, 16 is Distance
houstonf <- hflights[, c(1:4, 7, 16)]
head(houstonf)
##      Year Month DayofMonth DayOfWeek UniqueCarrier Distance
## 5424 2011     1          1         6            AA      224
## 5425 2011     1          2         7            AA      224
## 5426 2011     1          3         1            AA      224
## 5427 2011     1          4         2            AA      224
## 5428 2011     1          5         3            AA      224
## 5429 2011     1          6         4            AA      224
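
Selecting by position works, but it silently picks the wrong columns if the column order ever changes. Below is a minimal sketch of the same selection done by name. It runs on a small made-up data frame so that it stands alone: `toy`, `keep`, `toy_small` and all the values are hypothetical, and only the column names come from the output above.

```r
# Hypothetical stand-in for hflights: made-up values, real column names.
toy <- data.frame(
  Year = 2011, Month = c(1, 5), DayofMonth = c(1, 2), DayOfWeek = c(6, 1),
  DepTime = c(1400, 1401), ArrTime = c(1500, 1501),
  UniqueCarrier = c("AA", "CO"), ActualElapsedTime = c(60, 70),
  Distance = c(224, 224), stringsAsFactors = FALSE
)

# Keep only the columns needed for the analysis, selected by name.
keep <- c("Year", "Month", "DayofMonth", "DayOfWeek", "UniqueCarrier", "Distance")
toy_small <- toy[, keep]
names(toy_small)
```

With the real data, the same idea would be `hflights[, keep]`.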

Now we’ll add a new MilesperMin column, computed as Distance divided by ActualElapsedTime. This can be thought of more commonly as “speed”.

MilesperMin <- hflights$Distance / hflights$ActualElapsedTime
houstonf <- cbind(houstonf, MilesperMin)
head(houstonf)
##      Year Month DayofMonth DayOfWeek UniqueCarrier Distance MilesperMin
## 5424 2011     1          1         6            AA      224    3.733333
## 5425 2011     1          2         7            AA      224    3.733333
## 5426 2011     1          3         1            AA      224    3.200000
## 5427 2011     1          4         2            AA      224    3.200000
## 5428 2011     1          5         3            AA      224    3.612903
## 5429 2011     1          6         4            AA      224    3.500000
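
As a quick check of the first row above: 224 miles in 60 minutes of elapsed time gives the 3.733333 shown, which is 224 miles per hour. ActualElapsedTime is gate-to-gate time (in that row, 40 minutes of AirTime plus 7 of TaxiIn and 13 of TaxiOut). So MilesperMin is an average gate-to-gate speed rather than airspeed. The variable names below are only for illustration.

```r
# First row: 224 miles covered in 60 minutes of elapsed (gate-to-gate) time.
miles_per_min <- 224 / 60
miles_per_hour <- miles_per_min * 60   # convert miles per minute to miles per hour

round(miles_per_min, 6)   # 3.733333, matching the MilesperMin column
miles_per_hour
```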

Since our question only deals with American Airlines, we can filter out all other carriers. We also remove rows with NA in the Distance or MilesperMin fields. An NA in MilesperMin means ActualElapsedTime was missing, so these are likely flights that never left the ground and would distort our analysis.

houstonf <- subset(houstonf, UniqueCarrier == "AA")
houstonf <- subset(houstonf, !is.na(Distance))
houstonf <- subset(houstonf, !is.na(MilesperMin))
head(houstonf)
##      Year Month DayofMonth DayOfWeek UniqueCarrier Distance MilesperMin
## 5424 2011     1          1         6            AA      224    3.733333
## 5425 2011     1          2         7            AA      224    3.733333
## 5426 2011     1          3         1            AA      224    3.200000
## 5427 2011     1          4         2            AA      224    3.200000
## 5428 2011     1          5         3            AA      224    3.612903
## 5429 2011     1          6         4            AA      224    3.500000
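
The three filters can also be combined into a single `subset()` call. Here is a minimal sketch on made-up data, where `toy` and `toy_aa` are hypothetical names and the NA stands in for a flight with no recorded elapsed time:

```r
# Hypothetical data: one non-AA flight and one AA flight with a missing elapsed time.
toy <- data.frame(
  UniqueCarrier = c("AA", "AA", "CO", "AA"),
  Distance = c(224, 964, 224, 224),
  ActualElapsedTime = c(60, NA, 70, 64),
  stringsAsFactors = FALSE
)
toy$MilesperMin <- toy$Distance / toy$ActualElapsedTime

# Carrier filter and both NA checks in one condition.
toy_aa <- subset(toy, UniqueCarrier == "AA" & !is.na(Distance) & !is.na(MilesperMin))
nrow(toy_aa)
```

Keeping the filters as separate steps, as above, makes it easier to see how many rows each one removes. The combined form is just more compact.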

A simple way to get statistics on this data frame is to use the summary function.

summary(houstonf)
##       Year          Month          DayofMonth      DayOfWeek    
##  Min.   :2011   Min.   : 1.000   Min.   : 1.00   Min.   :1.000  
##  1st Qu.:2011   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2.000  
##  Median :2011   Median : 7.000   Median :16.00   Median :4.000  
##  Mean   :2011   Mean   : 6.583   Mean   :15.73   Mean   :3.979  
##  3rd Qu.:2011   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:6.000  
##  Max.   :2011   Max.   :12.000   Max.   :31.00   Max.   :7.000  
##  UniqueCarrier         Distance      MilesperMin   
##  Length:3178        Min.   :224.0   Min.   :1.185  
##  Class :character   1st Qu.:224.0   1st Qu.:3.294  
##  Mode  :character   Median :224.0   Median :3.672  
##                     Mean   :484.8   Mean   :4.640  
##                     3rd Qu.:964.0   3rd Qu.:6.741  
##                     Max.   :964.0   Max.   :8.169

But we actually want to present this information using graphs, which are more appealing for users, especially if we have to show it off. Let’s load the ggplot2 package.

require(ggplot2)
## Loading required package: ggplot2

The summary shows that Distance is concentrated at just two values, 224 and 964 miles, so a boxplot of Distance would not tell us much. Instead, MilesperMin will stand in to show the speed of all American Airlines flights out of Houston in 2011.

I prefer the violin view to give a better sense of the densities.

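# factor() makes DayOfWeek discrete so that each weekday gets its own violin
ggplot(houstonf, aes(x = factor(DayOfWeek), y = MilesperMin)) + geom_violin() + labs(x = "DayOfWeek")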

But it doesn’t quite answer our initial question. Next we’ll try the density plot.
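
The code for the density plot is not shown here. The following is only a sketch of what such a plot could look like, assuming one density curve per weekday (that grouping is an assumption). It uses made-up speeds in a hypothetical `toy` data frame; in the analysis, the data would be `houstonf`.

```r
library(ggplot2)

# Hypothetical data with made-up speeds; in the analysis this would be houstonf.
set.seed(1)
toy <- data.frame(
  DayOfWeek = rep(1:7, each = 20),
  MilesperMin = runif(140, min = 3, max = 8)
)

# One density curve per weekday; factor() makes DayOfWeek a discrete grouping.
p <- ggplot(toy, aes(x = MilesperMin, color = factor(DayOfWeek))) +
  geom_density() +
  labs(color = "DayOfWeek")
p
```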

Getting closer, but still not quite answering the question. We’ll need to filter for May 2011 only and show the plot by day of the week. From this plot we can see much more easily that Day 5 (Friday) had the fastest flight, while Day 1 (Monday) had the slowest flight.

houstonfMay <- subset(houstonf, Month == 5)
ggplot(houstonfMay, aes(x = DayOfWeek, y = MilesperMin)) + geom_point(aes(color = DayofMonth))

Answer: Saturday (week day 6) had the fastest flight in May, though Tuesday had more flights (a higher density of flights) with faster speeds than any other day of the week.
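
Reading the fastest day off a scatter plot is easy to get wrong, so it can help to confirm it numerically. Here is a minimal sketch on made-up data: `toy_may`, `day_names` and all the speeds are hypothetical, and with the real data `toy_may` would be `houstonfMay`. The weekday coding (1 = Monday through 7 = Sunday) follows from the raw data above, where 1 January 2011, a Saturday, has DayOfWeek 6.

```r
# Hypothetical May flights with made-up speeds; in the analysis this would be houstonfMay.
toy_may <- data.frame(
  DayOfWeek = c(1, 2, 2, 5, 6, 7),
  MilesperMin = c(3.2, 6.9, 7.1, 7.4, 8.0, 3.6)
)

# DayOfWeek is coded 1 = Monday through 7 = Sunday.
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

# Weekday of the single fastest flight.
fastest_day <- day_names[toy_may$DayOfWeek[which.max(toy_may$MilesperMin)]]

# Mean speed per weekday, to compare typical rather than extreme speeds.
mean_by_day <- aggregate(MilesperMin ~ DayOfWeek, data = toy_may, FUN = mean)

fastest_day
mean_by_day
```

The first result answers "which day had the single fastest flight". The per-day means answer the slightly different question of which day was fastest on average.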

To get our second answer (comparing May’s speed to the rest of the year), ggplot2 once again comes in handy.

# Refer to columns by bare name inside aes(); using houstonf$... there can break faceting
ggplot(houstonf, aes(x = DayOfWeek, y = MilesperMin)) + geom_point(aes(color = DayofMonth)) + facet_wrap(~Month)
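
The faceted plot gives a visual comparison. A numeric summary per month can back it up. Here is a minimal sketch on made-up data: `toy`, `monthly`, `may_mean`, `rest_mean` and all the values are hypothetical, and with the real data `toy` would be `houstonf`.

```r
# Hypothetical flights from three months with made-up speeds.
toy <- data.frame(
  Month = c(4, 4, 5, 5, 6, 6),
  MilesperMin = c(3.5, 4.5, 4.0, 6.0, 3.0, 5.0)
)

# Mean speed for each month.
monthly <- aggregate(MilesperMin ~ Month, data = toy, FUN = mean)

# May's mean against the mean of all flights outside May.
may_mean <- monthly$MilesperMin[monthly$Month == 5]
rest_mean <- mean(toy$MilesperMin[toy$Month != 5])

monthly
may_mean - rest_mean
```

A positive difference would mean that May flights were faster on average than flights in the rest of the year.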