Question: Which day of the week (by name, not date) in May 2011 did American Airlines fly the fastest? And how did May compare to the rest of the year?
For our analysis of the hflights dataset, we’ll need to install and load the hflights package, then look at the first few rows to check the column headers.
install.packages("hflights")
require(hflights)
## Loading required package: hflights
head(hflights)
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011 1 1 6 1400 1500 AA
## 5425 2011 1 2 7 1401 1501 AA
## 5426 2011 1 3 1 1352 1502 AA
## 5427 2011 1 4 2 1403 1513 AA
## 5428 2011 1 5 3 1405 1507 AA
## 5429 2011 1 6 4 1359 1503 AA
## FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424 428 N576AA 60 40 -10 0 IAH
## 5425 428 N557AA 60 45 -9 1 IAH
## 5426 428 N541AA 70 48 -8 -8 IAH
## 5427 428 N403AA 70 39 3 3 IAH
## 5428 428 N492AA 62 44 -3 5 IAH
## 5429 428 N262AA 64 45 -7 -1 IAH
## Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424 DFW 224 7 13 0 0
## 5425 DFW 224 6 9 0 0
## 5426 DFW 224 5 17 0 0
## 5427 DFW 224 9 22 0 0
## 5428 DFW 224 9 9 0 0
## 5429 DFW 224 6 13 0 0
To answer our question we need only a few columns, so we’ll create a smaller data frame that keeps just Year, Month, DayofMonth, DayOfWeek, UniqueCarrier, and Distance.
# Select the six columns by name instead of binding positional slices together
houstonf <- hflights[, c("Year", "Month", "DayofMonth", "DayOfWeek", "UniqueCarrier", "Distance")]
head(houstonf)
## Year Month DayofMonth DayOfWeek UniqueCarrier Distance
## 5424 2011 1 1 6 AA 224
## 5425 2011 1 2 7 AA 224
## 5426 2011 1 3 1 AA 224
## 5427 2011 1 4 2 AA 224
## 5428 2011 1 5 3 AA 224
## 5429 2011 1 6 4 AA 224
Now we’ll add a new column, MilesperMin, that divides Distance by ActualElapsedTime. This can be thought of more commonly as “speed”, measured in miles per minute.
MilesperMin <- hflights$Distance / hflights$ActualElapsedTime
houstonf <- cbind(houstonf, MilesperMin)
head(houstonf)
## Year Month DayofMonth DayOfWeek UniqueCarrier Distance MilesperMin
## 5424 2011 1 1 6 AA 224 3.733333
## 5425 2011 1 2 7 AA 224 3.733333
## 5426 2011 1 3 1 AA 224 3.200000
## 5427 2011 1 4 2 AA 224 3.200000
## 5428 2011 1 5 3 AA 224 3.612903
## 5429 2011 1 6 4 AA 224 3.500000
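Miles per minute is an unusual unit, so converting to miles per hour can make the numbers easier to sanity-check. The sketch below reuses the distance and elapsed times from the six rows printed above; the names `distance_mi`, `elapsed_min`, and `mph` are illustrative only. Note that ActualElapsedTime includes taxi time (in the first row, 40 minutes of AirTime plus 7 and 13 minutes of taxiing gives 60), so this is a gate-to-gate speed rather than an airspeed.

```r
# Values copied from the first six rows shown above
distance_mi <- rep(224, 6)                   # Distance, in miles
elapsed_min <- c(60, 60, 70, 70, 62, 64)     # ActualElapsedTime, in minutes

miles_per_min <- distance_mi / elapsed_min   # same formula as MilesperMin
mph <- miles_per_min * 60                    # 60 minutes per hour

# Side-by-side view, rounded for readability
round(data.frame(elapsed_min, miles_per_min, mph), 2)
```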
Since our question only concerns American Airlines, we can filter out all other carriers. We also remove rows where Distance or MilesperMin is NA. MilesperMin is NA whenever ActualElapsedTime is missing, which likely means the flight never left the ground, and those rows would affect our analysis.
houstonf <- subset(houstonf, UniqueCarrier == "AA")
houstonf <- subset(houstonf, !is.na(Distance))
houstonf <- subset(houstonf, !is.na(MilesperMin))
head(houstonf)
## Year Month DayofMonth DayOfWeek UniqueCarrier Distance MilesperMin
## 5424 2011 1 1 6 AA 224 3.733333
## 5425 2011 1 2 7 AA 224 3.733333
## 5426 2011 1 3 1 AA 224 3.200000
## 5427 2011 1 4 2 AA 224 3.200000
## 5428 2011 1 5 3 AA 224 3.612903
## 5429 2011 1 6 4 AA 224 3.500000
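Before dropping rows, it can be worth counting how many would be lost. Here is a minimal sketch on a toy data frame; the rows, including the NA values, are invented for illustration and are not taken from hflights. `complete.cases()` is shown as one possible alternative to chaining several `subset()` calls.

```r
# Toy data: two of the five rows have a missing speed (hypothetical values)
toy <- data.frame(
  UniqueCarrier = c("AA", "AA", "AA", "CO", "AA"),
  Distance      = c(224, 224, 964, 224, 964),
  MilesperMin   = c(3.73, NA, 6.80, 3.50, NA)
)

# Count missing values per column before filtering
colSums(is.na(toy))

# Keep American Airlines rows with no NA in the two numeric columns
keep  <- toy$UniqueCarrier == "AA" &
  complete.cases(toy[, c("Distance", "MilesperMin")])
clean <- toy[keep, ]
nrow(clean)
```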
A simple way to get statistics on this data frame is to use the summary function.
summary(houstonf)
## Year Month DayofMonth DayOfWeek
## Min. :2011 Min. : 1.000 Min. : 1.00 Min. :1.000
## 1st Qu.:2011 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.:2.000
## Median :2011 Median : 7.000 Median :16.00 Median :4.000
## Mean :2011 Mean : 6.583 Mean :15.73 Mean :3.979
## 3rd Qu.:2011 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:6.000
## Max. :2011 Max. :12.000 Max. :31.00 Max. :7.000
## UniqueCarrier Distance MilesperMin
## Length:3178 Min. :224.0 Min. :1.185
## Class :character 1st Qu.:224.0 1st Qu.:3.294
## Mode :character Median :224.0 Median :3.672
## Mean :484.8 Mean :4.640
## 3rd Qu.:964.0 3rd Qu.:6.741
## Max. :964.0 Max. :8.169
But we’d rather present this information in graphs, which are more appealing for readers, especially if we have to show it off. Let’s load the ggplot2 package.
require(ggplot2)
## Loading required package: ggplot2
The summary shows that Distance takes very few distinct values (the minimum, first quartile, and median are all 224, while the third quartile and maximum are both 964), so a boxplot of Distance won’t tell us much. Instead, MilesperMin will stand in to show the speed of all American Airlines flights out of Houston in 2011.
I prefer the violin view to give a better sense of the densities.
# factor() treats the day as a category, so we get one violin per day of the week
ggplot(houstonf, aes(x = factor(DayOfWeek), y = MilesperMin)) + geom_violin()
But it doesn’t quite answer our initial question. Next we’ll try the density plot.
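The density plot call itself is not shown here. One plausible version, offered only as a sketch, draws one density curve of MilesperMin per day of the week. The toy data frame below stands in for houstonf, and its speeds are randomly generated rather than real.

```r
library(ggplot2)

# Stand-in for houstonf with made-up speeds (illustration only)
set.seed(1)
toy <- data.frame(
  DayOfWeek   = rep(1:7, each = 20),
  MilesperMin = runif(140, min = 3, max = 7)
)

# factor() makes ggplot2 treat the day as a category, giving one curve per day
p <- ggplot(toy, aes(x = MilesperMin, colour = factor(DayOfWeek))) +
  geom_density() +
  labs(colour = "DayOfWeek")
p
```

On the real data, houstonf would take the place of `toy`.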
Getting closer, but this still doesn’t answer the question. We need to filter for May 2011 only and plot the flights by day of the week. From the plot below, it is much easier to see that Day 5 (Friday) had the fastest flight, while Day 1 (Monday) had the slowest flight.
houstonfMay <- subset(houstonf, Month == 5)
ggplot(houstonfMay, aes(x = DayOfWeek, y = MilesperMin)) + geom_point(aes(color = DayofMonth))
Answer: Saturday (week day 6) had the fastest flight in May, though Tuesday had more flights (a higher density of flights) with faster speeds than any other day of the week.
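Because the question asks for a day name, a numeric check that maps DayOfWeek to a label can back up what is read off a plot. This sketch assumes DayOfWeek is coded 1 = Monday through 7 = Sunday, which is consistent with the first rows printed earlier (1 January 2011, a Saturday, is coded 6). The toy rows and the names `may`, `day_names`, and `fastest` are illustrative, so the day it prints says nothing about the real data.

```r
# Stand-in for houstonfMay; speeds are invented for illustration
may <- data.frame(
  DayOfWeek   = c(1, 2, 3, 3, 5, 6, 7),
  MilesperMin = c(3.2, 6.9, 7.4, 7.0, 7.1, 3.6, 6.5)
)

# Assumed coding: 1 = Monday ... 7 = Sunday
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")
may$DayName <- factor(may$DayOfWeek, levels = 1:7, labels = day_names)

# Day of the single fastest flight
fastest <- may[which.max(may$MilesperMin), ]
as.character(fastest$DayName)

# Median speed per day, to compare days as a whole rather than one flight
tapply(may$MilesperMin, may$DayName, median)
```

On the real data, houstonfMay would take the place of `may`.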
To get our second answer (comparing May’s speed to the rest of the year), ggplot2 once again comes in handy.
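# Refer to the column by name inside aes(); using houstonf$DayofMonth there can misalign colors across facets
ggplot(houstonf, aes(x = DayOfWeek, y = MilesperMin)) + geom_point(aes(color = DayofMonth)) + facet_wrap(~Month)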
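A faceted scatter plot gives a visual comparison, and a per-month summary table can complement it. The sketch below uses `tapply()` from base R on a toy data frame with invented speeds; the names `toy`, `by_month`, `may_median`, and `rest_median` are illustrative only.

```r
# Stand-in for houstonf with invented speeds for three months
toy <- data.frame(
  Month       = c(4, 4, 5, 5, 5, 6, 6),
  MilesperMin = c(3.5, 6.5, 3.8, 7.0, 6.6, 3.4, 6.9)
)

# Median and maximum speed per month (tapply returns groups in sorted order)
by_month <- data.frame(
  Month  = sort(unique(toy$Month)),
  median = as.vector(tapply(toy$MilesperMin, toy$Month, median)),
  max    = as.vector(tapply(toy$MilesperMin, toy$Month, max))
)
by_month

# How May's median compares with the median over all other months
may_median  <- median(toy$MilesperMin[toy$Month == 5])
rest_median <- median(toy$MilesperMin[toy$Month != 5])
may_median - rest_median
```

On the real data, houstonf would take the place of `toy`. The median is used here because a single very fast or very slow flight would otherwise dominate the comparison.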