This project is where you show off your ability to (1) use R packages, (2) change the shape of data in a data frame, and (3) provide basic summary statistics and graphics as part of your exploratory data analysis.
This dataset contains all flights departing from the Houston airports IAH (George Bush Intercontinental) and HOU (Houston Hobby) in 2011. The data comes from the Research and Innovative Technology Administration at the Bureau of Transportation Statistics.
What were the top 10 airport destinations?
What was their average departure delay?
How do the departure and arrival delays look graphically? Hypothesis: there is a direct relationship.
Which dates had the most flights? Hypothesis: related to holidays.
Which dates had the most cancellations? Hypothesis: related to weather.
library("ggplot2")
library("hflights")
library("plyr")
attach(hflights)
# Count flights per destination and keep the 10 most frequent
destination <- arrange(count(hflights, "Dest"), desc(freq))[1:10, ]
# Order the factor levels by frequency so the busiest destination is at the top of the y-axis
destination$Dest <- factor(destination$Dest, levels = destination$Dest[order(destination$freq)])
ggplot(destination, aes(x = freq, y = Dest)) +
  geom_point() +
  xlab("\n Total Number of Flights") +
  ylab("Airport Destination \n") +
  ggtitle("2011 Top 10 Destinations \n")
# Keep only flights that both departed and arrived late, so the averages describe delayed flights only
Delays <- subset(hflights, ArrDelay > 0 & DepDelay > 0, select = c(Dest, ArrDelay, DepDelay))
Delays <- na.omit(Delays)
# Mean departure delay per destination, restricted to the top 10 destinations
Delays_mean <- aggregate(DepDelay ~ Dest, Delays, mean)
Delays10 <- Delays_mean[Delays_mean$Dest %in% destination$Dest, ]
# Order the bars from the most to the least frequent destination
Delays10$Dest <- factor(Delays10$Dest, levels = destination$Dest[order(destination$freq, decreasing = TRUE)])
ggplot(Delays10, aes(x = Dest, y = DepDelay)) +
  geom_bar(stat = "identity") +
  xlab("\n Airport Destination") +
  ylab("Average Departure Delay in Minutes \n") +
  ggtitle("Average Departure Delay of the Top 10 Destinations \n")
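Because the averages above are computed only over flights with positive delays, they answer "how late is a late flight?" rather than "how late is a typical flight?". The sketch below illustrates the difference on a small made-up data frame (`toy` and its values are hypothetical, not taken from hflights), and for brevity it mirrors only the `DepDelay > 0` part of the filter.

```r
# Hypothetical departure delays in minutes (made up for illustration);
# negative = departed early, NA = no recorded departure.
toy <- data.frame(
  Dest     = c("DAL", "DAL", "DAL", "DAL", "ATL", "ATL", "ATL", "ATL"),
  DepDelay = c(-5, 0, 10, 30, -2, NA, 4, 20)
)

# Mean over every flight with a recorded delay
# (the formula method of aggregate() drops NA rows by default).
mean_all <- aggregate(DepDelay ~ Dest, toy, mean)

# Mean over delayed flights only, mirroring the DepDelay > 0 filter.
mean_delayed <- aggregate(DepDelay ~ Dest, subset(toy, DepDelay > 0), mean)

# Put both versions side by side, one row per destination.
comparison <- merge(mean_all, mean_delayed, by = "Dest",
                    suffixes = c("_all", "_delayed"))
print(comparison)
```

Running the same two `aggregate()` calls on `hflights` instead of `toy` would show how much the positive-delay filter raises each destination's average.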
DelaysArrDep <- Delays[Delays$Dest %in% destination$Dest, ]
# Order the facets from the most to the least frequent destination
DelaysArrDep$Dest <- factor(DelaysArrDep$Dest, levels = destination$Dest[order(destination$freq, decreasing = TRUE)])
ggplot(DelaysArrDep, aes(DepDelay, ArrDelay)) +
  geom_point() +
  facet_wrap(~Dest) +
  xlab("\n Departure Delay in Minutes") +
  ylab("Arrival Delay in Minutes \n") +
  ggtitle("Relationship Between Departure and Arrival Delays")
Hypothesis supported: as expected, departure and arrival delays appear to be directly (positively) related. Note that only flights with both delays greater than zero are plotted.
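A scatter plot shows the direction of the relationship, and a correlation coefficient per destination would put a number on it. The following is a minimal sketch on made-up data (`toy` and its values are hypothetical stand-ins for `DelaysArrDep`), using the base R functions `split()`, `sapply()`, and `cor()`.

```r
# Hypothetical delays in minutes (made up), shaped like DelaysArrDep.
toy <- data.frame(
  Dest     = c("DAL", "DAL", "DAL", "DAL", "ATL", "ATL", "ATL", "ATL"),
  DepDelay = c(5, 20, 45, 90, 3, 15, 30, 120),
  ArrDelay = c(2, 18, 50, 85, 10, 12, 41, 110)
)

# Split the rows by destination and compute the Pearson correlation
# between departure and arrival delay within each group.
cor_by_dest <- sapply(
  split(toy, toy$Dest),
  function(d) cor(d$DepDelay, d$ArrDelay)
)
print(round(cor_by_dest, 3))
```

Replacing `toy` with `DelaysArrDep` would give one coefficient per top-10 destination; values close to 1 would back up the visual impression from the facets.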
# Count flights per calendar date (Month, DayofMonth)
TallyDate <- aggregate(cbind(count = FlightNum) ~ Month + DayofMonth,
                       data = hflights,
                       FUN = length)
# Sort so the busiest dates come first
BusiestDate <- TallyDate[order(TallyDate$count, decreasing = TRUE), ]
head(BusiestDate)
##     Month DayofMonth count
## 44      8          4   706
## 128     8         11   706
## 140     8         12   706
## 56      8          5   705
## 32      8          3   704
## 116     8         10   704
Hypothesis not supported: the six busiest dates all fall in early August, so the peaks appear to be related to summer travel rather than to holidays.
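To compare the busiest dates against a calendar, it helps to turn the `Month` and `DayofMonth` columns into proper `Date` values. The sketch below does this for the six rows printed above, assuming the year 2011 as stated in the plot title; the data frame name `busiest` is introduced here only for illustration.

```r
# Month/day pairs and counts copied from the head(BusiestDate) output;
# the year 2011 is an assumption taken from the plot title.
busiest <- data.frame(
  Month      = c(8, 8, 8, 8, 8, 8),
  DayofMonth = c(4, 11, 12, 5, 3, 10),
  count      = c(706, 706, 706, 705, 704, 704)
)

# Build Date objects from the numeric month and day columns.
busiest$Date <- as.Date(sprintf("2011-%02d-%02d", busiest$Month, busiest$DayofMonth))

# POSIXlt$wday does not depend on the locale: 0 = Sunday, ..., 6 = Saturday.
busiest$wday <- as.POSIXlt(busiest$Date)$wday

print(busiest)
```

All six dates are Wednesdays, Thursdays, or Fridays in the first half of August, which fits a steady summer schedule better than a holiday peak.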
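# Keep cancelled flights only (Cancelled is 1 for cancelled flights, 0 otherwise)
flightsdf <- subset(hflights, Cancelled > 0, select = c(Month, DayofMonth, DayOfWeek, FlightNum))
# Count cancellations per calendar date
MostcancelledDate <- aggregate(cbind(count = FlightNum) ~ Month + DayofMonth,
                               data = flightsdf,
                               FUN = length)
# Sort so the dates with the most cancellations come first
CancelledDate <- MostcancelledDate[order(MostcancelledDate$count, decreasing = TRUE), ]
head(CancelledDate)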
##     Month DayofMonth count
## 37      2          4   388
## 25      2          3   256
## 2       2          1   152
## 14      2          2    76
## 87      2          9    74
## 121     5         12    67
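Hypothesis plausible but not confirmed: five of the six dates with the most cancellations fall in early February, which is consistent with winter weather, but these counts alone do not show why the flights were cancelled.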
detach(hflights)
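The weather hypothesis could be tested more directly with the `CancellationCode` column of hflights, which records the reason for each cancellation (A = carrier, B = weather, C = national air system, D = security). The sketch below shows the idea on a few made-up rows; `toy_cancelled` and its values are hypothetical and only illustrate the shape of the computation.

```r
# Hypothetical cancelled flights (values made up), shaped like hflights rows.
toy_cancelled <- data.frame(
  Month            = c(2, 2, 2, 2, 5, 5),
  DayofMonth       = c(4, 4, 4, 3, 12, 12),
  CancellationCode = c("B", "B", "A", "B", "A", "C")
)

# Lookup from cancellation code to a readable reason.
reasons <- c(A = "carrier", B = "weather",
             C = "national air system", D = "security")
toy_cancelled$Reason <- unname(reasons[toy_cancelled$CancellationCode])

# Cross-tabulate date against reason to see what drives each date's count.
toy_cancelled$Date <- sprintf("%02d-%02d", toy_cancelled$Month, toy_cancelled$DayofMonth)
reason_by_date <- table(toy_cancelled$Date, toy_cancelled$Reason)
print(reason_by_date)
```

Applied to the cancelled rows of `hflights`, a table like this would show whether the early February dates are dominated by weather cancellations.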