Description of the Assignment

This project is where you show off your ability to (1) use R packages, (2) change the shape of data in a data frame, and (3) provide basic summary statistics and graphics as part of your exploratory data analysis.

Brief description of the assigned dataset

This dataset contains all flights departing from Houston airports IAH (George Bush Intercontinental) and HOU (Houston Hobby) in 2011. The data comes from the Research and Innovative Technology Administration at the Bureau of Transportation Statistics.

Exploratory Data Analysis Questions

  1. TOP 10
    • What were the top 10 airport destinations?

    • What was their average departure delay?

    • How did the departure and arrival delays look graphically? Hypothesis: there is a direct relationship

  2. Extremes
    • What were the dates with the most flights? Hypothesis: related to holidays

    • What were the dates with the most cancellations? Hypothesis: related to weather

Load the libraries to be used and attach hflights

library("ggplot2")
library("hflights")
library("plyr")

attach(hflights)

Find and Plot the Top 10 destinations

# count flights per destination from the data frame itself, then keep the 10 largest
destination <- arrange(count(hflights, "Dest"), desc(freq))[1:10, ]

# order the factor levels by frequency so the y-axis lists the Top 10 in descending order
destination$Dest <- factor(destination$Dest, levels = destination$Dest[order(destination$freq)])

ggplot(destination, aes(x = freq, y = Dest)) +
  geom_point() +
  xlab("\n Total Number of Flights") +
  ylab("Airport Destination \n") +
  ggtitle("2011 Top 10 Destinations \n")
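The same kind of tally can be sketched in base R without plyr. The block below is a minimal illustration on a made-up vector of destination codes (the values and counts are hypothetical, not taken from hflights), assuming only that destinations arrive as a character vector like `hflights$Dest`.

```r
# Toy stand-in for hflights$Dest (hypothetical values, not real counts)
dest <- c("DFW", "MSY", "DFW", "ATL", "DFW", "MSY", "LAX")

# table() tallies each code; sort() puts the largest tallies first
top <- head(sort(table(dest), decreasing = TRUE), 2)

# Reshape into a data frame with the same columns plyr::count() returns
top_df <- data.frame(Dest = names(top), freq = as.vector(top))
print(top_df)
```

Replacing the toy vector with `hflights$Dest` and the `2` with `10` should give a table comparable to `destination` above, although the order of tied counts may differ.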

Find and Plot the average departure delay of the Top 10 destinations

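# keep only flights that were late on both departure and arrival
Delays <- subset(hflights, ArrDelay > 0 & DepDelay > 0, select = c(Dest, ArrDelay, DepDelay))
Delays <- na.omit(Delays)
# mean departure delay per destination, then restrict to the Top 10
Delays_mean <- aggregate(DepDelay ~ Dest, Delays, mean)
Delays10 <- Delays_mean[Delays_mean$Dest %in% destination$Dest, ]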

# maintains the Top 10 list descending
Delays10$Dest <- factor(Delays10$Dest, levels = destination$Dest[order(destination$freq, decreasing = TRUE)])

ggplot(Delays10, aes(x = Dest, y = DepDelay)) +
  geom_bar(stat = "identity") +
  xlab("\n Airport Destination") +
  ylab("Average Delay in Minutes \n") +
  ggtitle("Average Delay of the Top 10 Destinations \n")
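Because the averages above are computed only over flights with ArrDelay > 0 and DepDelay > 0, they describe delayed flights rather than all flights. The sketch below shows how much that filter can change the mean, using a small made-up data frame (all values are hypothetical) with the same column names as hflights.

```r
# Hypothetical flights: negative values mean early departures or arrivals
flights <- data.frame(
  Dest     = c("DFW", "DFW", "DFW", "MSY", "MSY", "MSY"),
  DepDelay = c(30, -5, NA, 10, -10, 20),
  ArrDelay = c(25, -8, NA, 5, -15, 30)
)

# Mean over every flight with a recorded departure delay
# (the formula interface of aggregate() drops NA rows by default)
mean_all <- aggregate(DepDelay ~ Dest, data = flights, FUN = mean)

# Mean over flights late on both ends, mirroring the subset() call above
late <- subset(flights, ArrDelay > 0 & DepDelay > 0)
mean_late <- aggregate(DepDelay ~ Dest, data = late, FUN = mean)

# Side-by-side comparison, one row per destination
comparison <- merge(mean_all, mean_late, by = "Dest",
                    suffixes = c("_all", "_late"))
print(comparison)
```

In this toy case the filtered mean is more than twice the unfiltered one for both destinations, so the choice of filter is worth stating alongside the bar chart.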

Plot the relationship between Departure and Arrival delays of the Top 10 Destinations

DelaysArrDep <- Delays[Delays$Dest %in% destination$Dest, ]

# maintains the Top 10 list descending
DelaysArrDep$Dest <- factor(DelaysArrDep$Dest, levels = destination$Dest[order(destination$freq, decreasing = TRUE)])

ggplot(DelaysArrDep, aes(DepDelay,ArrDelay)) + 
  geom_point() +
  facet_wrap(~Dest) +
  xlab("\n Departure Delay in minutes") +
  ylab("Arrival Delay in minutes \n") +
  ggtitle("Relationship Between Departure and Arrival delays")

Hypothesis Verified: As expected, there appears to be a direct relationship between the two delays.
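A scatter plot suggests the relationship visually; a correlation coefficient per destination is one way to put a number on it. The following is a minimal sketch on made-up delays (hypothetical values), assuming a data frame shaped like `DelaysArrDep` with columns Dest, DepDelay, and ArrDelay.

```r
# Hypothetical delays in minutes for two destinations
delays <- data.frame(
  Dest     = rep(c("DFW", "MSY"), each = 4),
  DepDelay = c(5, 10, 20, 40, 3, 8, 15, 30),
  ArrDelay = c(7, 12, 18, 45, 1, 10, 12, 33)
)

# Pearson correlation between the two delays within each destination
cor_by_dest <- sapply(split(delays, delays$Dest), function(d) {
  cor(d$DepDelay, d$ArrDelay)
})
print(round(cor_by_dest, 3))
```

Values close to 1 would be consistent with the direct relationship seen in the facets, though correlation alone does not say how many minutes of arrival delay follow from each minute of departure delay.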

Find the dates with the most flights, listed in descending order

TallyDate <- aggregate(cbind(count = FlightNum) ~ Month + DayofMonth,
                       data = hflights,
                       FUN = length)
BusiestDate <- TallyDate[order(TallyDate$count,decreasing = TRUE),]
head(BusiestDate)
##     Month DayofMonth count
## 44      8          4   706
## 128     8         11   706
## 140     8         12   706
## 56      8          5   705
## 32      8          3   704
## 116     8         10   704

Hypothesis Not Verified: The busiest dates all fall in the first half of August, which points to summer travel rather than holidays.
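To compare busy dates against holidays more directly, Month and DayofMonth can be combined into a real Date and matched against a holiday list. The sketch below uses a made-up tally (the counts are hypothetical) and a short, hand-typed list of 2011 US holidays; both are assumptions for illustration only.

```r
# Hypothetical tally in the same shape as BusiestDate (counts are made up)
tally <- data.frame(
  Month      = c(8L, 8L, 7L, 11L),
  DayofMonth = c(4L, 11L, 4L, 24L),
  count      = c(700L, 690L, 500L, 480L)
)

# Build a real Date; 2011 is the year covered by hflights
tally$Date <- as.Date(sprintf("2011-%02d-%02d", tally$Month, tally$DayofMonth))

# Day of week as a number (0 = Sunday), which does not depend on locale
tally$Wday <- as.POSIXlt(tally$Date)$wday

# A few 2011 US holidays, listed by hand for illustration
holidays <- as.Date(c("2011-07-04", "2011-11-24", "2011-12-25"))
tally$Holiday <- tally$Date %in% holidays
print(tally)
```

Applied to the real `BusiestDate` table, a Holiday column that is FALSE for the top rows would indicate that the busiest dates are not holidays.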

Find the dates with the most cancellations, listed in descending order

flightsdf <- subset(hflights, Cancelled > 0,select=c(Month, DayofMonth, DayOfWeek, FlightNum))
MostcancelledDate <- aggregate(cbind(count = FlightNum) ~ Month + DayofMonth,
                               data = flightsdf,
                               FUN = length)
CancelledDate <- MostcancelledDate[order(MostcancelledDate$count,decreasing = TRUE),]
head(CancelledDate)
##     Month DayofMonth count
## 37      2          4   388
## 25      2          3   256
## 2       2          1   152
## 14      2          2    76
## 87      2          9    74
## 121     5         12    67

Hypothesis Supported: The five dates with the most cancellations all fall in early February, which is consistent with winter weather.
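The counts by date only hint at weather. The hflights documentation also describes a CancellationCode column (A = carrier, B = weather, C = national air system, D = security), which allows a more direct check. The sketch below cross-tabulates date against reason on a made-up set of cancelled flights; the rows are hypothetical and only the column names follow hflights.

```r
# Hypothetical cancelled flights; codes follow the hflights documentation:
# A = carrier, B = weather, C = national air system, D = security
cancelled <- data.frame(
  Month            = c(2L, 2L, 2L, 2L, 5L, 5L),
  DayofMonth       = c(4L, 4L, 4L, 3L, 12L, 12L),
  CancellationCode = c("B", "B", "A", "B", "A", "B")
)

# Cross-tabulate date (as MM-DD) against cancellation reason
cancelled$Date <- sprintf("%02d-%02d", cancelled$Month, cancelled$DayofMonth)
by_reason <- table(cancelled$Date, cancelled$CancellationCode)
print(by_reason)

# Share of each date's cancellations that were weather-related
weather_share <- by_reason[, "B"] / rowSums(by_reason)
print(round(weather_share, 2))
```

On the real data, a high weather share on the early February dates would back the hypothesis more firmly than the seasonal pattern alone.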

detach(hflights)