Setup:

Here I am installing the necessary packages and loading the required libraries.

# Load standard libraries
#install.packages("tidyverse")
#install.packages("nycflights13")
#install.packages("ggplot2")
#install.packages("dplyr")
library(tidyverse)
library(nycflights13)
library(ggplot2)
library(dplyr)

Problem 1: Exploring the NYC Flights Data

In this problem I am using the data on all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.

(a) Importing and Inspecting Data:
  • Loading the data
  • What each variable represents
  • Performing a basic inspection of the data

  • This data was collected by the Bureau of Transportation Statistics and covers all the flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.

dat_original <- as.data.frame(nycflights13::flights, row.names = NULL) #storing nycflights data into a data frame

head(dat_original) #getting a glimpse of the first few rows of the data frame
tail(dat_original)
# ?nycflights13::flights #to know the meaning of each column of the data frame
str(dat_original) #getting info of all the variables in the data frame
## 'data.frame':    336776 obs. of  19 variables:
##  $ year          : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int  515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int  819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr  "UA" "UA" "AA" "B6" ...
##  $ flight        : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num  1400 1416 1089 1576 762 ...
##  $ hour          : num  5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num  15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
cc <- complete.cases(dat_original) #TRUE for every row that has no missing values
cat("\nNumber of complete cases in the dataset ", sum(cc))
## 
## Number of complete cases in the dataset  327346
cat("\nPercentage of incomplete cases as compared to the total number of rows = ", mean(!cc)*100)
## 
## Percentage of incomplete cases as compared to the total number of rows =  2.800081
#Getting rid of incomplete cases
dat_original <- drop_na(dat_original)
dat <- dat_original
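
Before dropping rows, it can help to see which columns the missing values come from. The following is a minimal sketch on a made-up data frame (flights_toy and its values are hypothetical, not taken from the real data), assuming the same idea carries over to dat_original:

```r
# Toy stand-in for the flights table (hypothetical values, not real data)
flights_toy <- data.frame(
  dep_delay = c(2, NA, -1, 4, NA),
  arr_delay = c(11, NA, -18, NA, NA),
  carrier   = c("UA", "UA", "AA", "B6", "AA")
)

# Number of missing values in each column
na_per_column <- colSums(is.na(flights_toy))
print(na_per_column)

# Percentage of rows with at least one missing value
pct_incomplete <- mean(!complete.cases(flights_toy)) * 100
print(pct_incomplete)
```

On the real data, the same two calls applied to dat_original (before drop_na) would show whether the incomplete cases are concentrated in the delay and air-time columns.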
(b) Formulating Questions:

Using this data, I’d like to explore the answers to the following questions:

  • Which carrier has the worst operational performance? Conversely, which carrier has the best?
  • Which month in 2013 was the busiest?

(c) Exploring Data:

For each of the questions proposed above, I am performing an exploratory data analysis designed to address the question.

  • For the first question, we should try to find out the departure delay times and the arrival delay times for each of the carriers.
dat$carrier <- as.factor(dat$carrier) # the 'carrier' column is a character type. To get carrier-specific data, we change the carrier data type to factor

dat <- merge(dat, as.data.frame(nycflights13::airlines), by="carrier") #getting the name of the carriers from the 'airlines' data in the nycflights13 package

dat$name <- as.factor(dat$name) #converting the carrier name variable into a factor variable like we did for the carrier variable
  • Here we see that there are 16 levels in “carrier”, that is to say, there are 16 carriers. Now we can safely carry out the carrier-wise analyses of the flights.
str(dat)
## 'data.frame':    327346 obs. of  20 variables:
##  $ carrier       : Factor w/ 16 levels "9E","AA","AS",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year          : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int  12 10 8 10 7 8 12 1 4 1 ...
##  $ day           : int  2 22 20 6 4 8 11 28 7 5 ...
##  $ dep_time      : int  1556 840 2054 951 1505 2004 2114 1744 1852 1232 ...
##  $ sched_dep_time: int  1600 832 2035 1000 1459 1935 1900 1659 1900 815 ...
##  $ dep_delay     : num  -4 8 19 -9 6 29 134 45 -8 257 ...
##  $ arr_time      : int  1723 945 2156 1202 1753 2239 2323 2100 2118 1405 ...
##  $ sched_arr_time: int  1751 1000 2218 1234 1801 2142 2125 2046 2122 957 ...
##  $ arr_delay     : num  -28 -15 -22 -32 -8 57 118 14 -4 248 ...
##  $ flight        : int  3357 3492 4127 3574 3325 3443 4033 3375 3439 3521 ...
##  $ tailnum       : chr  "N928XJ" "N935XJ" "N8733G" "N908XJ" ...
##  $ origin        : chr  "LGA" "JFK" "JFK" "LGA" ...
##  $ dest          : chr  "BNA" "DCA" "IAD" "IND" ...
##  $ air_time      : num  111 45 41 95 192 66 113 223 99 133 ...
##  $ distance      : num  764 213 228 660 1391 ...
##  $ hour          : num  16 8 20 10 14 19 19 16 19 8 ...
##  $ minute        : num  0 32 35 0 59 35 0 59 0 15 ...
##  $ time_hour     : POSIXct, format: "2013-12-02 16:00:00" "2013-10-22 08:00:00" ...
##  $ name          : Factor w/ 16 levels "AirTran Airways Corporation",..: 5 5 5 5 5 5 5 5 5 5 ...
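
A note on the merge step: by default merge() keeps only rows whose key appears in both tables and sorts the result by the key, which is why the rows above now start with carrier "9E". The sketch below uses two made-up tables (flights_toy and airlines_toy are hypothetical, including the unmatched code "ZZ") to show one way of checking that no flights are lost in the join:

```r
# Hypothetical toy tables; "ZZ" deliberately has no match in airlines_toy
flights_toy <- data.frame(
  carrier   = c("UA", "AA", "UA", "ZZ"),
  dep_delay = c(2, 4, -1, 7),
  stringsAsFactors = FALSE
)
airlines_toy <- data.frame(
  carrier = c("AA", "UA"),
  name    = c("American Airlines Inc.", "United Air Lines Inc."),
  stringsAsFactors = FALSE
)

# Default merge: only carriers present in both tables survive
inner <- merge(flights_toy, airlines_toy, by = "carrier")

# all.x = TRUE keeps every flight; unmatched names become NA
left <- merge(flights_toy, airlines_toy, by = "carrier", all.x = TRUE)

# Carrier codes that found no name in the lookup table
unmatched <- left$carrier[is.na(left$name)]

print(nrow(inner))
print(nrow(left))
print(unmatched)
```

In the analysis above the row count stays at 327346 after the merge, which suggests every carrier code found a match.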
cat("\nSummary of the departure delays of each of the carriers:\n")
## 
## Summary of the departure delays of each of the carriers:
tapply(dat$dep_delay, dat$name, function(x) summary(x))
## $`AirTran Airways Corporation`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -22.00   -4.00    1.00   18.61   17.00  602.00 
## 
## $`Alaska Airlines Inc.`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -21.000  -7.000  -3.000   5.831   3.000 225.000 
## 
## $`American Airlines Inc.`
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  -24.000   -6.000   -3.000    8.569    4.000 1014.000 
## 
## $`Delta Air Lines Inc.`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -33.000  -5.000  -2.000   9.224   5.000 960.000 
## 
## $`Endeavor Air Inc.`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -24.00   -6.00   -2.00   16.44   16.00  747.00 
## 
## $`Envoy Air`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -26.00   -7.00   -3.00   10.45    9.00 1137.00 
## 
## $`ExpressJet Airlines Inc.`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -32.00   -5.00   -1.00   19.84   25.00  548.00 
## 
## $`Frontier Airlines Inc.`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -27.0    -4.0     0.0    20.2    18.0   853.0 
## 
## $`Hawaiian Airlines Inc.`
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  -16.000   -7.000   -4.000    4.901   -1.000 1301.000 
## 
## $`JetBlue Airways`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -43.00   -5.00   -1.00   12.97   12.00  502.00 
## 
## $`Mesa Airlines Inc.`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -16.00   -7.00   -2.00   18.90   22.25  387.00 
## 
## $`SkyWest Airlines Inc.`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -14.00   -9.00   -6.00   12.59    4.00  154.00 
## 
## $`Southwest Airlines Co.`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -13.00   -2.00    1.00   17.66   17.00  471.00 
## 
## $`United Air Lines Inc.`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -20.00   -4.00    0.00   12.02   11.00  483.00 
## 
## $`US Airways Inc.`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -19.000  -7.000  -4.000   3.745   0.000 500.000 
## 
## $`Virgin America`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -20.00   -4.00    0.00   12.76    8.00  653.00
cat("\nSummary of the arrival delays of each of the carriers:\n")
## 
## Summary of the arrival delays of each of the carriers:
tapply(dat$arr_delay, dat$name, function(x) summary(x))
## $`AirTran Airways Corporation`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -44.00   -7.00    5.00   20.12   24.00  572.00 
## 
## $`Alaska Airlines Inc.`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -74.000 -32.000 -17.000  -9.931   2.000 198.000 
## 
## $`American Airlines Inc.`
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  -75.0000  -21.0000   -9.0000    0.3643    8.0000 1007.0000 
## 
## $`Delta Air Lines Inc.`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -71.000 -20.000  -8.000   1.644   8.000 931.000 
## 
## $`Endeavor Air Inc.`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -68.00  -21.00   -7.00    7.38   15.00  744.00 
## 
## $`Envoy Air`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -53.00  -13.00   -1.00   10.77   18.00 1127.00 
## 
## $`ExpressJet Airlines Inc.`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -62.0   -14.0    -1.0    15.8    26.0   577.0 
## 
## $`Frontier Airlines Inc.`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -47.00   -9.00    6.00   21.92   31.00  834.00 
## 
## $`Hawaiian Airlines Inc.`
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  -70.000  -27.750  -13.000   -6.915    2.750 1272.000 
## 
## $`JetBlue Airways`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -71.000 -14.000  -3.000   9.458  17.000 497.000 
## 
## $`Mesa Airlines Inc.`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -46.00  -16.00   -2.00   15.56   24.25  381.00 
## 
## $`SkyWest Airlines Inc.`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -26.00  -16.00   -7.00   11.93    6.00  157.00 
## 
## $`Southwest Airlines Co.`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -58.000 -15.000  -3.000   9.649  15.000 453.000 
## 
## $`United Air Lines Inc.`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -75.000 -18.000  -6.000   3.558  12.000 455.000 
## 
## $`US Airways Inc.`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -70.00  -15.00   -6.00    2.13    8.00  492.00 
## 
## $`Virgin America`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -86.000 -23.000  -9.000   1.764   8.000 676.000
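
In the summaries above the mean sits well above the median for every carrier, which points to a long right tail of very late flights. A compact way to compare carriers that is less sensitive to those extremes is to put the median and the share of late flights next to the mean. This is only a sketch on made-up numbers (delays_toy is hypothetical, and the 15-minute cut-off is an assumption chosen for illustration):

```r
# Hypothetical delays: carrier A has one extreme outlier, carrier B is steadily late
delays_toy <- data.frame(
  name      = c("A", "A", "A", "A", "B", "B", "B", "B"),
  dep_delay = c(-5, -2, 0, 600, 10, 20, 30, 40)
)

late_cutoff <- 15  # assumed threshold (minutes) for calling a flight "late"

mean_delay   <- tapply(delays_toy$dep_delay, delays_toy$name, mean)
median_delay <- tapply(delays_toy$dep_delay, delays_toy$name, median)
share_late   <- tapply(delays_toy$dep_delay > late_cutoff, delays_toy$name, mean)

# One row per carrier
print(data.frame(mean_delay, median_delay, share_late))
```

Here carrier A has the higher mean only because of a single 600-minute delay, while its median and its share of late flights are both lower than those of carrier B.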
  • To see the distribution of the departure and arrival delay times, we make a boxplot of both delay times.
p1 <- ggplot(dat, aes(x=reorder(name, dep_delay, FUN=median ), y=dep_delay)) + 
  geom_boxplot()+
  coord_flip()+
  labs(x="Carrier Name", y="Departure Delay (in minutes)", 
       title="Box Plot of departure delay times of all the carriers")
  

p2 <- ggplot(dat, aes(x=reorder(name, arr_delay, FUN=median ), y=arr_delay)) + 
  geom_boxplot()+
  coord_flip()+
  labs(x="Carrier Name", y="Arrival Delay (in minutes)", 
       title="Box Plot of Arrival delay times of all the carriers")

require(gridExtra)
## Loading required package: gridExtra
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
grid.arrange(p1,p2,nrow=2)

  • As seen above, the box-plots are not very informative. However, we can easily say that the delay times (for both arrivals and departures) are widely dispersed, given the high number of outliers.

  • To further analyze, we are now comparing the mean delay times of each of the carriers.

#To compare the mean arrival and departure delays we use tapply and 
#store the results in a new data frame
depDelay.mean <- tapply(dat$dep_delay, dat$name, function(x) mean(x, na.rm = TRUE))
arrDelay.mean <- tapply(dat$arr_delay, dat$name, function(x) mean(x, na.rm = TRUE))

#combining the mean departure and arrival delays into one data frame
delayData <- data.frame(carrier_name=names(arrDelay.mean),
                        dep_delay=depDelay.mean, arr_delay=arrDelay.mean)


#plotting the mean departure delay of all the carriers in ascending order
depDelayPlot <- ggplot(delayData, aes(x=reorder(carrier_name, -dep_delay), y=dep_delay))+
  geom_bar(stat = "identity")+ 
  coord_flip()+
  labs(x="Carrier Names", 
       y="Avg departure delay time (in minutes)", 
       title="Average departure delay times for all the carriers")

#plotting the mean arrival delay of all the carriers in ascending order
arrDelayPlot <- ggplot(delayData, aes(x=reorder(carrier_name, -arr_delay), y=arr_delay))+
  geom_bar(stat = "identity") + 
  coord_flip() +
  labs(x="Carrier Names", 
       y="Avg arrival delay time (in minutes)", 
       title="Average arrival delay times for all the carriers")


grid.arrange(depDelayPlot, arrDelayPlot, nrow=2)

  • Thus, from the above plots it is clear that Frontier Airlines Inc. is the worst performing airline, given its high departure and arrival delay times.

  • Hawaiian Airlines and Alaska Airlines have the best operational performance.

  • Also, it is worth noting that most of the flights arrive at their destination before their scheduled time or within 10 minutes of their scheduled arrival time. Given the first graph, we can say that even if flights tend to depart a little later than their scheduled departure time, it is highly likely that a flight will not arrive more than 10 minutes after its scheduled arrival time.
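
The last point can be checked directly rather than read off the plots. The sketch below, on made-up values (delay_toy is hypothetical), computes the share of flights arriving no more than 10 minutes late, the same share among flights that left late, and the average time made up between departure and arrival:

```r
# Hypothetical departure and arrival delays in minutes
delay_toy <- data.frame(
  dep_delay = c(5, 12, -3, 40, 8),
  arr_delay = c(-4, 9, -10, 35, 15)
)

# Share of all flights arriving at most 10 minutes late
share_within_10 <- mean(delay_toy$arr_delay <= 10)

# Same share, restricted to flights that departed late
late_dep <- delay_toy$dep_delay > 0
share_late_dep_within_10 <- mean(delay_toy$arr_delay[late_dep] <= 10)

# Average number of minutes recovered between departure and arrival
mean_recovered <- mean(delay_toy$dep_delay - delay_toy$arr_delay)

print(share_within_10)
print(share_late_dep_within_10)
print(mean_recovered)
```

Running the same three lines on dat$dep_delay and dat$arr_delay would put a number on how often a late departure still arrives within 10 minutes of schedule.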

  • For the second question, we analyze the flight performance across the entire year of 2013 to determine what time of the year is the busiest for the airline carriers.

dat$month <- as.factor(dat$month) #treating month as a discrete variable with levels 1 to 12
head(dat)
depDelay.month.mean <- tapply(dat$dep_delay, dat$month, function(x) mean(x, na.rm = T))
arrDelay.month.mean <- tapply(dat$arr_delay, dat$month, function(x) mean(x, na.rm = T))
monthDelayData <- data.frame(month=names(arrDelay.month.mean),
                        dep_delay=depDelay.month.mean, arr_delay=arrDelay.month.mean)

ggplot(dat) +
  geom_point(aes(x=reorder(month, -dep_delay), y=dep_delay,
                 colour="Departure Delay"), position = "jitter")+
  geom_point(aes(x=reorder(month, -arr_delay), y=arr_delay, colour="Arrival Delay"),
                  alpha=0.3, position = "jitter")+
  labs(x="Month", 
       y="Total departure/arrival delay time (in minutes)", 
       title="Total departure/arrival delay times for all the months")+
  scale_colour_manual("Delay (minutes)", 
                      values = c("Departure Delay"="blue", "Arrival Delay"="light green"))

  • From this plot it appears that the largest number of delays happened in the 7th month, i.e. July.

  • Going a step further, we now plot the mean delay across all the months.
ggplot(monthDelayData)+
  geom_line(aes(x=reorder(month, -dep_delay), y=dep_delay, 
                 colour="Departure Delay", group=1))+
  geom_line(aes(x=reorder(month, -arr_delay), y=arr_delay, 
                 colour="Arrival Delay", group=1))+
  labs(x="Month", 
       y="Avg departure/arrival delay time (in minutes)", 
       title="Average departure/arrival delay times for all the months")+
  scale_colour_manual("Delay (minutes)", 
                      values = c("Departure Delay"="blue", "Arrival Delay"="green"))

  • From the above it is clear that July was the worst month for flights and November was the best: the mean departure delay was at its minimum in November and the mean arrival delay was almost zero. That is to say, on average, flights in November reached their destination close to their scheduled time.

  • The maximum number of delays in the month of July might be related to the weather during this time of the year. Further analysis is required for this.
ggplot(dat) +
  geom_point(aes(x=month, y=dep_delay, color="Departure Delay"), 
             position = "jitter")+
  geom_point(aes(x=month, y=arr_delay, color="Arrival Delay"),
             position = "jitter", alpha=0.3)+
  facet_wrap(.~name)+
  labs(x="Month", 
       y="Total departure/arrival delay time (in minutes)", 
       title="Total departure/arrival delay times for all the months for all carriers")+
  scale_colour_manual("Delay (minutes)", 
                      values = c("Departure Delay"="blue", "Arrival Delay"="light green"))
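
The plots above describe delays per month. If "busiest" is read literally as the month with the most flights, a simple count per month answers the second question more directly. This is a minimal sketch on made-up rows (month_toy is hypothetical); on the real data the count is probably better taken before the incomplete cases are dropped, since those rows are still scheduled flights:

```r
# Hypothetical month column: two flights in January, three in July, one in November
month_toy <- data.frame(month = c(1, 1, 7, 7, 7, 11))

# Number of flights in each month
flights_per_month <- table(month_toy$month)
print(flights_per_month)

# Month with the largest number of flights (returned as a character label)
busiest_month <- names(flights_per_month)[which.max(flights_per_month)]
print(busiest_month)
```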

(d) Challenging my results:

After completing the exploratory analysis here is what we need to consider:

  • These findings may not be accurate, as this is just a preliminary analysis; identifying the causes behind them depends on further analysis.

  • Other data sets in the nycflights13 package (for example, weather) might be required to analyze the trends completely.

  • This data might be biased because not all the flights of all the carriers might have been recorded in the database, which may give rise to inaccuracies.
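
One concrete source of bias in this analysis is the removal of incomplete cases in part (a): if some carriers lose a larger share of their rows than others, their delay summaries are based on a less representative sample. A rough check, sketched here on made-up rows (missing_toy is hypothetical), is to compute the share of incomplete rows per carrier before dropping them:

```r
# Hypothetical rows: carrier AA has more missing arrival delays than UA
missing_toy <- data.frame(
  carrier   = c("AA", "AA", "AA", "AA", "UA", "UA", "UA", "UA"),
  arr_delay = c(1, NA, 3, NA, 5, 6, 7, NA)
)

# TRUE for rows that would be removed by drop_na()
incomplete <- !complete.cases(missing_toy)

# Share of rows dropped for each carrier
share_dropped <- tapply(incomplete, missing_toy$carrier, mean)
print(share_dropped)
```

Large differences between carriers in this share would be a reason to treat the ranking of best and worst carriers with extra caution.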