Background

The purpose of this project is to perform manipulation and basic analysis of a data.frame in R. To accomplish this, I will be utilizing the nycflights13 package, which contains information about all flights leaving New York City in 2013. The full documentation for the package is available here.

The flights dataset in the nycflights13 package contains the on-time information about all of the 2013 flights. This is loaded into memory as nyc_flights:

nyc_flights <- flights
str(nyc_flights)
## Classes 'tbl_df', 'tbl' and 'data.frame':    336776 obs. of  16 variables:
##  $ year     : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ dep_delay: num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ arr_delay: num  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier  : chr  "UA" "UA" "AA" "B6" ...
##  $ tailnum  : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ flight   : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ origin   : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest     : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time : num  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance : num  1400 1416 1089 1576 762 ...
##  $ hour     : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ minute   : num  17 33 42 44 54 54 55 57 57 58 ...

Data Manipulation

The date of the flights is stored in separate columns; these are converted to a date format recognized by R into a new date column:

nyc_flights$date <- as.Date(paste(nyc_flights$year,
                                  nyc_flights$month,
                                  nyc_flights$day, sep = "-"))

The carrier, tailnum, origin, and dest columns are stored as characters, but will be more useful for exploratory analysis as factors:

nyc_flights$carrier <- factor(nyc_flights$carrier)
nyc_flights$tailnum <- factor(nyc_flights$tailnum)
nyc_flights$origin  <- factor(nyc_flights$origin)
nyc_flights$dest    <- factor(nyc_flights$dest)

The airlines, airports, and planes datasets included in the nycflights13 package include metadata about the carrier, dest, and tailnum fields in the flights table. The information from these tables is incorporated into nyc_flights using the dplyr package:

require(dplyr)
# join airports dataset on faa code for destinations
nyc_flights <- inner_join(nyc_flights, airports, by = c("dest" = "faa"))
# change ambiguous airport 'name' column
names(nyc_flights)[names(nyc_flights) == "name"]<- "dest_name"
# join airlines dataset on 2-character carrier code
nyc_flights <- inner_join(nyc_flights, airlines, by = "carrier")
# change ambiguous airline 'name' column
names(nyc_flights)[names(nyc_flights) == "name"] <- "airline"

The weather dataset included in the package includes hourly data for the three NYC airports. To link this data to nyc_flights, a left_join is used in case there is missing weather data:

nyc_flights <- left_join(nyc_flights, weather, by = c("year", "month", "day", "hour", "origin"))
str(nyc_flights)
## Classes 'tbl_df', 'tbl' and 'data.frame':    329174 obs. of  33 variables:
##  $ year      : num  2013 2013 2013 2013 2013 ...
##  $ month     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ day       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time  : int  517 533 542 554 554 555 557 557 558 558 ...
##  $ dep_delay : num  2 4 2 -6 -4 -5 -3 -3 -2 -2 ...
##  $ arr_time  : int  830 850 923 812 740 913 709 838 753 849 ...
##  $ arr_delay : num  11 20 33 -25 12 19 -14 -8 8 -2 ...
##  $ carrier   : chr  "UA" "UA" "AA" "DL" ...
##  $ tailnum   : Factor w/ 4044 levels "","D942DN","N0EGMQ",..: 181 525 2402 2662 1143 1830 3301 2208 1179 3164 ...
##  $ flight    : int  1545 1714 1141 461 1696 507 5708 79 301 49 ...
##  $ origin    : chr  "EWR" "LGA" "JFK" "LGA" ...
##  $ dest      : chr  "IAH" "IAH" "MIA" "ATL" ...
##  $ air_time  : num  227 227 160 116 150 158 53 140 138 149 ...
##  $ distance  : num  1400 1416 1089 762 719 ...
##  $ hour      : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ minute    : num  17 33 42 54 54 55 57 57 58 58 ...
##  $ date      : Date, format: "2013-01-01" "2013-01-01" ...
##  $ dest_name : chr  "George Bush Intercontinental" "George Bush Intercontinental" "Miami Intl" "Hartsfield Jackson Atlanta Intl" ...
##  $ lat       : num  30 30 25.8 33.6 42 ...
##  $ lon       : num  -95.3 -95.3 -80.3 -84.4 -87.9 ...
##  $ alt       : int  97 97 8 1026 668 9 313 96 668 19 ...
##  $ tz        : num  -6 -6 -5 -5 -6 -5 -5 -5 -6 -5 ...
##  $ dst       : chr  "A" "A" "A" "A" ...
##  $ airline   : Factor w/ 1571 levels "40-Mile Air",..: 1475 1475 268 554 1475 837 635 837 268 837 ...
##  $ temp      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ dewp      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ humid     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ wind_dir  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ wind_speed: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ wind_gust : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ precip    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ pressure  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ visib     : num  NA NA NA NA NA NA NA NA NA NA ...

Following the joins, the character columns above are again changed to factors.

Inspecting the resulting data.frame, it appears that data is largely unavailable for JFK and LGA airports. To allow for investigation of the impact of weather, a subset of the nyc_flights dataset is created for flights departing EWR, containing only the relevant columns and excluding NA values:

EWR_flights <- na.omit(subset(nyc_flights, origin == "EWR", c(arr_delay, temp:visib)))

Exploratory Analysis

Exploration of the data contained in nyc_flights and EWR_flights can be performed using graphics.

What is the distribution of flight delay times?

Since the arrival delay is the delay that passengers likely measure their flights’ on-time behavior against, this is investigated using ggplot2:

This plot shows a very strong concentration of points between roughly -100 and 250 minutes, with the highest concentration between -15 and 0 minutes. There is a very long, low tail extending out to over 1250 minutes. To investigate if these points are outliers, a boxplot is created with outliers highlighted in red:

From this boxplot, it is clear that all observations of greater than 200 minutes are outliers, so the histogram is recreated excluding these points. With the “zooming-in” on the histogram, narrower bins are used to view the data with increased granularity.

From this scatterplot, it can be seen that a very large portion of flights arrive between 30 minutes early and five minutes late, with the highest frequency between 20 minutes early and exactly on time. The fact that such a high portion of flights arrive early may be by design, as airlines may add additional time into their scheduled flight duration so that they can report early arrivals.

To investigate if there is any difference between airports, the previous plot is broken down by origin airport. The plot shows that the three airports have remarkably similar distributions of flight delays:

How does weather affect flight delays?

Since weather data is available for most flights (98984) departing Newark Liberty International Airport (EWR), the relationship between weather at the origin airport and arrival delay can be studied for this dataset. It should be noted that all of the investigations of weather below do not take weather at the destination airport into account, which would likely have a strong affect on the on-time performance of flights.

In order to determine which weather variables are most likely to impact flights’ delays, a correlation matrix is calculated for EWR_flights:

##               arr_delay        temp        dewp       humid     wind_dir
## arr_delay   1.000000000  0.05298780  0.03377885 -0.02233197  0.003887528
## temp        0.052987803  1.00000000  0.89028301  0.02010170 -0.057099308
## dewp        0.033778854  0.89028301  1.00000000  0.46482733 -0.191026305
## humid      -0.022331972  0.02010170  0.46482733  1.00000000 -0.319185691
## wind_dir    0.003887528 -0.05709931 -0.19102631 -0.31918569  1.000000000
## wind_speed  0.026581471 -0.03505389 -0.08381286 -0.11968245  0.135519301
## wind_gust   0.026581471 -0.03505389 -0.08381286 -0.11968245  0.135519301
## precip      0.045772203 -0.04165575  0.03082298  0.18009024 -0.065580967
## pressure   -0.118258567 -0.25537038 -0.28979989 -0.15773960 -0.213537212
## visib      -0.086241929  0.05577439 -0.11938893 -0.44567063  0.214343604
##              wind_speed    wind_gust       precip    pressure       visib
## arr_delay   0.026581471  0.026581471  0.045772203 -0.11825857 -0.08624193
## temp       -0.035053892 -0.035053892 -0.041655747 -0.25537038  0.05577439
## dewp       -0.083812865 -0.083812865  0.030822977 -0.28979989 -0.11938893
## humid      -0.119682454 -0.119682454  0.180090245 -0.15773960 -0.44567063
## wind_dir    0.135519301  0.135519301 -0.065580967 -0.21353721  0.21434360
## wind_speed  1.000000000  1.000000000  0.006582175 -0.08334562  0.04562342
## wind_gust   1.000000000  1.000000000  0.006582175 -0.08334562  0.04562342
## precip      0.006582175  0.006582175  1.000000000 -0.08969011 -0.34220490
## pressure   -0.083345623 -0.083345623 -0.089690114  1.00000000  0.10896725
## visib       0.045623422  0.045623422 -0.342204903  0.10896725  1.00000000

The first column of this correlation matrix contains the corelations between each variable and arr_delay. The low values in this column indicated that there is a very weak relationship between weather at the origin airport and the flight delay. For the sake of completion, however, the relationship is still analyzed.

By sorting the first column according to the absolute value of the correlation, the variables with the strongest relationship with arr_delay can be identified. Since arr_delay will necessarily have a perfect correlation with itself, the second- and third-highest values returned will be those with the strongest correlation with arr_delay:

sort(abs(cor(EWR_flights)[, 1]), decreasing = TRUE)[2:3]
##   pressure      visib 
## 0.11825857 0.08624193

Since pressure has the strongest correlation, it will be plotted against arr_delay, with visib used to determine the color of the points:

As expected, this scatterplot shows a very weak relationship between pressure and arr_delay, and the coloring of the scatter indicates that most visib points lie at or around 10 miles. This is further shown by calculating the linear relationship between the two variables:

summary(lm(EWR_flights$arr_delay ~ EWR_flights$pressure))
## 
## Call:
## lm(formula = EWR_flights$arr_delay ~ EWR_flights$pressure)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -87.80  -23.23  -10.99    7.55 1110.10 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          706.80971   18.67080   37.86   <2e-16 ***
## EWR_flights$pressure  -0.68729    0.01834  -37.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.98 on 98982 degrees of freedom
## Multiple R-squared:  0.01399,    Adjusted R-squared:  0.01398 
## F-statistic:  1404 on 1 and 98982 DF,  p-value: < 2.2e-16

The R-sqared value for this linear relationship is 0.01399, which represents essentially no relationship.