The purpose of this project is to perform manipulation and basic analysis of a data.frame
in R. To accomplish this, I will be utilizing the nycflights13
package, which contains information about all flights leaving New York City in 2013. The full documentation for the package is available here.
The flights
dataset in the nycflights13
package contains the on-time information about all of the 2013 flights. This is loaded into memory as nyc_flights
:
nyc_flights <- flights
str(nyc_flights)
## Classes 'tbl_df', 'tbl' and 'data.frame': 336776 obs. of 16 variables:
## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ...
## $ dep_delay: num 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ...
## $ arr_delay: num 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr "UA" "UA" "AA" "B6" ...
## $ tailnum : chr "N14228" "N24211" "N619AA" "N804JB" ...
## $ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ origin : chr "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num 1400 1416 1089 1576 762 ...
## $ hour : num 5 5 5 5 5 5 5 5 5 5 ...
## $ minute : num 17 33 42 44 54 54 55 57 57 58 ...
The date of the flights is stored in separate columns; these are converted to a date format recognized by R into a new date
column:
nyc_flights$date <- as.Date(paste(nyc_flights$year,
nyc_flights$month,
nyc_flights$day, sep = "-"))
The carrier
, tailnum
, origin
, and dest
columns are stored as characters, but will be more useful for exploratory analysis as factors:
nyc_flights$carrier <- factor(nyc_flights$carrier)
nyc_flights$tailnum <- factor(nyc_flights$tailnum)
nyc_flights$origin <- factor(nyc_flights$origin)
nyc_flights$dest <- factor(nyc_flights$dest)
The airlines
, airports
, and planes
datasets included in the nycflights13
package include metadata about the carrier
, dest
, and tailnum
fields in the flights
table. The information from these tables is incorporated into nyc_flights
using the dplyr
package:
require(dplyr)
# join airports dataset on faa code for destinations
nyc_flights <- inner_join(nyc_flights, airports, by = c("dest" = "faa"))
# change ambiguous airport 'name' column
names(nyc_flights)[names(nyc_flights) == "name"]<- "dest_name"
# join airlines dataset on 2-character carrier code
nyc_flights <- inner_join(nyc_flights, airlines, by = "carrier")
# change ambiguous airline 'name' column
names(nyc_flights)[names(nyc_flights) == "name"] <- "airline"
The weather
dataset included in the package includes hourly data for the three NYC airports. To link this data to nyc_flights
, a left_join
is used in case there is missing weather data:
nyc_flights <- left_join(nyc_flights, weather, by = c("year", "month", "day", "hour", "origin"))
str(nyc_flights)
## Classes 'tbl_df', 'tbl' and 'data.frame': 329174 obs. of 33 variables:
## $ year : num 2013 2013 2013 2013 2013 ...
## $ month : num 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int 517 533 542 554 554 555 557 557 558 558 ...
## $ dep_delay : num 2 4 2 -6 -4 -5 -3 -3 -2 -2 ...
## $ arr_time : int 830 850 923 812 740 913 709 838 753 849 ...
## $ arr_delay : num 11 20 33 -25 12 19 -14 -8 8 -2 ...
## $ carrier : chr "UA" "UA" "AA" "DL" ...
## $ tailnum : Factor w/ 4044 levels "","D942DN","N0EGMQ",..: 181 525 2402 2662 1143 1830 3301 2208 1179 3164 ...
## $ flight : int 1545 1714 1141 461 1696 507 5708 79 301 49 ...
## $ origin : chr "EWR" "LGA" "JFK" "LGA" ...
## $ dest : chr "IAH" "IAH" "MIA" "ATL" ...
## $ air_time : num 227 227 160 116 150 158 53 140 138 149 ...
## $ distance : num 1400 1416 1089 762 719 ...
## $ hour : num 5 5 5 5 5 5 5 5 5 5 ...
## $ minute : num 17 33 42 54 54 55 57 57 58 58 ...
## $ date : Date, format: "2013-01-01" "2013-01-01" ...
## $ dest_name : chr "George Bush Intercontinental" "George Bush Intercontinental" "Miami Intl" "Hartsfield Jackson Atlanta Intl" ...
## $ lat : num 30 30 25.8 33.6 42 ...
## $ lon : num -95.3 -95.3 -80.3 -84.4 -87.9 ...
## $ alt : int 97 97 8 1026 668 9 313 96 668 19 ...
## $ tz : num -6 -6 -5 -5 -6 -5 -5 -5 -6 -5 ...
## $ dst : chr "A" "A" "A" "A" ...
## $ airline : Factor w/ 1571 levels "40-Mile Air",..: 1475 1475 268 554 1475 837 635 837 268 837 ...
## $ temp : num NA NA NA NA NA NA NA NA NA NA ...
## $ dewp : num NA NA NA NA NA NA NA NA NA NA ...
## $ humid : num NA NA NA NA NA NA NA NA NA NA ...
## $ wind_dir : num NA NA NA NA NA NA NA NA NA NA ...
## $ wind_speed: num NA NA NA NA NA NA NA NA NA NA ...
## $ wind_gust : num NA NA NA NA NA NA NA NA NA NA ...
## $ precip : num NA NA NA NA NA NA NA NA NA NA ...
## $ pressure : num NA NA NA NA NA NA NA NA NA NA ...
## $ visib : num NA NA NA NA NA NA NA NA NA NA ...
Following the joins, the character columns above are again changed to factors.
Inspecting the resulting data.frame
, it appears that data is largely unavailable for JFK and LGA airports. To allow for investigation of the impact of weather, a subset of the nyc_flights
dataset is created for flights departing EWR, containing only the relevant columns and excluding NA
values:
EWR_flights <- na.omit(subset(nyc_flights, origin == "EWR", c(arr_delay, temp:visib)))
Exploration of the data contained in nyc_flights
and EWR_flights
can be performed using graphics.
Since the arrival delay is the delay that passengers likely measure their flights’ on-time behavior against, this is investigated using ggplot2
:
This plot shows a very strong concentration of points between roughly -100 and 250 minutes, with the highest concentration between -15 and 0 minutes. There is a very long, low tail extending out to over 1250 minutes. To investigate if these points are outliers, a boxplot is created with outliers highlighted in red:
From this boxplot, it is clear that all observations of greater than 200 minutes are outliers, so the histogram is recreated excluding these points. With the “zooming-in” on the histogram, narrower bins are used to view the data with increased granularity.
From this scatterplot, it can be seen that a very large portion of flights arrive between 30 minutes early and five minutes late, with the highest frequency between 20 minutes early and exactly on time. The fact that such a high portion of flights arrive early may be by design, as airlines may add additional time into their scheduled flight duration so that they can report early arrivals.
To investigate if there is any difference between airports, the previous plot is broken down by origin
airport. The plot shows that the three airports have remarkably similar distributions of flight delays:
Since weather data is available for most flights (98984) departing Newark Liberty International Airport (EWR), the relationship between weather at the origin airport and arrival delay can be studied for this dataset. It should be noted that all of the investigations of weather below do not take weather at the destination airport into account, which would likely have a strong affect on the on-time performance of flights.
In order to determine which weather variables are most likely to impact flights’ delays, a correlation matrix is calculated for EWR_flights
:
## arr_delay temp dewp humid wind_dir
## arr_delay 1.000000000 0.05298780 0.03377885 -0.02233197 0.003887528
## temp 0.052987803 1.00000000 0.89028301 0.02010170 -0.057099308
## dewp 0.033778854 0.89028301 1.00000000 0.46482733 -0.191026305
## humid -0.022331972 0.02010170 0.46482733 1.00000000 -0.319185691
## wind_dir 0.003887528 -0.05709931 -0.19102631 -0.31918569 1.000000000
## wind_speed 0.026581471 -0.03505389 -0.08381286 -0.11968245 0.135519301
## wind_gust 0.026581471 -0.03505389 -0.08381286 -0.11968245 0.135519301
## precip 0.045772203 -0.04165575 0.03082298 0.18009024 -0.065580967
## pressure -0.118258567 -0.25537038 -0.28979989 -0.15773960 -0.213537212
## visib -0.086241929 0.05577439 -0.11938893 -0.44567063 0.214343604
## wind_speed wind_gust precip pressure visib
## arr_delay 0.026581471 0.026581471 0.045772203 -0.11825857 -0.08624193
## temp -0.035053892 -0.035053892 -0.041655747 -0.25537038 0.05577439
## dewp -0.083812865 -0.083812865 0.030822977 -0.28979989 -0.11938893
## humid -0.119682454 -0.119682454 0.180090245 -0.15773960 -0.44567063
## wind_dir 0.135519301 0.135519301 -0.065580967 -0.21353721 0.21434360
## wind_speed 1.000000000 1.000000000 0.006582175 -0.08334562 0.04562342
## wind_gust 1.000000000 1.000000000 0.006582175 -0.08334562 0.04562342
## precip 0.006582175 0.006582175 1.000000000 -0.08969011 -0.34220490
## pressure -0.083345623 -0.083345623 -0.089690114 1.00000000 0.10896725
## visib 0.045623422 0.045623422 -0.342204903 0.10896725 1.00000000
The first column of this correlation matrix contains the corelations between each variable and arr_delay
. The low values in this column indicated that there is a very weak relationship between weather at the origin airport and the flight delay. For the sake of completion, however, the relationship is still analyzed.
By sorting the first column according to the absolute value of the correlation, the variables with the strongest relationship with arr_delay
can be identified. Since arr_delay
will necessarily have a perfect correlation with itself, the second- and third-highest values returned will be those with the strongest correlation with arr_delay
:
sort(abs(cor(EWR_flights)[, 1]), decreasing = TRUE)[2:3]
## pressure visib
## 0.11825857 0.08624193
Since pressure
has the strongest correlation, it will be plotted against arr_delay
, with visib
used to determine the color of the points:
As expected, this scatterplot shows a very weak relationship between pressure
and arr_delay
, and the coloring of the scatter indicates that most visib
points lie at or around 10 miles. This is further shown by calculating the linear relationship between the two variables:
summary(lm(EWR_flights$arr_delay ~ EWR_flights$pressure))
##
## Call:
## lm(formula = EWR_flights$arr_delay ~ EWR_flights$pressure)
##
## Residuals:
## Min 1Q Median 3Q Max
## -87.80 -23.23 -10.99 7.55 1110.10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 706.80971 18.67080 37.86 <2e-16 ***
## EWR_flights$pressure -0.68729 0.01834 -37.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.98 on 98982 degrees of freedom
## Multiple R-squared: 0.01399, Adjusted R-squared: 0.01398
## F-statistic: 1404 on 1 and 98982 DF, p-value: < 2.2e-16
The R-sqared value for this linear relationship is 0.01399, which represents essentially no relationship.