Luke Hoefer, Cris Mascarenhas, Anirudh Manjesh
This analysis uses the nycflights13 dataset, which contains detailed information about all flights that departed from New York City in 2013. Specifically, it covers flights leaving from the three major NYC airports: JFK, LGA, and EWR. It includes the date of each flight, scheduled and actual departure and arrival times, departure and arrival delays, airline carrier codes, flight numbers, plane tail numbers, origin and destination airports, air time, distance traveled, and scheduled departure times broken down by hour and minute.
## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Distance and air time are linearly related: longer distances correspond to longer air times. The same relationship holds for flights from each of the three origin airports (EWR, JFK, and LGA).
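The linear relationship described above can be checked numerically. The following is a minimal sketch, not necessarily the code behind the original figure: it assumes the `nycflights13` and `dplyr` packages are installed, and the names `flown`, `fit_all`, and `cor_by_origin` are illustrative.

```r
library(nycflights13)  # provides the flights data frame
library(dplyr)

# Drop rows where either variable is missing (e.g. cancelled flights)
flown <- flights %>% filter(!is.na(air_time), !is.na(distance))

# Least-squares fit of air time (minutes) on distance (miles), all origins pooled
fit_all <- lm(air_time ~ distance, data = flown)

# Pearson correlation between distance and air time for each origin airport
cor_by_origin <- flown %>%
  group_by(origin) %>%
  summarise(r = cor(distance, air_time), n = n())

print(coef(fit_all))
print(cor_by_origin)
```

A positive slope and a correlation close to 1 for each origin would support the claim that the relationship is shared across EWR, JFK, and LGA.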
Flights are fairly evenly distributed across the three airports, with Newark Liberty International Airport (EWR) handling the most flights over the course of 2013 (117127 flights). The pie chart also displays the average delay for each airport.
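A per-airport summary like the one behind the pie chart might be computed as follows. This is a sketch under assumptions: it counts every row of `flights` as-is, so the totals will differ from the figure quoted above if rows with missing values were removed first, and it uses departure delay as the "average delay" since the text does not say which delay was plotted.

```r
library(nycflights13)
library(dplyr)

by_origin <- flights %>%
  group_by(origin) %>%
  summarise(
    n_flights      = n(),                           # flights departing from this airport
    mean_dep_delay = mean(dep_delay, na.rm = TRUE)  # assumption: departure delay, in minutes
  ) %>%
  mutate(share = n_flights / sum(n_flights))        # fraction of all flights (pie slice size)

print(by_origin)
```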
The busiest months are the summer months of July and August. Averaged over the whole year, there are 28064.67 flights per month.
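The monthly counts and their average can be reproduced with a short summary. This is a sketch assuming `nycflights13` and `dplyr`; `by_month` and `monthly_mean` are illustrative names.

```r
library(nycflights13)
library(dplyr)

# Number of flights in each calendar month, busiest first
by_month <- flights %>%
  count(month, name = "n_flights") %>%
  arrange(desc(n_flights))

# Average number of flights per month over the year
monthly_mean <- mean(by_month$n_flights)

print(by_month)
print(monthly_mean)
```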
Outliers are removed from a random sample of the flights dataset using the interquartile range (IQR): rows whose departure time falls outside the bounds Q1 - 1.5*IQR and Q3 + 1.5*IQR are dropped.
library(nycflights13)
library(dplyr)
# Use a random sample of 1000 flights for simplicity
flights_sample = flights[sample(1:nrow(flights), 1000, replace = FALSE), ]
# Quartiles of dep_time, the same column that is filtered below
Q1 = quantile(flights_sample[["dep_time"]], 0.25, na.rm = TRUE)
Q3 = quantile(flights_sample[["dep_time"]], 0.75, na.rm = TRUE)
iqr_value = Q3 - Q1  # named iqr_value to avoid masking stats::IQR()
lower_bound = Q1 - 1.5 * iqr_value
upper_bound = Q3 + 1.5 * iqr_value
# Keep rows inside the bounds; rows with a missing dep_time are dropped as well
flights_sample = flights_sample %>% filter(dep_time >= lower_bound & dep_time <= upper_bound)
cat("Lower Bound:", lower_bound, "\n", "IQR", iqr_value, "\n", "Upper Bound:", upper_bound)
## Lower Bound: -284.625
##  IQR 816.25
##  Upper Bound: 2980.375
As the graph shows, there is a quasi-linear relationship between departure time and arrival time. The color bar further reveals that an increase in distance results in a longer travel time, which is expected because longer flights take longer to complete.
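A plot of this kind could be drawn as sketched below. This is an illustration rather than the original plotting code: it assumes `ggplot2` is available, draws its own 1000-row sample with an arbitrary seed, and picks its own axis labels.

```r
library(nycflights13)
library(ggplot2)

set.seed(1)  # arbitrary seed so the sample is reproducible
flights_sample <- flights[sample(nrow(flights), 1000), ]

# Departure time vs. arrival time, with point colour mapped to flight distance
p <- ggplot(flights_sample, aes(x = dep_time, y = arr_time, colour = distance)) +
  geom_point(alpha = 0.6, na.rm = TRUE) +  # na.rm silently drops rows with missing times
  labs(
    x = "Departure time (HHMM)",
    y = "Arrival time (HHMM)",
    colour = "Distance (miles)"
  )

print(p)
```

Because both times are stored as HHMM clock values, flights that arrive after midnight wrap around to small arrival times, which is one reason the relationship is only quasi-linear.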
Departure delays are grouped by carrier. The mean departure delay is marked with a dashed red line and the median departure delay with a blue line. The mean is much larger than the median, which indicates that a number of large positive departure delays are skewing the distribution to the right.
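The gap between mean and median can be inspected directly. The following sketch assumes `nycflights13` and `dplyr`; the names `delay_by_carrier` and `overall` are illustrative and not taken from the original analysis.

```r
library(nycflights13)
library(dplyr)

# Flights with a recorded departure delay (missing values excluded)
delays <- flights %>% filter(!is.na(dep_delay))

# Mean and median departure delay (minutes) for each carrier
delay_by_carrier <- delays %>%
  group_by(carrier) %>%
  summarise(
    mean_delay   = mean(dep_delay),
    median_delay = median(dep_delay),
    n            = n()
  )

# The same two statistics over all carriers combined
overall <- delays %>%
  summarise(mean_delay = mean(dep_delay), median_delay = median(dep_delay))

print(delay_by_carrier)
print(overall)
```

A mean well above the median is the usual signature of a right-skewed distribution, where a small number of very long delays pull the mean upward.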
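We conducted a one-way ANOVA to determine whether mean departure delay differs significantly across airlines. The carrier term has 15 degrees of freedom, so the model covers all 16 carriers in the dataset rather than only the top 5 by number of flights.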
##                 Df    Sum Sq Mean Sq F value Pr(>F)
## carrier         15   6348034  423202   264.9 <2e-16 ***
## Residuals   328505 524819199    1598
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is less than 2*10^-16, which indicates a statistically significant relationship between carrier and mean departure delay: at least one carrier's mean delay differs from the others. The 15 degrees of freedom for the carrier term indicate that 16 carriers are included in the model.
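An ANOVA table of the form shown above can be produced with `aov()`. The sketch below is an assumption about how the model might have been fitted, not the original code: the first model uses every carrier, and the second shows one way to restrict the comparison to the five carriers with the most flights. The names `delays`, `fit_all`, `top5`, and `fit_top5` are illustrative.

```r
library(nycflights13)
library(dplyr)

# Keep flights with a recorded departure delay
delays <- flights %>% filter(!is.na(dep_delay))

# One-way ANOVA of departure delay on carrier, using every carrier
fit_all <- aov(dep_delay ~ carrier, data = delays)
print(summary(fit_all))

# Variant: only the five carriers with the most flights
top5 <- delays %>%
  count(carrier, sort = TRUE) %>%
  head(5) %>%
  pull(carrier)

fit_top5 <- aov(dep_delay ~ carrier, data = filter(delays, carrier %in% top5))
print(summary(fit_top5))
```

With k groups, the carrier term has k - 1 degrees of freedom, so the five-carrier variant would report 4 rather than 15.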