2025 Midterm Project

Luke Hoefer, Cris Mascarenhas, Anirudh Manjesh

NYCFLIGHTS13 dataset

This project uses the nycflights13 dataset, which records every flight that departed New York City in 2013 from the three major NYC airports: JFK, LGA, and EWR. For each flight it includes the date, scheduled and actual departure and arrival times, departure and arrival delays, airline carrier code, flight number, plane tail number, origin and destination airports, air time, distance traveled, and the scheduled departure time broken down by hour and minute.

## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

GGPlot - Distance vs. Air-time (Scatter Plot)

Air time increases linearly with distance: longer routes correspond to longer air times. The same relationship holds for flights from each of the three origin airports (EWR, JFK, and LGA).
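One way to quantify the relationship is a simple linear fit of air time on distance. The sketch below is illustrative rather than the code behind the plot: `airtime_fit` is a hypothetical helper, and the toy data frame is made up so that the expected slope is obvious. The last few lines apply the same helper to each origin airport, assuming the nycflights13 package is installed.

```r
# Hypothetical helper: fit air_time ~ distance and report the
# intercept, slope, and Pearson correlation for one data frame.
airtime_fit <- function(df) {
  ok <- complete.cases(df[, c("distance", "air_time")])  # drop rows with NA
  fit <- lm(air_time ~ distance, data = df[ok, ])
  c(intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]),
    r         = cor(df$distance[ok], df$air_time[ok]))
}

# Toy data (illustrative only): exactly 0.2 minutes of air time per mile.
toy <- data.frame(distance = c(100, 200, 300, 400),
                  air_time = c(20, 40, 60, 80))
toy_fit <- airtime_fit(toy)
print(toy_fit)

# One fit per origin airport on the real data, if the package is available.
if (requireNamespace("nycflights13", quietly = TRUE)) {
  by_origin <- split(nycflights13::flights, nycflights13::flights$origin)
  print(sapply(by_origin, airtime_fit))
}
```

Similar slopes and correlations across the three columns would support the observation that the relationship is mirrored at each airport.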

Plotly - Avg. Delay and Total Flights (Pie Chart)

Flights are fairly evenly distributed across the three airports, with Newark Liberty International Airport (EWR) handling the most flights over the course of 2013 (117,127 flights). The average delay is also displayed on the pie chart.
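The two quantities shown in the pie chart can be tabulated per airport as in the sketch below. `origin_summary` is a hypothetical helper, and using departure delay for the average is an assumption, since the report does not say which delay the chart shows or which rows were dropped before counting. The toy rows are made up for illustration.

```r
# Hypothetical helper: number of flights and mean departure delay per origin.
# Assumption: "average delay" means mean dep_delay with missing values ignored.
origin_summary <- function(df) {
  counts <- table(df$origin)
  delays <- tapply(df$dep_delay, df$origin, mean, na.rm = TRUE)
  data.frame(origin        = names(counts),
             total_flights = as.vector(counts),
             avg_dep_delay = as.vector(delays))
}

# Toy data (illustrative only).
toy <- data.frame(origin    = c("EWR", "EWR", "JFK", "LGA"),
                  dep_delay = c(10, NA, 4, -2))
toy_summary <- origin_summary(toy)
print(toy_summary)

# Same summary on the real data, if the package is available.
if (requireNamespace("nycflights13", quietly = TRUE)) {
  print(origin_summary(nycflights13::flights))
}
```

Because the helper counts every row it is given, its totals on the unfiltered data will not necessarily match the figure quoted above.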

GGPlot - Months vs. Total Flights (Bar Plot)

The most flights depart during the summer months of July and August. Across the year, the average is 28064.67 flights per month.
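The monthly average follows directly from the row count of the dataset: `flights` has 336,776 rows, and 336776 / 12 ≈ 28064.67. The sketch below shows that arithmetic alongside a hypothetical helper, `monthly_counts`, that tabulates flights per month; the toy data is made up for illustration.

```r
# Hypothetical helper: number of flights in each calendar month (1 to 12),
# including months with zero flights.
monthly_counts <- function(df) {
  table(factor(df$month, levels = 1:12))
}

# Toy data (illustrative only): six flights, three of them in July.
toy <- data.frame(month = c(1, 1, 7, 7, 7, 8))
toy_counts   <- monthly_counts(toy)
busiest      <- names(toy_counts)[which.max(toy_counts)]
toy_mean     <- mean(as.vector(toy_counts))
print(busiest)
print(toy_mean)

# Arithmetic behind the reported average: total rows in flights / 12 months.
reported_mean <- round(336776 / 12, 2)
print(reported_mean)

# Per-month counts on the real data, if the package is available.
if (requireNamespace("nycflights13", quietly = TRUE)) {
  print(monthly_counts(nycflights13::flights))
}
```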

Filter Data

Remove outliers using the interquartile range (IQR): compute lower and upper bounds from the quartiles and keep only the rows of the flights sample that fall within those bounds.

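library(nycflights13) # provides the flights data
library(dplyr)        # provides %>% and filter()

# Use a random sample of 1000 flights for simplicity
flights_sample = flights[sample(nrow(flights), 1000, replace = FALSE), ]

# Quartiles and interquartile range of departure time
Q1 = quantile(flights_sample[["dep_time"]], 0.25, na.rm = TRUE)
Q3 = quantile(flights_sample[["dep_time"]], 0.75, na.rm = TRUE)
iqr = Q3 - Q1 # lowercase name avoids masking stats::IQR()

lower_bound = Q1 - 1.5 * iqr
upper_bound = Q3 + 1.5 * iqr

# Apply the bounds to the same column they were computed from
flights_sample = flights_sample %>% filter(dep_time >= lower_bound & dep_time <= upper_bound)
cat("Lower Bound:", lower_bound, "\n", "IQR", iqr, "\n", "Upper Bound:", upper_bound)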
## Lower Bound: -284.625 
##  IQR 816.25 
##  Upper Bound: 2980.375
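The bound calculation above can be wrapped in a small helper so that the quantiles and the filter always use the same column. This is a minimal base R sketch, not the report's own code: `iqr_bounds` and `filter_iqr` are hypothetical names, and the toy data is made up so the result is easy to check by hand.

```r
# Hypothetical helper: Tukey-style bounds, k * IQR beyond the quartiles.
iqr_bounds <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  c(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

# Hypothetical helper: keep rows whose `column` value lies within the bounds.
# Rows with a missing value in `column` are dropped.
filter_iqr <- function(df, column, k = 1.5) {
  x <- df[[column]]
  b <- iqr_bounds(x, k)
  keep <- !is.na(x) & x >= b[["lower"]] & x <= b[["upper"]]
  df[keep, , drop = FALSE]
}

# Toy data (illustrative only): quartiles are 2 and 4, so the bounds are -1 and 7.
toy <- data.frame(dep_delay = c(1, 2, 3, 4, 100, NA))
toy_bounds   <- iqr_bounds(toy$dep_delay)
toy_filtered <- filter_iqr(toy, "dep_delay")
print(toy_bounds)
print(toy_filtered)
```

Passing the column name as an argument makes it harder to compute bounds on one variable and apply them to another.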

Plotly - Departure vs. Arrival Time vs. Distance (3D Plot)

The graph shows a roughly linear relationship between departure time and arrival time. The color bar further shows that greater distance corresponds to a longer travel time, as expected.

GGPlot - Carrier vs. Departure Delay (Box and Whisker Plot)

Departure delays are grouped by carrier. The mean departure delay is marked with a dashed red line and the median with a blue line. The mean is much larger than the median, which indicates that a number of large positive outliers are skewing the distribution to the right.
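The gap between mean and median is what a few very large delays would produce. The sketch below illustrates the effect on a made-up vector in which a single 300-minute delay pulls the mean far above the median; the values are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Toy delays in minutes (illustrative only): mostly near zero, one extreme value.
toy_delay <- c(-5, -3, -2, 0, 2, 300)

toy_mean   <- mean(toy_delay)    # pulled upward by the 300-minute delay
toy_median <- median(toy_delay)  # unaffected by how large the extreme value is
print(c(mean = toy_mean, median = toy_median))

# The same comparison on the real departure delays, if the package is available.
if (requireNamespace("nycflights13", quietly = TRUE)) {
  d <- nycflights13::flights$dep_delay
  print(c(mean = mean(d, na.rm = TRUE), median = median(d, na.rm = TRUE)))
}
```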

Statistical Analysis: ANOVA

We conducted a one-way ANOVA to determine whether mean departure delay differs significantly across airlines. The model includes all 16 carriers in the dataset.

##                 Df    Sum Sq Mean Sq F value Pr(>F)    
## carrier         15   6348034  423202   264.9 <2e-16 ***
## Residuals   328505 524819199    1598                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R reports the p-value as less than 2 × 10^-16, so we reject the null hypothesis that all carriers share the same mean departure delay: mean departure delay differs significantly between carriers. The 15 degrees of freedom for carrier correspond to 16 carriers in total (16 - 1 = 15).
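A table of this form can be produced with `aov()` from base R. The sketch below is a minimal example under stated assumptions, not necessarily the call used for the output above: the toy delays are made up, and the carrier codes are used only as group labels. With three groups the carrier row has 3 - 1 = 2 degrees of freedom, the same rule that gives 15 for 16 carriers.

```r
# Toy data (illustrative only): three carriers, four flights each.
toy <- data.frame(
  carrier   = rep(c("AA", "DL", "UA"), each = 4),
  dep_delay = c(1, 3, 2, 4, 10, 12, 11, 13, 5, 7, 6, 8)
)

# One-way ANOVA: does mean dep_delay differ between carriers?
toy_fit   <- aov(dep_delay ~ carrier, data = toy)
toy_table <- summary(toy_fit)[[1]]
print(toy_table)

toy_df <- toy_table$Df               # groups - 1, then observations - groups
toy_p  <- toy_table[["Pr(>F)"]][1]   # p-value for the carrier effect

# The same model on the real data, if the package is available.
# Rows with a missing dep_delay are dropped by aov() by default.
if (requireNamespace("nycflights13", quietly = TRUE)) {
  print(summary(aov(dep_delay ~ carrier, data = nycflights13::flights)))
}
```

A significant F test only says that at least one carrier's mean differs; a follow-up such as `TukeyHSD()` on the fitted model would be needed to see which pairs differ.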