A. Introduction

Research Question: What factors significantly influence arrival delay (arr_delay) for flights departing from New York City airports in 2013?

Flight delays impose substantial economic and personal costs on airlines and passengers. Understanding which factors contribute most strongly to arrival delays can help airlines improve scheduling efficiency and help passengers make more informed travel decisions.

The dataset used for this analysis is the NYC Flights 2013 dataset, documented by OpenIntro and widely used for teaching applied statistical analysis. The version analyzed here contains 336,776 observations (rows) and 19 variables (columns), where each observation represents a single commercial flight that departed from one of the three major airports serving New York City (JFK, LGA, or EWR) in 2013. These dimensions match the full flights table distributed in the nycflights13 R package; the nycflights table published by OpenIntro is a smaller random sample of the same flights. Key variables used in this study include arr_delay (arrival delay in minutes), dep_delay (departure delay in minutes), carrier (airline code), month, and day.

This topic was chosen because flight delays are a real-world problem with practical implications, and the dataset provides rich information suitable for regression modeling. Additionally, arrival delay is a continuous outcome variable, making it well suited for a multiple linear regression approach.


B. Data Analysis

Before conducting statistical analysis, the dataset is cleaned and prepared. First, observations with missing values in arrival delay or departure delay are removed, since both values are essential for modeling. Next, only the relevant variables are selected to simplify the analysis and improve model interpretability. The categorical variable carrier is converted to a factor so that it enters the regression as a set of indicator variables. Finally, after inspecting summary statistics, flights that arrived 60 or more minutes early or 300 or more minutes late are removed so that extreme values do not unduly influence the model.

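library(dplyr)
library(nycflights13)  # assumption: `flights` is the data frame from this package

# Keep complete cases, select the modeling variables, and encode carrier as a factor
flights_clean <- flights %>%
  filter(!is.na(arr_delay), !is.na(dep_delay)) %>%
  select(arr_delay, dep_delay, carrier, month, day) %>%
  mutate(carrier = as.factor(carrier))
summary(flights_clean)

# Drop arrival delays of 60+ minutes early or 300+ minutes late
flights_clean <- flights_clean %>%
  filter(arr_delay > -60, arr_delay < 300)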

These data-wrangling steps make the dataset suitable for regression analysis and make it more likely that the assumptions of the model are met.
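
To see how much data each cleaning step removes, the row counts can be tracked before and after filtering. The following is a minimal sketch on a small made-up data frame (toy_flights is hypothetical and its values are invented, not taken from the real data). It uses base R only and the same thresholds of -60 and 300 minutes as the code above.

```r
# Toy stand-in for the flights data; all values are invented for illustration.
toy_flights <- data.frame(
  arr_delay = c(12, NA, -75, 310, 45, -5, NA, 120),
  dep_delay = c(10, NA, -3, 295, 50, 0, 5, 118),
  carrier   = c("AA", "UA", "DL", "AA", "UA", "DL", "AA", "UA"),
  month     = c(1, 1, 2, 2, 3, 3, 4, 4)
)

n_start <- nrow(toy_flights)

# Step 1: drop rows with a missing arrival or departure delay.
complete <- toy_flights[!is.na(toy_flights$arr_delay) &
                          !is.na(toy_flights$dep_delay), ]
n_complete <- nrow(complete)

# Step 2: keep arrival delays strictly between -60 and 300 minutes.
trimmed <- complete[complete$arr_delay > -60 & complete$arr_delay < 300, ]
n_trimmed <- nrow(trimmed)

# Report how many rows each step removed and the share that remains.
cat("Removed for missing values:", n_start - n_complete, "\n")
cat("Removed as extreme delays: ", n_complete - n_trimmed, "\n")
cat("Share of rows retained:    ", n_trimmed / n_start, "\n")
```

Running the same counts on the real data would show whether the trimming step discards a negligible or a substantial share of flights, which matters when interpreting the fitted model.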


C. Statistical Analysis – Multiple Linear Regression

Multiple linear regression is used to examine how departure delay, airline carrier, and time of year are associated with arrival delay. This method is appropriate because the response variable, arrival delay, is continuous and we are interested in estimating the contribution of each predictor while holding the others constant.

The final model is specified as:

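# month is stored as an integer; factor(month) gives each month its own coefficient
model <- lm(arr_delay ~ dep_delay + carrier + factor(month), data = flights_clean)
summary(model)

In this model, the coefficient for dep_delay represents the expected change in arrival delay (in minutes) for each additional minute of departure delay, holding carrier and month constant. Each carrier coefficient is the difference in average arrival delay between that airline and the reference carrier (the first level of the carrier factor), holding the other predictors constant. Because month is entered as a factor, each month coefficient is the difference from the reference month (January), which captures seasonal variation in delays.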
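
The role of the reference category can be illustrated with a small sketch. The data frame toy below is hypothetical, with invented carrier codes and delays. The sketch shows which indicator columns lm() builds for a factor and how relevel() changes the baseline against which the other carriers are compared.

```r
# Toy data; carrier codes and delays are invented for illustration.
toy <- data.frame(
  arr_delay = c(5, 20, -3, 40, 12, 8),
  dep_delay = c(3, 25, 0, 35, 10, 6),
  carrier   = factor(c("AA", "DL", "UA", "AA", "DL", "UA"))
)

# The first factor level is the reference category by default.
ref_level <- levels(toy$carrier)[1]
print(ref_level)

# The design matrix shows the indicator columns built for carrier;
# there is no column for the reference level.
X <- model.matrix(arr_delay ~ dep_delay + carrier, data = toy)
design_cols <- colnames(X)
print(design_cols)

# relevel() switches the baseline, here to the arbitrarily chosen "UA".
toy$carrier <- relevel(toy$carrier, ref = "UA")
fit <- lm(arr_delay ~ dep_delay + carrier, data = toy)
coef_names <- names(coef(fit))
print(coef_names)
```

Changing the reference level does not change the fitted values, only the comparison each carrier coefficient expresses.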

Model Diagnostics and Assumptions

The following assumptions are explicitly checked: linearity, independence of observations, homoscedasticity, normality of residuals, and multicollinearity.

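par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2 x 2 grid
plot(model)
par(mfrow = c(1, 1))  # restore the default single-plot layout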
  • Linearity: The Residuals vs Fitted plot shows no strong curvature, suggesting a linear relationship.
  • Independence: Each observation represents a distinct flight, so independence is a reasonable working assumption, although flights on the same day may share causes of delay such as weather.
  • Homoscedasticity: The Scale–Location plot shows relatively constant variance.
  • Normality: The Normal Q–Q plot indicates residuals are approximately normally distributed.
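  • Multicollinearity: Variance inflation factors are computed with vif() from the car package.
library(car)
vif(model)

Because the model includes factor predictors, vif() reports generalized VIFs (GVIF) together with GVIF^(1/(2*Df)), which is comparable across terms with different degrees of freedom. By a common rule of thumb, VIF values below 5 (equivalently, GVIF^(1/(2*Df)) below about 2.2) indicate that multicollinearity is not a concern.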
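
For a numeric predictor, the variance inflation factor equals 1 / (1 - R²), where R² comes from regressing that predictor on the remaining predictors. The sketch below computes this by hand on simulated data. The variable names x1, x2, x3 and the helper vif_by_hand are hypothetical and are not part of the car package. It is meant only to show what the statistic measures.

```r
set.seed(1)  # make the simulated example reproducible

# Simulated predictors: x2 is built to be correlated with x1, x3 is not.
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)
x3 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + 0.5 * x3 + rnorm(n)
sim <- data.frame(y, x1, x2, x3)

# VIF for one predictor: regress it on the remaining predictors
# and convert the resulting R-squared to 1 / (1 - R-squared).
vif_by_hand <- function(target, data, predictors) {
  others <- setdiff(predictors, target)
  aux_fit <- lm(reformulate(others, response = target), data = data)
  1 / (1 - summary(aux_fit)$r.squared)
}

predictors <- c("x1", "x2", "x3")
vifs <- sapply(predictors, vif_by_hand, data = sim, predictors = predictors)
print(round(vifs, 2))
```

The correlated pair x1 and x2 should show clearly inflated values, while x3 should stay close to 1, the minimum possible value.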


D. Conclusion and Future Directions

The analysis shows that departure delay is the strongest predictor of arrival delay: each additional minute of departure delay is associated with a statistically significant increase in expected arrival delay, holding carrier and month constant. Airline carrier and month are also associated with differences in arrival delay, which is consistent with operational practices and seasonal effects playing a role.

These findings address the research question directly and show that on-time performance at arrival is closely tied to punctuality at departure. Because the data are observational, the estimates describe associations rather than causal effects. Airlines could use this information to focus on reducing departure bottlenecks, while passengers might consider carrier choice and travel month when planning trips.

Future research could incorporate additional variables such as weather conditions, origin and destination airports, or aircraft type. Logistic regression could also be used to model the probability of a flight being delayed rather than the magnitude of the delay.
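
As a rough illustration of the logistic regression idea, the sketch below fits glm() with a binomial family to simulated data. The data are simulated, not drawn from the flights dataset, and the 15-minute cutoff used to label a flight as delayed is an assumed definition chosen for the example.

```r
set.seed(42)  # make the simulated example reproducible

# Simulated flights: arrival delay tracks departure delay plus noise.
n <- 500
dep_delay <- round(rnorm(n, mean = 10, sd = 30))
arr_delay <- dep_delay + rnorm(n, mean = -5, sd = 15)
sim <- data.frame(dep_delay, arr_delay)

# Binary outcome; the 15-minute cutoff is an assumed definition of "delayed".
sim$delayed <- as.integer(sim$arr_delay > 15)

# Logistic regression: log-odds of a delayed arrival as a function of dep_delay.
logit_fit <- glm(delayed ~ dep_delay, data = sim, family = binomial)

# Predicted probability of a delayed arrival for a few departure delays.
new_flights <- data.frame(dep_delay = c(0, 20, 60))
p_delayed <- predict(logit_fit, newdata = new_flights, type = "response")
print(round(p_delayed, 3))
```

Unlike the linear model, the coefficients here are on the log-odds scale, so exp(coef(logit_fit)) gives the multiplicative change in the odds of a delayed arrival per additional minute of departure delay.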


E. References

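OpenIntro Data. NYC Flights 2013 Dataset. https://www.openintro.org/data/index.php?data=nycflights

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Wickham, H., et al. dplyr: A Grammar of Data Manipulation. R package.

Fox, J., and Weisberg, S. car: Companion to Applied Regression. R package.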