Abstract
This research study aimed to develop a machine learning model capable of accurately predicting future flight delays. By leveraging a combination of environmental and technical information, various regression models were employed to analyze and interpret the data. The research goal was to identify a model that could effectively forecast flight delays based on these factors. The findings of this study provide valuable insights into the potential for using advanced machine learning techniques in the aviation industry to optimize flight scheduling and improve overall operational efficiency.
In this study, we analyze the data and compare three regression models: the simple regression model, the multiple regression model, and the polynomial regression model. The data come from an open dataset provided by IBM. The analysis is carried out in RStudio as an R Markdown document, which keeps both the exploration and the presentation of the findings structured and reproducible.
Simple regression is a statistical technique used to model the relationship between two variables: the independent variable (also known as the predictor variable) and the dependent variable (also known as the response variable). It assumes that the relationship between the two variables can be represented by a straight line. The equation for simple regression is typically represented as:
$$y = b_0 + b_1 x$$
Here, y is the dependent variable, x is the independent variable, b0 is the y-intercept, and b1 is the slope of the line. The regression model estimates b0 and b1 from the given data points by minimizing the sum of squared differences between the predicted values and the actual values of the dependent variable.
Multiple regression is an extension of simple regression that involves more than one independent variable. It allows us to examine the relationship between a dependent variable and several predictors simultaneously. The objective is to estimate the coefficients of the independent variables that best predict the dependent variable. The equation for multiple regression can be represented as:
$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$
Here, y is the dependent variable, x1, x2, …, xn are the independent variables, b0 is the y-intercept, and b1, b2, …, bn are the coefficients or slopes associated with each independent variable. The multiple regression model estimates b0, b1, b2, …, bn from the given data by minimizing the sum of squared differences between the predicted values and the actual values of the dependent variable.
Polynomial regression is a form of multiple regression in which the relationship between the independent and dependent variables is modeled with a polynomial of a specified degree. The relationship between the variables is not assumed to be a straight line; it can take a curvilinear shape. The equation for polynomial regression can be represented as:
$$y = b_0 + b_1 x + b_2 x^2 + \dots + b_n x^n$$
Here, y is the dependent variable, x is the independent variable, b0, b1, b2, …, bn are the coefficients, and n is the degree of the polynomial. By adjusting the degree of the polynomial, we can fit different types of curves to the data.
Polynomial regression allows for more flexible modeling, capturing complex relationships between variables. However, higher-degree polynomials can also introduce overfitting if not used judiciously, leading to poor generalization to new data.
To begin the analysis, we retrieve and examine the dataset. The download.file() function fetches the archive, untar() extracts it, and read_csv() loads the CSV file. The paged_table() function then displays the dataset as a paginated table, which makes it convenient to inspect its contents.
library(readr)      # provides read_csv()
library(rmarkdown)  # provides paged_table()
url <- "https://dax-cdn.cdn.appdomain.cloud/dax-airline/1.0.1/lax_to_jfk.tar.gz"
download.file(url, destfile = "lax_to_jfk.tar.gz")
untar("lax_to_jfk.tar.gz", tar = "internal")
data_airline <- read_csv("lax_to_jfk/lax_to_jfk.csv")
paged_table(data_airline)
The last two columns, “DivDistance” and “DivArrDelay,” are read as logical columns by default. To fix this, we can pass the “col_types” argument to the “read_csv()” function and specify the appropriate type for each of these columns, so that they are recognized as numeric variables. This allows correct numerical analysis and computations on these columns.
data_airline <- read_csv("lax_to_jfk/lax_to_jfk.csv",
  col_types = cols(
    "DivDistance" = col_number(),
    "DivArrDelay" = col_number()
  ))
To begin the simple linear regression analysis, we first identify the variable that is most likely to influence the flight delay. Here we examine the relationship between Arrival Delay and Departure Delay for flights of the Reporting Airline “Alaska.” Fitting a linear model with the lm() function shows how strongly Departure Delay influences Arrival Delay for these flights. This preliminary analysis serves as a baseline for the further investigation and model development that follow.
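The fitting step itself is not shown in this section, so the following is only a minimal sketch of how such a model might be produced. The data frame is a small hypothetical stand-in that reuses the column names from the summary output; its values are invented for illustration and will not reproduce the reported coefficients. With the real data, alaska_delay would be the Alaska subset of data_airline.

```r
# Hypothetical stand-in for the Alaska subset (invented values, not the IBM data).
# The exact filter used to build alaska_delay is not shown in the source.
alaska_delay <- data.frame(
  DepDelayMinutes = c(0, 5, 12, 20, 35, 50, 75, 110),
  ArrDelayMinutes = c(14, 22, 20, 33, 38, 55, 70, 99)
)

# Pearson correlation between departure and arrival delay.
delay_correlation <- cor(alaska_delay$DepDelayMinutes,
                         alaska_delay$ArrDelayMinutes)
delay_correlation

# Simple linear regression: ArrDelayMinutes = b0 + b1 * DepDelayMinutes.
linear_model <- lm(ArrDelayMinutes ~ DepDelayMinutes, data = alaska_delay)
coef(linear_model)
```

Calling summary() on the fitted object then prints the residual quartiles, the coefficient table, and the R-squared values.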
summary(linear_model)
##
## Call:
## lm(formula = ArrDelayMinutes ~ DepDelayMinutes, data = alaska_delay)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.234 -12.716 -1.354 7.747 93.646
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.3544 2.5084 6.919 2.9e-10 ***
## DepDelayMinutes 0.7523 0.0399 18.855 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.03 on 113 degrees of freedom
## Multiple R-squared: 0.7588, Adjusted R-squared: 0.7567
## F-statistic: 355.5 on 1 and 113 DF, p-value: < 2.2e-16
Printing the summary of the simple linear regression model gives the values of b0 (intercept) and b1 (slope), here 17.354 and 0.7523, respectively. In other words, each additional minute of departure delay is associated with about 0.75 additional minutes of arrival delay. With these coefficients, we can use the fitted model to predict the Arrival Delay for three new Departure Delay values, relying on the linear relationship captured by the model.
predicted_value
## fit lwr upr
## 1 26.38175 21.98838 30.77512
## 2 31.64769 27.52630 35.76908
## 3 35.40907 31.44593 39.37222
The resulting table has three columns. The first column, fit, is the model's point prediction for each input value. The remaining two columns, lwr and upr, are the lower and upper bounds of a confidence interval around that prediction, that is, the range expected to contain the true mean Arrival Delay for the given Departure Delay at the chosen confidence level.
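As a worked check, the fit column can be reproduced by hand from the reported coefficients. The three input values used here (12, 19, and 24 minutes) are not stated in the text; they are inferred by solving fit = b0 + b1 * x for each printed value, so treat them as an assumption.

```r
# Coefficients as printed by linear_model$coefficients in this report.
b0 <- 17.3544286
b1 <- 0.7522769

# Departure delays in minutes, inferred from the printed fit values (assumption).
new_depdelay <- data.frame(DepDelayMinutes = c(12, 19, 24))

# Point predictions: fit = b0 + b1 * x.
fit_by_hand <- b0 + b1 * new_depdelay$DepDelayMinutes
round(fit_by_hand, 5)

# With the fitted model object available, the same table (including the
# lwr and upr columns) would come from a call of this form:
#   predict(linear_model, newdata = new_depdelay, interval = "confidence")
```

The rounded results agree with the fit column of the table to five decimal places.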
To inspect the coefficients directly, we can extract them from the fitted model object with the “$” operator. Selecting the “coefficients” component returns the estimated intercept and slope without the rest of the summary.
linear_model$coefficients
## (Intercept) DepDelayMinutes
## 17.3544286 0.7522769
To judge which of the analyses fits best, visual assessment is a useful first step. Regression plots generated with the “ggplot()” function display the data points together with the fitted regression line, so the model’s goodness of fit can be evaluated by eye. Comparing these plots across predictor variables helps us decide which model fits the response variable most closely.
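A regression plot of this kind might be drawn as in the sketch below. It assumes the ggplot2 package is installed and uses a small hypothetical data frame (invented values) in place of the real Alaska subset; the object name regression_plot is illustrative.

```r
library(ggplot2)

# Hypothetical stand-in for the Alaska subset (invented values).
alaska_delay <- data.frame(
  DepDelayMinutes = c(0, 5, 12, 20, 35, 50, 75, 110),
  ArrDelayMinutes = c(14, 22, 20, 33, 38, 55, 70, 99)
)

# Scatter plot of the observations with a fitted straight line on top.
regression_plot <- ggplot(alaska_delay,
                          aes(x = DepDelayMinutes, y = ArrDelayMinutes)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE)

print(regression_plot)
```

Here method = "lm" asks geom_smooth() to fit a linear model, and se = TRUE adds a shaded confidence band around the line.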
From this plot, we can observe a positive correlation between Arrival Delay Minutes and Departure Delay Minutes, as indicated by the positive slope of the regression line. This suggests that an increase in Departure Delay Minutes is associated with a corresponding increase in Arrival Delay Minutes.
A residual plot shows the differences between the Observed Values and the Predicted Values and helps judge how well the regression model fits. To build it, we add a new column of predicted values to the dataset, generated with the linear model fitted earlier in the chapter. Each predicted value then sits next to its observed value, so the two can be compared directly.
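The step described above can be sketched in base R as follows. The data frame is a hypothetical stand-in with invented values, and the column names predicted and residual are illustrative choices rather than names taken from the source.

```r
# Hypothetical stand-in for the Alaska subset (invented values).
alaska_delay <- data.frame(
  DepDelayMinutes = c(0, 5, 12, 20, 35, 50, 75, 110),
  ArrDelayMinutes = c(14, 22, 20, 33, 38, 55, 70, 99)
)
linear_model <- lm(ArrDelayMinutes ~ DepDelayMinutes, data = alaska_delay)

# Add the model's predictions and the observed-minus-predicted residuals.
alaska_delay$predicted <- predict(linear_model)
alaska_delay$residual  <- alaska_delay$ArrDelayMinutes - alaska_delay$predicted

# Residuals against predictions; a good fit scatters evenly around zero.
plot(alaska_delay$predicted, alaska_delay$residual,
     xlab = "Predicted arrival delay (min)", ylab = "Residual (min)")
abline(h = 0, lty = 2)
```

The manually computed residuals are the same values that residuals(linear_model) returns.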
To examine the performance of the linear model, we will employ the plot() function to generate four distinct plots:
Residual Plot: This plot depicts the difference between the observed and predicted values (residuals). It provides insights into the distribution and patterns of the residuals, helping to identify any deviations from linearity, heteroscedasticity, or outliers.
Q-Q Plot: This plot compares the quantiles of the residuals with the quantiles expected under a normal distribution. It helps assess whether the residuals are normally distributed: if the points lie close to a straight line, the normality assumption is reasonable.
Scale-Location Plot: This plot helps evaluate the homoscedasticity assumption by showing the spread of the residuals across the range of predicted values. A roughly horizontal line with evenly scattered points indicates homoscedasticity, while a clear trend or funnel shape indicates heteroscedasticity.
Residual vs. Leverage Plot: This plot displays the influence of individual observations on the regression model and shows which data points have a substantial impact on the regression line. The dashed lines mark Cook's distance thresholds; data points beyond them have a significant influence on the model’s results.
By analyzing these four plots, we can gain valuable insights into the validity and performance of the linear model, enabling us to make informed decisions regarding its suitability and accuracy.
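A minimal sketch of producing the four diagnostic plots on a single page is given here, again on a hypothetical stand-in data frame with invented values. The cooks.distance() call is an addition for illustration: it returns the influence measure behind the thresholds drawn in the leverage plot.

```r
# Hypothetical stand-in for the Alaska subset (invented values).
alaska_delay <- data.frame(
  DepDelayMinutes = c(0, 5, 12, 20, 35, 50, 75, 110),
  ArrDelayMinutes = c(14, 22, 20, 33, 38, 55, 70, 99)
)
linear_model <- lm(ArrDelayMinutes ~ DepDelayMinutes, data = alaska_delay)

# Arrange the four default lm diagnostic plots in a 2 x 2 grid.
old_par <- par(mfrow = c(2, 2))
plot(linear_model)
par(old_par)  # restore the previous graphics settings

# Numeric counterpart of the leverage plot: one Cook's distance per observation.
cooks <- cooks.distance(linear_model)
round(cooks, 3)
```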
In multiple linear regression, we can predict a variable—in this case, the Arrival Delay—based on the analysis of different predictor variables. For the current analysis, we will employ two predictor variables: DepDelayMinutes and LateAircraftDelay. By considering these variables simultaneously, we can develop a multiple linear regression model that incorporates their effects on the Arrival Delay.
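The model call is not shown in this section; a minimal sketch of a two-predictor fit is given here. The formula matches the one echoed in the summary output, while the data frame is a hypothetical stand-in whose values are invented and will not reproduce the reported estimates.

```r
# Hypothetical stand-in for the Alaska subset (invented values).
alaska_delay <- data.frame(
  DepDelayMinutes   = c(0, 5, 12, 20, 35, 50, 75, 110),
  LateAircraftDelay = c(0, 0, 10, 0, 20, 15, 40, 30),
  ArrDelayMinutes   = c(14, 22, 20, 33, 38, 55, 70, 99)
)

# Multiple linear regression with two predictors.
multiple_model <- lm(ArrDelayMinutes ~ DepDelayMinutes + LateAircraftDelay,
                     data = alaska_delay)
coef(multiple_model)

# For comparison: the single-predictor model on the same data.
simple_model <- lm(ArrDelayMinutes ~ DepDelayMinutes, data = alaska_delay)
c(simple   = summary(simple_model)$r.squared,
  multiple = summary(multiple_model)$r.squared)
```

Adding a predictor can never lower the plain R-squared, which is why the adjusted R-squared and the p-value of the new coefficient are the more informative numbers when deciding whether the extra variable helps.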
summary(multiple_model)
##
## Call:
## lm(formula = ArrDelayMinutes ~ DepDelayMinutes + LateAircraftDelay,
## data = alaska_delay)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.188 -12.545 -1.317 7.791 93.683
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.31707 2.53786 6.823 4.78e-10 ***
## DepDelayMinutes 0.75556 0.04822 15.668 < 2e-16 ***
## LateAircraftDelay -0.01028 0.08407 -0.122 0.903
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.11 on 112 degrees of freedom
## Multiple R-squared: 0.7588, Adjusted R-squared: 0.7545
## F-statistic: 176.2 on 2 and 112 DF, p-value: < 2.2e-16
Polynomial regression is a special case of the general linear regression model, or multiple linear regression. Although the fitted curve is nonlinear in the predictor, because it involves higher-order terms of the predictor variable, the model remains linear in its coefficients. Including polynomial terms therefore lets us capture nonlinearities in the data within the framework of linear regression.
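That a polynomial fit is an ordinary linear model with extra columns can be checked directly. In this sketch the x and y values are hypothetical (roughly quadratic, invented for illustration); the squared term is written once as an explicit second predictor and once through poly() with raw = TRUE.

```r
# Hypothetical, roughly quadratic data (invented values).
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 3.9, 9.2, 15.8, 25.3, 35.9, 49.1, 64.2)

# Squared term written out as a second predictor of a multiple regression.
fit_explicit <- lm(y ~ x + I(x^2))

# Same model expressed with poly(); raw = TRUE keeps the plain powers x and x^2.
fit_poly <- lm(y ~ poly(x, 2, raw = TRUE))

# Both formulations estimate the same three coefficients.
all.equal(unname(coef(fit_explicit)), unname(coef(fit_poly)))
```

Without raw = TRUE, poly() uses orthogonal polynomials: the fitted values are the same, but the individual coefficients are no longer the b0, b1, b2 of the plain power form.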
Let’s make an example:
1. LINEAR GRAPH
2. POLYNOMIAL GRAPH
To draw a curve that follows the data more closely, we can call the “geom_smooth()” function with the additional argument “formula = y ~ poly(x, 5)” (with a linear-model smoother such as method = "lm"), where 5 is the degree of the polynomial. This fits a polynomial regression curve to the data instead of a straight line. The degree determines the complexity of the curve and should be chosen with the data in mind, since a degree that is too high follows noise rather than the underlying relationship.
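The linear and polynomial fits can be overlaid in one plot, as in the sketch below. It assumes ggplot2 is installed; the data frame toy and its cubic-plus-wiggle values are invented for illustration and are not the data used in the report.

```r
library(ggplot2)

# Hypothetical curved data (invented): a cubic trend plus a deterministic wiggle.
toy <- data.frame(x = seq(0, 20, by = 1))
toy$y <- 0.4 * (toy$x - 10)^3 + 20 * sin(toy$x)

polynomial_plot <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  # Straight-line fit, for reference.
  geom_smooth(method = "lm", formula = y ~ x,
              se = FALSE, colour = "grey40") +
  # Degree-5 polynomial fit on the same data.
  geom_smooth(method = "lm", formula = y ~ poly(x, 5),
              se = FALSE, colour = "blue")

print(polynomial_plot)
```

Changing the 5 in poly(x, 5) is the quickest way to see how the degree affects the shape of the curve.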
The final step in determining the best regression model is to compare numeric measures such as the “R-squared” and the “Mean Squared Error” (MSE). The R-squared coefficient is the proportion of the variance in the dependent variable that is explained by the independent variables; it lies between 0 and 1 and is often expressed as a percentage. A higher R-squared value implies a better fit between the model and the actual data. The Mean Squared Error is the average of the squared errors, that is, of the squared differences between the actual and estimated values, and lower MSE values indicate better performance. The Root Mean Squared Error (RMSE) is the square root of the MSE; because it is expressed in the same units as the dependent variable (here, minutes), it is easier to interpret as a typical error size.
To calculate these values for each previously built model in R, the R-squared can be read from the model summary, the MSE is the mean of the squared residuals, and the RMSE is its square root.
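One way these three numbers might be computed is sketched here. The helper model_metrics() is hypothetical (it is not part of the original analysis), the data frame holds invented values, and the polynomial degree of 2 is an arbitrary choice for the illustration.

```r
# Hypothetical helper: R-squared, MSE, and RMSE for a fitted lm object.
model_metrics <- function(model) {
  mse <- mean(residuals(model)^2)  # average squared residual (divides by n)
  c(r_squared = summary(model)$r.squared, mse = mse, rmse = sqrt(mse))
}

# Hypothetical stand-in for the Alaska subset (invented values).
alaska_delay <- data.frame(
  DepDelayMinutes   = c(0, 5, 12, 20, 35, 50, 75, 110),
  LateAircraftDelay = c(0, 0, 10, 0, 20, 15, 40, 30),
  ArrDelayMinutes   = c(14, 22, 20, 33, 38, 55, 70, 99)
)

simple_model   <- lm(ArrDelayMinutes ~ DepDelayMinutes, data = alaska_delay)
multiple_model <- lm(ArrDelayMinutes ~ DepDelayMinutes + LateAircraftDelay,
                     data = alaska_delay)
poly_model     <- lm(ArrDelayMinutes ~ poly(DepDelayMinutes, 2),
                     data = alaska_delay)

# One row per model, one column per metric.
metrics <- rbind(simple     = model_metrics(simple_model),
                 multiple   = model_metrics(multiple_model),
                 polynomial = model_metrics(poly_model))
round(metrics, 4)
```

Because all three models are fitted to the same response and the same rows, their MSE values are on the same scale and can be compared directly.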
Model         R-squared    MSE           RMSE
Simple        0.7588008    394.0639      19.85104
Multiple      0.7588329    394.0113      19.84972
Polynomial    0.9930978    0.03792681    0.1947481
When evaluating the regression models, two criteria apply. A higher R-squared value means that a larger proportion of the variance in the dependent variable is explained by the independent variables, so the model captures more of the underlying patterns and trends in the data. A smaller Mean Squared Error (MSE) means that the model’s predictions lie closer to the actual values, reflecting greater accuracy.
Based on both the R-squared value and the MSE value, the polynomial model shows the highest R-squared and the smallest MSE among the models considered, so we conclude that it provides the best fit for the data. The higher R-squared value suggests a stronger correlation between the predicted and observed values, while the smaller MSE value indicates a smaller average discrepancy between them. This comparison is only meaningful when all models are evaluated on the same response variable and the same observations, and because higher-degree polynomials can overfit, the result should be confirmed on data that was not used for fitting.
Now that we have identified the best model with the highest R-squared value and the smallest MSE value, we can leverage its predictive capabilities to estimate the probable delays that flights may encounter based on the available technical and environmental information. By inputting these data into the model, we can rely on its calculations to generate predictions of the most likely delays.
The advantage of using the selected model is that it takes into account the relationships captured during the analysis phase, allowing for more accurate predictions. By incorporating time-sensitive technical and environmental data into the model, we can obtain timely estimates of the expected delays that flights might experience. This approach streamlines the process of predicting flight delays, providing a convenient and efficient means of estimating delays based on the input parameters. By leveraging the power of the selected model, we can make informed decisions and take appropriate actions to mitigate the impact of potential delays in the aviation industry.