Variable Descriptions
Aircraft: The make of an aircraft (Boeing or Airbus).
Duration (in minutes): Flight duration between taking off and landing. The duration of a normal flight should always be greater than 40 minutes.
No_pasg: The number of passengers in a flight.
Speed_ground (in miles per hour): The ground speed of an aircraft when passing over the threshold of the runway. If its value is less than 30 MPH or greater than 140 MPH, the landing is considered abnormal.
Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of the runway. If its value is less than 30 MPH or greater than 140 MPH, the landing is considered abnormal.
Height (in meters): The height of an aircraft when it is passing over the threshold of the runway. The landing aircraft is required to be at least 6 meters high at the threshold of the runway.
Pitch (in degrees): Pitch angle of an aircraft when it is passing over the threshold of the runway.
Distance (in feet): The landing distance of an aircraft. More specifically, it refers to the distance between the threshold of the runway and the point where the aircraft can be fully stopped. The length of the airport runway is typically less than 6000 feet.
Step 1
Load necessary libraries such as dplyr, ggplot2, and import the FAA datasets (FAA1 and FAA2) for analysis.
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
# Packages used in this analysis
library(readxl)   # read_xls()
library(dplyr)    # left_join() and other data manipulation
library(ggplot2)  # plots
setwd("C:/Users/renju/OneDrive/Documents/MSBANA/Spring/Spring Term 1/Statistical Modelling - 7042/Week 1 A Review of Linear Regression Models")
# Read in Dataset 1
FAA1 <- read_xls("FAA1.xls")
# Read in Dataset 2
FAA2 <- read_xls("FAA2.xls")
Step 2
Examine the structure, dimensions, and summary statistics of the FAA1 and FAA2 datasets to gain an understanding of the variables and their data types. Starting with the dataset structure is an effective way to explore the datasets. For instance, the FAA1 dataset contains 800 observations and 8 variables, as shown below. This provides a clear overview of the data’s dimensions, layout, and basic properties.
## tibble [800 × 8] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:800] "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num [1:800] 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num [1:800] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:800] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:800] 109 103 NA NA NA ...
## $ height : num [1:800] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:800] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:800] 3370 2988 1145 1664 1050 ...
## tibble [150 × 7] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:150] "boeing" "boeing" "boeing" "boeing" ...
## $ no_pasg : num [1:150] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:150] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:150] 109 103 NA NA NA ...
## $ height : num [1:150] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:150] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:150] 3370 2988 1145 1664 1050 ...
The FAA2 dataset consists of 150 observations and 7 variables, as shown above.
Key differences between the two datasets (FAA1 and FAA2):
Additional variable: FAA1 includes an extra variable, duration, which is not present in FAA2. All other variables are identical.
Number of observations: FAA2 has significantly fewer observations than FAA1 (150 versus 800).
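The comparison above can be sketched in a few lines of base R. The two small data frames are hypothetical stand-ins for FAA1 and FAA2 (the values are made up for illustration), and the object names are assumptions rather than the report's own code.

```r
# Toy stand-ins for FAA1 and FAA2 (hypothetical values, not the real data)
faa1_toy <- data.frame(
  aircraft = c("boeing", "airbus", "boeing"),
  duration = c(98.5, 125.7, 112.0),
  no_pasg  = c(53, 69, 61),
  distance = c(3370, 2988, 1145)
)
faa2_toy <- data.frame(
  aircraft = c("boeing", "airbus"),
  no_pasg  = c(53, 70),
  distance = c(3370, 1050)
)

# Dimensions: rows x columns
dim(faa1_toy)
dim(faa2_toy)

# Columns present in one data frame but not in the other
only_in_first  <- setdiff(names(faa1_toy), names(faa2_toy))
only_in_second <- setdiff(names(faa2_toy), names(faa1_toy))
only_in_first
only_in_second
```

Applied to the real data, the same `setdiff()` call on `names(FAA1)` and `names(FAA2)` would be expected to return only `duration`.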
Step 3
In this step, we merge the FAA1 and FAA2 datasets using left_join(). A left join retains every row of FAA1, which is the larger dataset and the only one that contains duration. An FAA2 row that matches an FAA1 row on all key columns does not create an additional row (provided FAA2 itself contains no repeated rows), so duplicates do not inflate the result and no data is lost from FAA1.
The merge is conducted by joining on the common columns: aircraft, no_pasg, speed_ground, speed_air, height, pitch, and distance. After merging, we review the structure and overview of the resulting dataset to verify the operation’s success.
## Rows: 800
## Columns: 8
## $ aircraft <chr> "boeing", "boeing", "boeing", "boeing", "boeing", "boeing…
## $ duration <dbl> 98.47909, 125.73330, 112.01700, 196.82569, 90.09538, 137.…
## $ no_pasg <dbl> 53, 69, 61, 56, 70, 55, 54, 57, 61, 56, 61, 54, 54, 58, 6…
## $ speed_ground <dbl> 107.91568, 101.65559, 71.05196, 85.81333, 59.88853, 75.01…
## $ speed_air <dbl> 109.32838, 102.85141, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ height <dbl> 27.41892, 27.80472, 18.58939, 30.74460, 32.39769, 41.2149…
## $ pitch <dbl> 4.043515, 4.117432, 4.434043, 3.884236, 4.026096, 4.20385…
## $ distance <dbl> 3369.8364, 2987.8039, 1144.9224, 1664.2182, 1050.2645, 16…
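The resulting merged dataset contains 800 observations, the same number as FAA1, because a left join never adds rows that exist only in the right-hand table. The 100 observations that FAA2 shares with FAA1 are matched rather than duplicated. Note that the 50 observations unique to FAA2 are not carried into the result by a left join; retaining them would require a full join, or binding the rows and then removing duplicates.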
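To make the row-count behavior of the join concrete, here is a minimal sketch on toy data. Base R's `merge()` stands in for `left_join()` so that the example has no package dependencies; `all.x = TRUE` corresponds to a left join and `all = TRUE` to a full join. The data frames and their values are hypothetical.

```r
# Toy tables: right_toy has one row duplicated in left_toy and one unique row
left_toy <- data.frame(
  aircraft = c("boeing", "boeing", "airbus"),
  duration = c(98.5, 125.7, 150.2),
  no_pasg  = c(53, 69, 61),
  distance = c(3370, 2988, 1145)
)
right_toy <- data.frame(
  aircraft = c("boeing", "airbus"),
  no_pasg  = c(53, 70),
  distance = c(3370, 1050)
)

# Join on every column the two tables share
keys <- intersect(names(left_toy), names(right_toy))

left_joined <- merge(left_toy, right_toy, by = keys, all.x = TRUE)  # left join
full_joined <- merge(left_toy, right_toy, by = keys, all = TRUE)    # full join

nrow(left_joined)                    # same as nrow(left_toy)
nrow(full_joined)                    # left rows plus rows unique to right_toy
sum(is.na(full_joined$duration))     # rows that came only from right_toy
```

The left join returns exactly the rows of the left table, while the full join also keeps the row unique to the right table, with `duration` missing for it.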
Step 4
Verify the combined dataset’s structure, dimensions, and variable consistency to ensure a successful merge. As performed above, we can check the structure of the combined dataset.
## tibble [800 × 8] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:800] "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num [1:800] 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num [1:800] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:800] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:800] 109 103 NA NA NA ...
## $ height : num [1:800] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:800] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:800] 3370 2988 1145 1664 1050 ...
## aircraft duration no_pasg speed_ground
## Length:800 Min. : 14.76 Min. :29.00 Min. : 27.74
## Class :character 1st Qu.:119.49 1st Qu.:55.00 1st Qu.: 65.87
## Mode :character Median :153.95 Median :60.00 Median : 79.64
## Mean :154.01 Mean :60.13 Mean : 79.54
## 3rd Qu.:188.91 3rd Qu.:65.00 3rd Qu.: 92.33
## Max. :305.62 Max. :87.00 Max. :141.22
##
## speed_air height pitch distance
## Min. : 90.00 Min. :-3.546 Min. :2.284 Min. : 34.08
## 1st Qu.: 96.16 1st Qu.:23.338 1st Qu.:3.658 1st Qu.: 900.95
## Median :100.99 Median :30.147 Median :4.020 Median :1267.44
## Mean :103.83 Mean :30.122 Mean :4.018 Mean :1544.52
## 3rd Qu.:109.48 3rd Qu.:36.981 3rd Qu.:4.388 3rd Qu.:1960.44
## Max. :141.72 Max. :59.946 Max. :5.927 Max. :6533.05
## NA's :600
The sample size of the dataset is 800 and the dataset has 8 variables, 7 of which are numeric. The summary statistics for each variable are shown above.
Step 5
Highlight potential anomalies in the dataset, including extreme distance values, negative height, and invalid speed_air observations. Key findings from the initial analysis are summarized below:
Landing Distance Anomalies:
Most aircraft land within 6,000 feet, but some observations show distance values exceeding this threshold, suggesting potential runway overruns.
Negative Heights:
The minimum value of height is negative, which is unrealistic as the aircraft’s height at the runway threshold should be at least 6 meters. This indicates potential data input errors.
Missing Air Speed Values:
The speed_air variable contains 600 missing values, which must be addressed during the analysis to ensure accurate results.
Abnormally Short Flight Durations:
According to the Variable Dictionary, a normal flight duration should exceed 40 minutes. However, the minimum duration in the dataset is 14.76 minutes, which is inconsistent with expectations.
Abnormal Speed Values:
The Variable Dictionary specifies that ground speed (speed_ground) and air speed (speed_air) should range between 30 and 140 MPH for normal operations. The summary statistics reveal observations with speeds outside this range, indicating abnormal flights.
These anomalies highlight the need for data cleaning to ensure the integrity of subsequent analysis.
Data Cleaning and Further Exploration
Step 6
Filter out observations with negative height, short duration (<= 40 minutes), and abnormal speed (speed_ground or speed_air below 30 MPH or above 140 MPH).
As noted in Step 5, there are multiple abnormal values in the dataset. A value is “abnormal” if it violates the limits stated in the variable descriptions.
Height
The first check is on the height variable, whose minimum is negative. The number of observations flagged by each rule is:
## Number of observations with negative height: 5
## Number of observations with flight duration <= 40 minutes: 5
## Number of observations with abnormal air or ground speed: 3
## Total number of observations removed after filtering: 13
There were 5 observations where height was negative. A negative height at the runway threshold is physically impossible, so these values are treated as data errors.
Duration
If the duration of the flight is less than or equal to 40 minutes, the flight is considered abnormal. Let us filter for these observations to count how many exist.
There were 5 observations where duration was less than or equal to 40 minutes.
Ground Speed and Air Speed
Abnormal flights are also defined as those with speed_air and/or speed_ground less than 30 MPH or greater than 140 MPH.
There were 3 observations where speed_air or speed_ground was less than 30 MPH or greater than 140 MPH.
Filtering out identified abnormal observations
Let us filter out these observations identified above and create a new working dataset.
Thirteen total observations have been deemed abnormal and have consequently been removed.
Note: We have retained observations with missing values.
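The counting and filtering described in this step might look like the sketch below. It uses base R on a hypothetical toy data frame; the object names are assumptions. Each rule is guarded with `!is.na()` so that a missing value never flags a row, which is what keeps the observations with missing speed_air in the dataset. (With dplyr, `filter()` drops rows whose condition evaluates to NA, so the same guard is needed there.)

```r
# Toy flights (hypothetical values): rows 2, 3, and 4 each break one rule
flights_toy <- data.frame(
  duration     = c(98.5, 35.0, 120.0, 150.0, 110.0),
  speed_ground = c(107.9, 80.0, 145.0, 70.0, 90.0),
  speed_air    = c(109.3, NA, NA, NA, NA),
  height       = c(27.4, 30.0, 25.0, -2.0, 18.6)
)

# Flag each rule separately; a missing value never triggers a flag
bad_height   <- !is.na(flights_toy$height) & flights_toy$height < 0
bad_duration <- !is.na(flights_toy$duration) & flights_toy$duration <= 40
out_of_range <- function(x) !is.na(x) & (x < 30 | x > 140)
bad_speed    <- out_of_range(flights_toy$speed_ground) |
  out_of_range(flights_toy$speed_air)

# Counts per rule, then the combined filter
colSums(cbind(bad_height, bad_duration, bad_speed))
abnormal      <- bad_height | bad_duration | bad_speed
flights_clean <- flights_toy[!abnormal, ]
nrow(flights_clean)
```

Counting each rule separately before combining them also makes it easy to see whether any observation violates more than one rule.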
Step 7
Re-evaluate the structure and summary statistics of the cleaned dataset to confirm the removal of abnormal values.
In step 7, we repeat step 4 but with the dataset that now only contains normal flights. We therefore check the structure of the combined dataset as well as provide summary statistics.
## tibble [787 × 8] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:787] "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num [1:787] 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num [1:787] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:787] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:787] 109 103 NA NA NA ...
## $ height : num [1:787] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:787] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:787] 3370 2988 1145 1664 1050 ...
## aircraft duration no_pasg speed_ground
## Length:787 Min. : 41.95 Min. :29.00 Min. : 33.57
## Class :character 1st Qu.:119.68 1st Qu.:55.00 1st Qu.: 66.01
## Mode :character Median :154.24 Median :60.00 Median : 79.73
## Mean :154.80 Mean :60.13 Mean : 79.61
## 3rd Qu.:189.62 3rd Qu.:65.00 3rd Qu.: 92.13
## Max. :305.62 Max. :87.00 Max. :136.66
##
## speed_air height pitch distance
## Min. : 90.00 Min. : 0.08611 Min. :2.284 Min. : 41.72
## 1st Qu.: 96.16 1st Qu.:23.46558 1st Qu.:3.654 1st Qu.: 902.82
## Median :100.99 Median :30.16708 Median :4.014 Median :1267.76
## Mean :103.67 Mean :30.29378 Mean :4.015 Mean :1541.02
## 3rd Qu.:109.48 3rd Qu.:36.98364 3rd Qu.:4.385 3rd Qu.:1960.41
## Max. :136.42 Max. :59.94596 Max. :5.927 Max. :6309.95
## NA's :591
There are 787 observations and 8 variables in this dataset. The variable speed_air has 591 missing values.
Step 8
Generate histograms for duration, speed_air, height, and other numeric variables to visualize their distributions.
Let us create histograms for our 7 numeric variables. These
histograms will provide insight regarding the distribution and
variability of the observations within each variable. See Step 9 for
conclusions regarding these histograms.
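One way to draw a histogram for every numeric column is sketched below with base graphics. The data are simulated (hypothetical), and the column names only mirror a few of the report's variables; the report itself may have used ggplot2.

```r
set.seed(1)
# Simulated stand-in data (hypothetical, not the FAA observations)
flights_toy <- data.frame(
  speed_ground = rnorm(200, mean = 80, sd = 18),
  height       = rnorm(200, mean = 30, sd = 10),
  distance     = rexp(200, rate = 1 / 1500)
)

# Select the numeric columns and draw one histogram per column, side by side
numeric_cols <- names(flights_toy)[sapply(flights_toy, is.numeric)]
old_par <- par(mfrow = c(1, length(numeric_cols)))
hist_list <- lapply(numeric_cols, function(col) {
  hist(flights_toy[[col]], main = col, xlab = col)
})
par(old_par)
names(hist_list) <- numeric_cols
```

`hist()` silently ignores missing values, so a column such as speed_air would be plotted from its non-missing observations only.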
Step 9
Derive insights such as skewness, peaks, or missing value patterns from the histograms for numeric variables. Another “presentation slide” one might want to present to the FAA agents would include the following points:
Because the dataset no longer includes flights deemed “abnormal,” we can say that a normal flight’s ground speed averages around 80 MPH and that its distribution is roughly symmetric. Air speed behaves differently: it is centered around 100 MPH but is strongly skewed to the right (suggesting possible outliers).
The landing distance is also highly skewed to the right; however, very few landings exceed 6000 feet.
Apart from speed_air and distance, most variables are approximately symmetrically distributed. Not only do the histograms show this, but the mean and median (listed in the summary statistics) for each variable are similar.
There are 196 observations with complete data (787 observations minus the 591 with missing speed_air). The entire dataset contains 787 unique observations.
These are basic observations. Further analysis can break down these statistics by carrier or aircraft size.
Initial Analysis for Identifying Important Factors that Impact Response Variables
Step 10
Calculate and visualize correlations between numeric variables and distance to identify strong predictors. In this step we compute the pairwise correlation between the landing distance and each factor X. Since we are not concerned with the correlation between the various X factors, we will manually create this correlation table.
The table below (called Table 1) ranks the factors based on correlation strength (indicated by the absolute value) from strongest to weakest.
X_Variable | Cor_with_Distance | Cor_Direction |
---|---|---|
speed_air | 0.945 | positive |
speed_ground | 0.932 | positive |
height | 0.086 | positive |
pitch | 0.037 | positive |
duration | 0.037 | positive |
no_pasg | -0.019 | negative |
The variables speed_air and speed_ground have the strongest correlation with the response variable distance. Both correlations are positive, meaning that an increase in either air or ground speed is associated with a longer landing distance. The number of passengers is negatively correlated with landing distance, but the correlation (-0.019) is so close to zero that no practical relationship should be inferred from it.
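A correlation table of this shape can be built by hand as sketched below. The data are simulated (hypothetical), and `use = "complete.obs"` is an assumption about how missing values were handled; the report does not state which option was used.

```r
set.seed(42)
n <- 100
# Simulated stand-in data: distance depends mostly on speed_ground
flights_toy <- data.frame(
  speed_ground = rnorm(n, 80, 18),
  height       = rnorm(n, 30, 10),
  no_pasg      = round(rnorm(n, 60, 7))
)
flights_toy$distance <- 40 * flights_toy$speed_ground +
  10 * flights_toy$height + rnorm(n, sd = 300)

# Correlation of each predictor with distance
predictors <- setdiff(names(flights_toy), "distance")
cors <- sapply(predictors, function(v) {
  cor(flights_toy[[v]], flights_toy$distance, use = "complete.obs")
})

# Assemble the table and rank by absolute correlation, strongest first
cor_table <- data.frame(
  X_Variable        = predictors,
  Cor_with_Distance = round(unname(cors), 3),
  Cor_Direction     = unname(ifelse(cors >= 0, "positive", "negative")),
  row.names         = NULL
)
cor_table <- cor_table[order(-abs(cor_table$Cor_with_Distance)), ]
cor_table
```

Ranking on the absolute value keeps a strong negative correlation from being sorted to the bottom of the table.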
Step 11
Create scatterplots to explore the relationships between distance and predictors like speed_ground, height, etc. We can also plot each variable against our dependent variable. These scatterplots are useful visual counterparts to the correlation table. It is clear that Ground Speed (speed_ground) and Air Speed (speed_air) are highly correlated with our response variable, distance: the data points show little scatter around a general upward trend. The other variables have no apparent pattern. It is interesting to note that the speed_ground scatterplot shows a slight curve, which potentially indicates a nonlinear relationship between this X variable and distance.
Step 12
Encode aircraft as a binary variable (1 for Boeing, 0 for Airbus) for modeling purposes. Above, our analysis only includes numeric variables. We can also include the aircraft variable by coding the character variable as 0/1 binary. If aircraft is equal to 1, it is a boeing aircraft. Otherwise, its make is airbus. After encoding this variable as a binary variable, we can repeat Steps 10 and 11 and include the variable in the table and scatterplots.
X_Variable | Cor_with_Distance | Cor_Direction |
---|---|---|
speed_air | 0.9452968 | positive |
speed_ground | 0.9317169 | positive |
aircraft | 0.1815418 | positive |
height | 0.0856803 | positive |
pitch | 0.0372342 | positive |
duration | 0.0367128 | positive |
no_pasg | -0.0187942 | negative |
The correlations in the table above (Table 1.1) can again be illustrated through paired scatterplots. The aircraft correlation plot is consistent with the positive correlation given in the table. Note that all other scatterplots are the same as shown in Step 11.
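The 0/1 encoding described in this step can be sketched as follows; the vector of makes is a hypothetical example, and the lowercase labels follow the values shown in the dataset structure.

```r
# Hypothetical aircraft makes, lowercase as in the dataset
aircraft_chr <- c("boeing", "airbus", "boeing", "airbus")

# 1 for Boeing, 0 for Airbus
aircraft_bin <- ifelse(aircraft_chr == "boeing", 1, 0)
aircraft_bin

# Cross-tabulate to confirm the mapping
table(aircraft_chr, aircraft_bin)
```

The cross-tabulation is a quick check that every make maps to exactly one code.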
Regression Using a Single Factor
Step 13
In Step 13, we perform individual regressions of the response variable Y on each predictor X. This results in seven separate linear models, one for each X variable in the dataset. To summarize these models, we create a table (Table 2 below) that ranks the predictors based on their significance in explaining the response variable. The table includes the name of each X variable, its significance level with respect to the response variable, and the direction of its coefficient, indicating whether the relationship is positive or negative.
X_Variable | P_Value | Dir_of_Reg_Coeff |
---|---|---|
speed_ground | 0.0000000 | positive |
speed_air | 0.0000000 | positive |
aircraft | 0.0000000 | positive |
height | 0.0003499 | positive |
pitch | 0.0607952 | positive |
duration | 0.1107031 | negative |
no_pasg | 0.5842035 | negative |
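A table like Table 2 could be produced by looping over the predictors and fitting one simple regression per variable. The sketch below uses simulated (hypothetical) data and assumed object names; it is not the report's original code.

```r
set.seed(7)
n <- 120
# Simulated stand-in data (hypothetical)
flights_toy <- data.frame(
  speed_ground = rnorm(n, 80, 18),
  height       = rnorm(n, 30, 10),
  no_pasg      = round(rnorm(n, 60, 7))
)
flights_toy$distance <- 40 * flights_toy$speed_ground +
  10 * flights_toy$height + rnorm(n, sd = 300)

# One simple regression per predictor: keep the slope's p-value and sign
predictors <- setdiff(names(flights_toy), "distance")
rows <- lapply(predictors, function(v) {
  fit      <- lm(reformulate(v, response = "distance"), data = flights_toy)
  coef_row <- summary(fit)$coefficients[v, ]
  data.frame(
    X_Variable       = v,
    P_Value          = unname(coef_row["Pr(>|t|)"]),
    Dir_of_Reg_Coeff = unname(ifelse(coef_row["Estimate"] >= 0,
                                     "positive", "negative")),
    row.names        = NULL
  )
})

# Rank from most to least significant
p_table <- do.call(rbind, rows)
p_table <- p_table[order(p_table$P_Value), ]
p_table
```

Each `lm()` call drops only the rows that are missing for its own variables, so models for different predictors may be fitted on different numbers of observations.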
Step 14
We can standardize each X variable by subtracting its mean and dividing by its standard deviation, creating a new standardized version of the variable. Standardization allows us to better compare the relative impact of different predictors on the response variable. When X variables are standardized, the interpretation changes: for every 1 standard deviation increase in X, the response variable Y is expected to change by the regression coefficient of the standardized variable, measured in the units of Y (feet), since Y itself is not standardized.
We now repeat Step 13 by regressing Y on each of the standardized X’ (X-Prime) variables. The resulting summary table shows by how many feet Y will increase (or decrease, if the coefficient is negative) for a one standard deviation increase in the corresponding X variable.
X_Variable | Size_of_Reg_Coeff | Dir_of_Reg_Coeff |
---|---|---|
speed_air_prime | 818.06654 | positive |
speed_ground_prime | 799.07204 | positive |
aircraft_prime | 208.81380 | positive |
height_prime | 117.01884 | positive |
pitch_prime | 61.54969 | positive |
duration_prime | 52.37631 | negative |
no_pasg_prime | 17.98305 | negative |
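The effect of standardizing a predictor can be checked numerically: the slope on the standardized variable equals the original slope multiplied by the standard deviation of X. The sketch below uses simulated (hypothetical) data and assumed variable names.

```r
set.seed(3)
n <- 150
# Simulated stand-in data (hypothetical)
speed    <- rnorm(n, mean = 100, sd = 10)
distance <- 80 * speed + rnorm(n, sd = 150)

# Standardize X only: subtract the mean, divide by the standard deviation
speed_prime <- (speed - mean(speed)) / sd(speed)

fit_raw   <- lm(distance ~ speed)
fit_prime <- lm(distance ~ speed_prime)

# These two quantities agree: change in distance per one SD of speed
coef(fit_raw)["speed"] * sd(speed)
coef(fit_prime)["speed_prime"]
```

Because the response is left in its original units, the standardized coefficient is read as a change in distance per standard deviation of the predictor.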
Step 15
The strength of the correlation is generally aligned with the significance of the regressor (indicated by its p-value). This is expected, as variables that are strongly correlated with the response variable are likely to have greater predictive power in the model. One exception in direction is duration, whose correlation is positive while its regression coefficient is negative; both are close to zero. It is also important to note that correlation does not imply causation. Table 0 highlights the factors ranked by their relative importance in predicting the landing distance.
X_Variable | Cor_with_Distance | Cor_Direction | P_Value | Dir_of_Reg_Coeff | Stdized_Coeff | Dir_of_Stdized_Coeff |
---|---|---|---|---|---|---|
speed_air | 0.9452968 | positive | 0.0000000 | positive | 818.06654 | positive |
aircraft | 0.1815418 | positive | 0.0000000 | positive | 208.81380 | positive |
height | 0.0856803 | positive | 0.0003499 | positive | 117.01884 | positive |
pitch | 0.0372342 | positive | 0.0607952 | positive | 61.54969 | positive |
duration | 0.0367128 | positive | 0.1107031 | negative | 52.37631 | negative |
no_pasg | -0.0187942 | negative | 0.5842035 | negative | 17.98305 | negative |
speed_ground | 0.9317169 | positive | 0.0000000 | positive | 799.07204 | positive |
Note that we put speed_ground last because in the next step (Step 16), we determine to drop this variable from the model.
Check Collinearity
Step 16
We are asked to consider the regression coefficients of the three models:
##
## Coefficients for Model 1 (distance ~ speed_ground):
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1810.05607 70.3035822 -25.74629 1.875900e-106
## speed_ground 42.09316 0.8590351 49.00052 5.312128e-241
##
## Coefficients for Model 2 (distance ~ speed_air):
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5568.45181 208.380651 -26.72250 6.126904e-67
## speed_air 80.74381 2.000504 40.36174 2.514701e-96
##
## Coefficients for Model 3 (distance ~ speed_ground + speed_air):
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5576.43359 208.65984 -26.7249967 8.845047e-67
## speed_ground -12.06878 13.28744 -0.9082846 3.648606e-01
## speed_air 92.88100 13.51181 6.8740600 8.410915e-11
##
## Correlation between speed_ground and speed_air:
## speed_air
## speed_ground 0.988969
##
## Variance Inflation Factor (VIF) for Model 3:
## speed_ground speed_air
## 45.57813 45.57813
##
## Correlation matrix for speed_ground, speed_air, and distance:
## speed_ground speed_air distance
## speed_ground 1.0000000 0.9889690 0.9317169
## speed_air 0.9889690 1.0000000 0.9452968
## distance 0.9317169 0.9452968 1.0000000
When comparing these three models, we observe that the coefficient for speed_ground undergoes significant changes in both magnitude and sign. Similarly, the coefficient for speed_air also changes, though to a lesser extent than speed_ground. Ideally, adding or removing covariates in a regression model should not cause substantial shifts in the values of regressor coefficients. When such drastic changes occur, it is a strong indication of multicollinearity among the predictors.
Let’s examine the collinearity between the two covariates. These variables are highly correlated, with a correlation coefficient of r = 0.989. To avoid multicollinearity, only one of these variables should be included in the final model. The variable with the higher Variance Inflation Factor (VIF) would typically be dropped.
In this case, both variables have the same VIF. Therefore, we can choose to drop the variable that is less correlated with the response variable. Since speed_ground is slightly less correlated with distance compared to speed_air (though the difference is minimal), we decide to exclude speed_ground from the model.
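With exactly two predictors, the VIF of each is 1 / (1 - r^2), where r is the correlation between them, which is why the two values are identical. The sketch below reproduces the reported VIF from the reported correlation and then checks the same identity on simulated (hypothetical) data.

```r
# VIF implied by the reported correlation between speed_ground and speed_air
r_speed    <- 0.988969
vif_from_r <- 1 / (1 - r_speed^2)
vif_from_r

# Same identity on simulated data: regress each predictor on the other
set.seed(11)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.2)   # x2 is nearly a copy of x1
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2))$r.squared)
vif_x2 <- 1 / (1 - summary(lm(x2 ~ x1))$r.squared)
c(vif_x1 = vif_x1, vif_x2 = vif_x2)
```

Since the VIFs cannot break the tie in the two-predictor case, a second criterion, such as correlation with the response, is needed to choose which variable to drop.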
Variable Selection based on our ranking in Table 0.
Step 17
We will now fit a sequence of nested models, adding covariates in the order of Table 0 so that each model has one more covariate than the preceding model. The R-squared of each model is:
cov | rsqd |
---|---|
1 | 0.8936 |
2 | 0.9513 |
3 | 0.9751 |
4 | 0.9751 |
5 | 0.9752 |
6 | 0.9754 |
7 | 0.9754 |
Step 18
Let us plot the number of covariates against Adjusted R-Squared, a statistic that accounts for the number of predictors in the model by penalizing models with an excessive number of covariates.
Adding covariates to a multiple linear regression model never decreases the R-squared value, so R-squared alone always favors the largest model. Adjusted R-Squared provides a more reliable measure by balancing model complexity and goodness-of-fit.
cov | adj.rsqd |
---|---|
1 | 0.8930 |
2 | 0.9507 |
3 | 0.9747 |
4 | 0.9746 |
5 | 0.9745 |
6 | 0.9746 |
7 | 0.9745 |
Step 19
Finally, let us plot the number of covariates against another evaluation metric, AIC (Akaike Information Criterion). A lower AIC value indicates a better trade-off between goodness-of-fit and model complexity.
cov | aic.fig |
---|---|
1 | 2,773.274 |
2 | 2,622.261 |
3 | 2,492.778 |
4 | 2,494.348 |
5 | 2,495.985 |
6 | 2,496.322 |
7 | 2,498.048 |
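The three tables in Steps 17 to 19 can be generated in one pass over a sequence of nested models. The sketch below uses simulated (hypothetical) data with coefficients loosely inspired by the report, and the variable ordering is an assumption standing in for the ranking in Table 0.

```r
set.seed(5)
n <- 200
# Simulated stand-in data (hypothetical); pitch is pure noise here
flights_toy <- data.frame(
  speed_air = rnorm(n, 104, 10),
  aircraft  = rbinom(n, 1, 0.5),
  height    = rnorm(n, 30, 10),
  pitch     = rnorm(n, 4, 0.5)
)
flights_toy$distance <- -6500 + 83 * flights_toy$speed_air +
  440 * flights_toy$aircraft + 14 * flights_toy$height + rnorm(n, sd = 140)

# Add one covariate at a time, in ranked order, and record the fit statistics
ordered_vars <- c("speed_air", "aircraft", "height", "pitch")
fit_stats <- do.call(rbind, lapply(seq_along(ordered_vars), function(k) {
  fit <- lm(reformulate(ordered_vars[1:k], response = "distance"),
            data = flights_toy)
  s <- summary(fit)
  data.frame(cov = k, rsqd = s$r.squared,
             adj.rsqd = s$adj.r.squared, aic = AIC(fit))
}))
fit_stats
```

For the comparison to be fair, every model in the sequence should be fitted on the same rows; with missing values in some covariates, restricting the data to complete cases first is one way to ensure that.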
Step 20
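Adjusted R-Squared and AIC both identify Model 3 as the best predictive model for landing distance (LD): Model 3 achieves the highest Adjusted R-Squared (0.9747) and the lowest AIC (2,492.778) among the models reviewed. R-Squared continues to rise slightly as covariates are added (from 0.9751 for Model 3 to 0.9754 for Model 7), but this is expected because R-Squared never decreases when a covariate is added, and the gain is negligible.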
Based on this analysis, the three variables selected for building a predictive model for LD are speed_air, aircraft, and height.
##
## Call:
## lm(formula = distance ~ speed_air + aircraft + height, data = FAA_merge)
##
## Residuals:
## Min 1Q Median 3Q Max
## -304.14 -93.22 14.88 91.95 425.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6484.5380 110.1352 -58.88 <2e-16 ***
## speed_air 82.8340 0.9769 84.79 <2e-16 ***
## aircraft 439.0176 20.2013 21.73 <2e-16 ***
## height 14.2244 1.0500 13.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 137.7 on 192 degrees of freedom
## (591 observations deleted due to missingness)
## Multiple R-squared: 0.9751, Adjusted R-squared: 0.9747
## F-statistic: 2504 on 3 and 192 DF, p-value: < 2.2e-16
Variable Selection based on Automatic Algorithm
Step 21
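We can use the R function stepAIC (from the MASS package) to perform stepwise variable selection and automatically select the best model. As the trace below shows, the search starts from the full model and removes one variable at a time, i.e. backward elimination.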
## Start: AIC=1939.82
## distance ~ speed_air + aircraft + height + pitch + duration +
## no_pasg + speed_ground
##
## Df Sum of Sq RSS AIC
## - duration 1 3427 3593003 1938.0
## - speed_ground 1 5011 3594587 1938.1
## - pitch 1 9376 3598952 1938.3
## - no_pasg 1 30384 3619959 1939.5
## <none> 3589575 1939.8
## - speed_air 1 3161032 6750608 2061.6
## - height 1 3410044 6999620 2068.7
## - aircraft 1 7904682 11494257 2165.9
##
## Step: AIC=1938.01
## distance ~ speed_air + aircraft + height + pitch + no_pasg +
## speed_ground
##
## Df Sum of Sq RSS AIC
## - speed_ground 1 6293 3599296 1936.3
## - pitch 1 10041 3603043 1936.6
## - no_pasg 1 32056 3625059 1937.8
## <none> 3593003 1938.0
## - speed_air 1 3250783 6843786 2062.3
## - height 1 3434796 7027798 2067.5
## - aircraft 1 7901833 11494836 2163.9
##
## Step: AIC=1936.35
## distance ~ speed_air + aircraft + height + pitch + no_pasg
##
## Df Sum of Sq RSS AIC
## - pitch 1 8583 3607879 1934.8
## - no_pasg 1 32652 3631948 1936.1
## <none> 3599296 1936.3
## - height 1 3469592 7068888 2066.7
## - aircraft 1 7897070 11496366 2162.0
## - speed_air 1 136189067 139788363 2651.6
##
## Step: AIC=1934.82
## distance ~ speed_air + aircraft + height + no_pasg
##
## Df Sum of Sq RSS AIC
## - no_pasg 1 32041 3639920 1934.5
## <none> 3607879 1934.8
## - height 1 3476420 7084299 2065.1
## - aircraft 1 8871841 12479720 2176.1
## - speed_air 1 136312487 139920365 2649.8
##
## Step: AIC=1934.55
## distance ~ speed_air + aircraft + height
##
## Df Sum of Sq RSS AIC
## <none> 3639920 1934.5
## - height 1 3479278 7119197 2064.0
## - aircraft 1 8953580 12593499 2175.8
## - speed_air 1 136291471 139931391 2647.8
##
## Call:
## lm(formula = distance ~ speed_air + aircraft + height, data = FAA_merge)
##
## Coefficients:
## (Intercept) speed_air aircraft height
## -6484.54 82.83 439.02 14.22
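The stepAIC function confirms that the selected model remains unchanged. The stepwise selection process identifies the same model, with an AIC value of 1934.55. This value differs from the 2,492.778 reported in Step 19 because stepAIC relies on extractAIC(), which omits additive constants that AIC() includes; for a fixed set of observations the ranking of models is the same.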
distance ~ speed_air + aircraft + height
This result reaffirms the appropriateness of these variables (speed_air, aircraft, and height) for predicting landing distance.
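A minimal sketch of backward elimination follows. It uses base R's `step()`, which performs the same AIC-based search for `lm` fits and is called in the same way as `MASS::stepAIC()` for this purpose; the data are simulated (hypothetical). The last lines check the constant offset between `AIC()` and the `extractAIC()` value that the stepwise trace prints.

```r
set.seed(9)
n <- 200
# Simulated stand-in data (hypothetical); no_pasg has no real effect here
flights_toy <- data.frame(
  speed_air = rnorm(n, 104, 10),
  height    = rnorm(n, 30, 10),
  no_pasg   = round(rnorm(n, 60, 7))
)
flights_toy$distance <- -6500 + 83 * flights_toy$speed_air +
  14 * flights_toy$height + rnorm(n, sd = 140)

# Start from the full model and let step() remove terms while AIC improves
full_fit <- lm(distance ~ speed_air + height + no_pasg, data = flights_toy)
selected <- step(full_fit, direction = "backward", trace = 0)
formula(selected)

# AIC() and extractAIC() differ by a constant that depends only on n
aic_gap <- AIC(selected) - extractAIC(selected)[2]
aic_gap
n * (log(2 * pi) + 1) + 2
```

Stepwise selection requires every candidate model to be fitted on the same rows, so data with missing values should be reduced to complete cases before the search.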