Packages Used

library(readxl)   
library(tidyverse) 
library(knitr)     
library(car)
library(MASS)

Variable Descriptions

Aircraft: The make of an aircraft (Boeing or Airbus).
Duration (in minutes): Flight duration between taking off and landing. The duration of a normal flight should always be greater than 40min.
No_pasg: The number of passengers in a flight.
Speed_ground (in miles per hour): The ground speed of an aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal.
Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal.
Height (in meters): The height of an aircraft when it is passing over the threshold of the runway. The landing aircraft is required to be at least 6 meters high at the threshold of the runway.
Pitch (in degrees): Pitch angle of an aircraft when it is passing over the threshold of the runway.
Distance (in feet): The landing distance of an aircraft. More specifically, it refers to the distance between the threshold of the runway and the point where the aircraft can be fully stopped. The length of the airport runway is typically less than 6000 feet.

Step 1

Load necessary libraries such as dplyr, ggplot2, and import the FAA datasets (FAA1 and FAA2) for analysis.

knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
setwd("C:/Users/renju/OneDrive/Documents/MSBANA/Spring/Spring Term 1/Statistical Modelling - 7042/Week 1  A Review of Linear Regression Models")
# Read in Dataset 1
FAA1 <- read_xls("FAA1.xls")
# Read in Dataset 2
FAA2 <- read_xls("FAA2.xls")

Step 2

Examine the structure, dimensions, and summary statistics of the FAA1 and FAA2 datasets to gain an understanding of the variables and their data types. Starting with the dataset structure is an effective way to explore the datasets. For instance, the FAA1 dataset contains 800 observations and 8 variables, as shown below. This provides a clear overview of the data’s dimensions, layout, and basic properties.

## tibble [800 × 8] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : chr [1:800] "boeing" "boeing" "boeing" "boeing" ...
##  $ duration    : num [1:800] 98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num [1:800] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:800] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:800] 109 103 NA NA NA ...
##  $ height      : num [1:800] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:800] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:800] 3370 2988 1145 1664 1050 ...

## tibble [150 × 7] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : chr [1:150] "boeing" "boeing" "boeing" "boeing" ...
##  $ no_pasg     : num [1:150] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:150] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:150] 109 103 NA NA NA ...
##  $ height      : num [1:150] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:150] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:150] 3370 2988 1145 1664 1050 ...

The FAA2 dataset consists of 150 observations and 7 variables, as shown above.

Key differences between the two datasets (FAA1 and FAA2):
Additional Variable:
FAA1 includes an extra variable, duration, which is not present in FAA2. All other variables are identical.
Number of Observations:
FAA2 has significantly fewer observations than FAA1 (150 versus 800).
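A column comparison like the one described above can be done programmatically. The sketch below uses two one-row stand-in data frames (faa1_toy and faa2_toy are hypothetical names that only mimic the column layout of FAA1 and FAA2) and base R set operations.

```r
# Hypothetical stand-ins that only mimic the column layout of FAA1 and FAA2.
faa1_toy <- data.frame(aircraft = "boeing", duration = 98.5, no_pasg = 53,
                       speed_ground = 107.9, speed_air = 109, height = 27.4,
                       pitch = 4.04, distance = 3370)
faa2_toy <- faa1_toy[, names(faa1_toy) != "duration"]

# Columns present in one data frame but not the other, and the shared ones.
only_in_first  <- setdiff(names(faa1_toy), names(faa2_toy))
only_in_second <- setdiff(names(faa2_toy), names(faa1_toy))
shared_columns <- intersect(names(faa1_toy), names(faa2_toy))

print(only_in_first)
print(length(shared_columns))
```

The shared columns are the natural candidates for join keys in the next step.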

Step 3

In this step, we merge the FAA1 and FAA2 datasets using left_join(). A left join retains every row of FAA1, which carries the extra duration variable and most of the data points. FAA2 rows that match an FAA1 row on all common columns add no new rows, so duplicates across the two files do not inflate the result. Note, however, that a left join drops any FAA2 rows that have no match in FAA1.

The merge is conducted by joining on the common columns: aircraft, no_pasg, speed_ground, speed_air, height, pitch, and distance. After merging, we review the structure and overview of the resulting dataset to verify the operation’s success.

## Rows: 800
## Columns: 8
## $ aircraft     <chr> "boeing", "boeing", "boeing", "boeing", "boeing", "boeing…
## $ duration     <dbl> 98.47909, 125.73330, 112.01700, 196.82569, 90.09538, 137.…
## $ no_pasg      <dbl> 53, 69, 61, 56, 70, 55, 54, 57, 61, 56, 61, 54, 54, 58, 6…
## $ speed_ground <dbl> 107.91568, 101.65559, 71.05196, 85.81333, 59.88853, 75.01…
## $ speed_air    <dbl> 109.32838, 102.85141, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ height       <dbl> 27.41892, 27.80472, 18.58939, 30.74460, 32.39769, 41.2149…
## $ pitch        <dbl> 4.043515, 4.117432, 4.434043, 3.884236, 4.026096, 4.20385…
## $ distance     <dbl> 3369.8364, 2987.8039, 1144.9224, 1664.2182, 1050.2645, 16…

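The merged dataset contains 800 observations, the same number as FAA1, which is what a left join from FAA1 produces as long as no FAA1 row matches more than one FAA2 row. This row count alone does not reveal how many of the 150 FAA2 observations duplicate FAA1 rows, and any FAA2 observations that are not already in FAA1 are not part of the merged dataset. Retaining them would require a full join (or row-binding the two datasets and removing duplicates) instead.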
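To make the difference between join types concrete, here is a minimal sketch on toy data (the data frames big and small are hypothetical and are not the FAA files). It uses base R merge(), where all.x = TRUE behaves like a left join and all = TRUE like a full join; this is a stand-in for the dplyr calls, not the code used in the report.

```r
# Toy data, not the FAA files: 'big' plays the role of FAA1, 'small' of FAA2.
big   <- data.frame(id = c(1, 2, 3, 4), speed = c(100, 90, 80, 70),
                    duration = c(120, 95, 150, 200))
small <- data.frame(id = c(3, 4, 5), speed = c(80, 70, 60))

# Left join: keep every row of 'big'; unmatched rows of 'small' are dropped.
left_merged <- merge(big, small, by = c("id", "speed"), all.x = TRUE)

# Full join: keep rows from both sides; 'duration' is NA for rows only in 'small'.
full_merged <- merge(big, small, by = c("id", "speed"), all = TRUE)

# Inner join: rows of 'small' that also occur in 'big' (the duplicates).
n_duplicates <- nrow(merge(big, small, by = c("id", "speed")))

cat("left join rows:", nrow(left_merged), "\n")
cat("full join rows:", nrow(full_merged), "\n")
cat("duplicates:", n_duplicates, "\n")
```

In this toy case the left join returns as many rows as big regardless of how many rows overlap, while the full join and the inner join together show how many rows are shared and how many are new.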

Step 4

Verify the combined dataset’s structure, dimensions, and variable consistency to ensure a successful merge. As in Step 2, we check the structure of the combined dataset and then compute summary statistics for each variable.

## tibble [800 × 8] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : chr [1:800] "boeing" "boeing" "boeing" "boeing" ...
##  $ duration    : num [1:800] 98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num [1:800] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:800] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:800] 109 103 NA NA NA ...
##  $ height      : num [1:800] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:800] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:800] 3370 2988 1145 1664 1050 ...

##    aircraft            duration         no_pasg       speed_ground   
##  Length:800         Min.   : 14.76   Min.   :29.00   Min.   : 27.74  
##  Class :character   1st Qu.:119.49   1st Qu.:55.00   1st Qu.: 65.87  
##  Mode  :character   Median :153.95   Median :60.00   Median : 79.64  
##                     Mean   :154.01   Mean   :60.13   Mean   : 79.54  
##                     3rd Qu.:188.91   3rd Qu.:65.00   3rd Qu.: 92.33  
##                     Max.   :305.62   Max.   :87.00   Max.   :141.22  
##                                                                      
##    speed_air          height           pitch          distance      
##  Min.   : 90.00   Min.   :-3.546   Min.   :2.284   Min.   :  34.08  
##  1st Qu.: 96.16   1st Qu.:23.338   1st Qu.:3.658   1st Qu.: 900.95  
##  Median :100.99   Median :30.147   Median :4.020   Median :1267.44  
##  Mean   :103.83   Mean   :30.122   Mean   :4.018   Mean   :1544.52  
##  3rd Qu.:109.48   3rd Qu.:36.981   3rd Qu.:4.388   3rd Qu.:1960.44  
##  Max.   :141.72   Max.   :59.946   Max.   :5.927   Max.   :6533.05  
##  NA's   :600

The combined dataset has 800 observations and 8 variables, 7 of which are numeric. The summary statistics for each variable are shown above.

Step 5

Highlight potential anomalies in the dataset, including extreme distance values, negative height, and invalid speed_air observations. Key findings from the initial analysis are summarized below:

Landing Distance Anomalies:
Most aircraft land within 6,000 feet, but some observations show distance values exceeding this threshold, suggesting potential runway overruns.
Negative Heights:
The minimum value of height is negative, which is unrealistic as the aircraft’s height at the runway threshold should be at least 6 meters. This indicates potential data input errors.
Missing Air Speed Values:
The speed_air variable contains 600 missing values, which must be addressed during the analysis to ensure accurate results.
Abnormally Short Flight Durations:
According to the Variable Dictionary, a normal flight duration should exceed 40 minutes. However, the dataset includes an observation with a duration of 14.76 minutes, which is inconsistent with expectations.
Abnormal Speed Values:
The Variable Dictionary specifies that ground speed (speed_ground) and air speed (speed_air) should range between 30 and 140 MPH for normal operations. The summary statistics reveal observations with speeds outside this range, indicating abnormal flights.

These anomalies highlight the need for data cleaning to ensure the integrity of subsequent analysis.

Data Cleaning and Further Exploration

Step 6

Filter out observations with negative height, short duration (40 minutes or less), and abnormal speed (speed_ground or speed_air below 30 MPH or above 140 MPH).

As noted in Step 5, there are multiple abnormal values in the dataset. A value is “abnormal” if it violates the limits given in the Variable Descriptions above.

Height

Negative values of height are the clearest sign of abnormal records. The counts below show how many observations violate each rule:

## Number of observations with negative height: 5

## Number of observations with flight duration <= 40 minutes: 5

## Number of observations with abnormal air or ground speed: 3

## Total number of observations removed after filtering: 13

There were 5 observations where height was negative. A negative height is physically impossible for an aircraft crossing the runway threshold, so these observations are treated as data errors.

Duration

If the duration of the flight is less than or equal to 40 minutes, the flight is considered abnormal. Let us filter for these observations to count how many exist.

There were 5 observations where duration was less than or equal to 40 minutes.

Ground Speed and Air Speed

Abnormal flights are also defined as those with speed_air and/or speed_ground less than 30 MPH or greater than 140 MPH.

There were 3 observations where speed_air or speed_ground was less than 30 MPH or greater than 140 MPH.

Filtering out identified abnormal observations

Let us filter out these observations identified above and create a new working dataset.

Thirteen total observations have been deemed abnormal and have consequently been removed.

Note: We have retained observations with missing values.
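The filtering logic, including the retention of rows with missing speed_air, can be sketched as follows. The flights data frame and its values are hypothetical; the thresholds are the ones stated in the Variable Descriptions. The key detail is that a comparison against NA yields NA, so missing air speeds have to be allowed explicitly or those rows would not be kept.

```r
# Toy flights (hypothetical values) covering each rule from the data dictionary.
flights <- data.frame(
  duration     = c(120, 35, 150, 90, 200),
  speed_ground = c(80, 85, 145, 70, 95),
  speed_air    = c(NA, NA, 146, NA, 101),
  height       = c(30, 25, 28, -2, 33)
)

# A comparison with NA yields NA, so missing speed_air must be allowed explicitly.
speed_air_ok <- is.na(flights$speed_air) |
  (flights$speed_air >= 30 & flights$speed_air <= 140)

# Keep only rows that pass every rule.
keep <- flights$height >= 0 &
  flights$duration > 40 &
  flights$speed_ground >= 30 & flights$speed_ground <= 140 &
  speed_air_ok

cleaned <- flights[keep, ]
cat("removed:", sum(!keep), " kept:", nrow(cleaned), "\n")
```

Building one logical vector per rule also makes it easy to count violations separately, as in the per-rule counts reported above.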

Step 7

Re-evaluate the structure and summary statistics of the cleaned dataset to confirm the removal of abnormal values.

In step 7, we repeat step 4 but with the dataset that now only contains normal flights. We therefore check the structure of the combined dataset as well as provide summary statistics.

## tibble [787 × 8] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : chr [1:787] "boeing" "boeing" "boeing" "boeing" ...
##  $ duration    : num [1:787] 98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num [1:787] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:787] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:787] 109 103 NA NA NA ...
##  $ height      : num [1:787] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:787] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:787] 3370 2988 1145 1664 1050 ...

##    aircraft            duration         no_pasg       speed_ground   
##  Length:787         Min.   : 41.95   Min.   :29.00   Min.   : 33.57  
##  Class :character   1st Qu.:119.68   1st Qu.:55.00   1st Qu.: 66.01  
##  Mode  :character   Median :154.24   Median :60.00   Median : 79.73  
##                     Mean   :154.80   Mean   :60.13   Mean   : 79.61  
##                     3rd Qu.:189.62   3rd Qu.:65.00   3rd Qu.: 92.13  
##                     Max.   :305.62   Max.   :87.00   Max.   :136.66  
##                                                                      
##    speed_air          height             pitch          distance      
##  Min.   : 90.00   Min.   : 0.08611   Min.   :2.284   Min.   :  41.72  
##  1st Qu.: 96.16   1st Qu.:23.46558   1st Qu.:3.654   1st Qu.: 902.82  
##  Median :100.99   Median :30.16708   Median :4.014   Median :1267.76  
##  Mean   :103.67   Mean   :30.29378   Mean   :4.015   Mean   :1541.02  
##  3rd Qu.:109.48   3rd Qu.:36.98364   3rd Qu.:4.385   3rd Qu.:1960.41  
##  Max.   :136.42   Max.   :59.94596   Max.   :5.927   Max.   :6309.95  
##  NA's   :591

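There are 787 observations and 8 variables in this dataset. The variable speed_air has 591 missing values. The minimum height is now 0.086 meters: only negative heights were removed, so observations below the 6-meter requirement from the Variable Descriptions remain in the data.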

Step 8

Generate histograms for duration, speed_air, height, and other numeric variables to visualize their distributions.

Let us create histograms for our 7 numeric variables. These histograms will provide insight regarding the distribution and variability of the observations within each variable. See Step 9 for conclusions regarding these histograms.

Step 9

Derive insights such as skewness, peaks, or missing value patterns from the histograms for numeric variables. Another “presentation slide” one might want to present to the FAA agents would include the following points:

Because the dataset no longer includes flights deemed “abnormal,” we can say that a normal flight’s ground speed averages about 80 MPH and is distributed roughly symmetrically. Air speed behaves differently: it centers around 100 MPH, has no recorded values below 90 MPH, and is strongly skewed to the right (a long upper tail that may contain outliers).

The landing distance is also highly skewed to the right; however, very few landings exceed 6,000 feet.

Apart from speed_air and distance, the variables are approximately symmetric. Not only do the histograms show this, but the mean and median (listed in the summary statistics) for each of these variables are similar.
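The mean-versus-median comparison used above can be turned into a quick numeric check. The sketch below runs on simulated data (symmetric_x and skewed_x are made-up stand-ins, not the FAA variables) and scales the gap by the standard deviation so that variables with different units are comparable; mean_median_gap is a hypothetical helper name.

```r
set.seed(1)
# Simulated stand-ins: one roughly symmetric variable, one right-skewed variable.
symmetric_x <- rnorm(500, mean = 80, sd = 15)
skewed_x    <- 90 + rexp(500, rate = 1 / 12)

# Gap between mean and median in standard deviation units; near 0 suggests
# symmetry, clearly positive suggests a right skew.
mean_median_gap <- function(x) {
  x <- x[!is.na(x)]
  (mean(x) - median(x)) / sd(x)
}

gaps <- c(symmetric = mean_median_gap(symmetric_x),
          skewed    = mean_median_gap(skewed_x))
print(round(gaps, 3))
```

This is only a rough screening statistic; the histograms remain the primary evidence for the shape of each distribution.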

There are 196 observations with complete data (787 total minus the 591 with missing speed_air). The entire dataset contains 787 unique observations.

These are basic observations. Further analysis can break down these statistics by carrier or aircraft size.

Initial Analysis for Identifying Important Factors that Impact the Response Variable

Step 10

Calculate and visualize correlations between numeric variables and distance to identify strong predictors. In this step we compute the pairwise correlation between the landing distance and each factor X. Since we are not concerned with the correlation between the various X factors, we will manually create this correlation table.

The table below (called Table 1) ranks the factors based on correlation strength (indicated by the absolute value) from strongest to weakest.

Table 1
X_Variable	Cor_with_Distance	Cor_Direction
speed_air	0.945	positive
speed_ground	0.932	positive
height	0.086	positive
pitch	0.037	positive
duration	0.037	positive
no_pasg	-0.019	negative

The variables speed_air and speed_ground have the strongest correlation with the response variable distance. Both correlations are positive, meaning that higher air or ground speed is associated with a longer landing distance. The number of passengers has a very weak negative correlation with landing distance (-0.019), which is too close to zero to indicate a meaningful relationship.
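A table like Table 1 can be assembled by looping over the predictors. The sketch below is one possible approach on simulated data: the toy data frame and its coefficients are invented, and only the column names echo the report. Note that the choice of the use argument in cor() determines which rows enter each correlation when values are missing.

```r
set.seed(42)
n <- 200
# Simulated data with made-up coefficients; only the column names echo the report.
toy <- data.frame(speed_ground = rnorm(n, 80, 19),
                  height       = rnorm(n, 30, 10),
                  no_pasg      = round(rnorm(n, 60, 7)))
toy$distance <- 40 * toy$speed_ground + 14 * toy$height + rnorm(n, 0, 300)

predictors <- setdiff(names(toy), "distance")

# Correlation of each predictor with the response, skipping missing pairs.
cors <- vapply(predictors, function(v) {
  cor(toy[[v]], toy$distance, use = "pairwise.complete.obs")
}, numeric(1), USE.NAMES = FALSE)

cor_table <- data.frame(X_Variable        = predictors,
                        Cor_with_Distance = round(cors, 3),
                        Cor_Direction     = ifelse(cors > 0, "positive", "negative"))

# Rank by strength (absolute value), strongest first.
cor_table <- cor_table[order(-abs(cor_table$Cor_with_Distance)), ]
print(cor_table)
```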

Step 11

Create scatterplots to explore the relationships between distance and predictors like speed_ground, height, etc. We can plot each X variable against our dependent variable; these scatterplots are visual counterparts to the correlation table. Ground speed (speed_ground) and air speed (speed_air) are clearly highly correlated with the response variable, distance: the data points scatter tightly around a general upward trend. The other variables show no apparent pattern. Interestingly, the speed_ground scatterplot has a slight curve, which potentially indicates a nonlinear relationship between speed_ground and distance.

Step 12

Encode aircraft as a binary variable (1 for Boeing, 0 for Airbus) for modeling purposes. The analysis above includes only numeric variables. We can also include the aircraft variable by recoding the character variable as a 0/1 binary: if aircraft is equal to 1, the make is Boeing; if it is 0, the make is Airbus. After encoding, we repeat Steps 10 and 11 and include the variable in the table and scatterplots.
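The encoding itself is a one-liner. A minimal sketch on a hypothetical sample of the aircraft column (aircraft_chr and aircraft_bin are illustrative names):

```r
# Hypothetical sample of the aircraft column.
aircraft_chr <- c("boeing", "airbus", "boeing", "airbus", "airbus")

# 1 for Boeing, 0 for Airbus, as described in the text.
aircraft_bin <- ifelse(aircraft_chr == "boeing", 1, 0)

# Cross-tabulate old and new coding to confirm nothing was miscoded.
print(table(original = aircraft_chr, encoded = aircraft_bin))
```

The cross-tabulation is a cheap safeguard: any unexpected spelling of the make would show up as an extra row.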

Table 1.1
X_Variable	Cor_with_Distance	Cor_Direction
speed_air	0.9452968	positive
speed_ground	0.9317169	positive
aircraft	0.1815418	positive
height	0.0856803	positive
pitch	0.0372342	positive
duration	0.0367128	positive
no_pasg	-0.0187942	negative

We can illustrate again these correlations in Table 1.1 through various paired scatterplots. The aircraft correlation plot is consistent with the positive correlation given in the table. Note that all other scatterplots are the same as shown in step 11.

Regression Using a Single Factor

Step 13

In Step 13, we regress the response variable Y (distance) on each predictor X individually. This results in seven separate linear models, one for each X variable in the dataset. To summarize these models, we create a table (Table 2 below) that ranks the predictors by p-value, from most to least significant. The table includes the name of each X variable, the p-value of its regression coefficient, and the direction of that coefficient, indicating whether the relationship is positive or negative.

Table 2
X_Variable	P_Value	Dir_of_Reg_Coeff
speed_ground	0.0000000	positive
speed_air	0.0000000	positive
aircraft	0.0000000	positive
height	0.0003499	positive
pitch	0.0607952	positive
duration	0.1107031	negative
no_pasg	0.5842035	negative
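The per-predictor regressions behind a table like Table 2 can be sketched as a loop over lm() fits. The data here are simulated with invented coefficients, so the numbers will not match the report; reformulate() builds the formula distance ~ x from a column name.

```r
set.seed(7)
n <- 150
# Simulated data; coefficients are invented for illustration only.
toy <- data.frame(speed_ground = rnorm(n, 80, 19),
                  height       = rnorm(n, 30, 10),
                  no_pasg      = round(rnorm(n, 60, 7)))
toy$distance <- 40 * toy$speed_ground + 14 * toy$height + rnorm(n, 0, 300)

predictors <- setdiff(names(toy), "distance")

# Fit distance ~ x for each predictor and keep the slope's sign and p-value.
rows <- lapply(predictors, function(v) {
  fit   <- lm(reformulate(v, response = "distance"), data = toy)
  coefs <- summary(fit)$coefficients
  data.frame(X_Variable       = v,
             P_Value          = coefs[v, "Pr(>|t|)"],
             Dir_of_Reg_Coeff = ifelse(coefs[v, "Estimate"] > 0,
                                       "positive", "negative"))
})
single_factor <- do.call(rbind, rows)

# Most significant predictor first.
single_factor <- single_factor[order(single_factor$P_Value), ]
print(single_factor)
```

Because lm() silently drops rows with missing values, each model may be fitted on a different number of rows when a predictor contains NAs.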

Step 14

We can standardize each X variable by subtracting its mean and dividing by its standard deviation, creating a new standardized version of the variable. Standardization puts the predictors on a common scale, which lets us compare their relative impact on the response variable. The interpretation of the coefficient changes accordingly: for every 1 standard deviation increase in the original X variable (a 1-unit increase in the standardized variable), the response variable Y is expected to change by the regression coefficient, measured in the units of Y (feet). Y itself is not standardized here.

We now repeat Step 13 by regressing Y on each of the standardized X’ (X-Prime) variables. The resulting summary table shows by how many feet the landing distance is expected to increase (or decrease, if the coefficient is negative) for a one standard deviation increase in the corresponding X variable.

Table 3
X_Variable	Size_of_Reg_Coeff	Dir_of_Reg_Coeff
speed_air_prime	818.06654	positive
speed_ground_prime	799.07204	positive
aircraft_prime	208.81380	positive
height_prime	117.01884	positive
pitch_prime	61.54969	positive
duration_prime	52.37631	negative
no_pasg_prime	17.98305	negative
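The relationship between the raw and standardized coefficients can be checked directly: when only X is standardized, the new slope equals the raw slope multiplied by the standard deviation of X, and it stays in the units of Y. A minimal sketch on simulated values (x and y are hypothetical, not the FAA columns):

```r
set.seed(3)
n <- 120
# Simulated predictor and response (hypothetical values).
x <- rnorm(n, mean = 100, sd = 10)
y <- 80 * x + rnorm(n, 0, 200)

# Standardize X only: subtract the mean, divide by the standard deviation.
x_prime <- (x - mean(x)) / sd(x)

slope_raw   <- unname(coef(lm(y ~ x))["x"])
slope_prime <- unname(coef(lm(y ~ x_prime))["x_prime"])

# The standardized slope is the raw slope times sd(x), still in units of y.
cat("raw slope:", slope_raw, "\n")
cat("standardized slope:", slope_prime, "\n")
cat("raw slope * sd(x):", slope_raw * sd(x), "\n")
```

The p-value and R-squared of the fit are unchanged by this rescaling; only the size of the coefficient changes.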

Step 15

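The strength of the correlation is generally aligned with the significance of the regressor (indicated by its p-value). This is expected: in a single-factor regression, the p-value of the slope is determined by the correlation and the sample size. One discrepancy stands out: duration has a positive correlation but a negative regression coefficient. In a single-factor regression fitted to the same rows these signs must agree, so the correlations and the regressions were presumably computed on different subsets of the data (for example, only the rows with non-missing speed_air versus all available rows). Note also that correlation does not imply causation. Table 0 combines the three tables, with the factors ranked by their relative importance in predicting the landing distance.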

Table 0
X_Variable	Cor_with_Distance	Cor_Direction	P_Value	Dir_of_Reg_Coeff	Stdized_Coeff	Dir_of_Stdized_Coeff
speed_air	0.9452968	positive	0.0000000	positive	818.06654	positive
aircraft	0.1815418	positive	0.0000000	positive	208.81380	positive
height	0.0856803	positive	0.0003499	positive	117.01884	positive
pitch	0.0372342	positive	0.0607952	positive	61.54969	positive
duration	0.0367128	positive	0.1107031	negative	52.37631	negative
no_pasg	-0.0187942	negative	0.5842035	negative	17.98305	negative
speed_ground	0.9317169	positive	0.0000000	positive	799.07204	positive

Note that speed_ground is listed last because in the next step (Step 16) we decide to drop this variable from the model.

Check Collinearity

Step 16

We are asked to consider the regression coefficients of the three models:

## 
## Coefficients for Model 1 (distance ~ speed_ground):

##                 Estimate Std. Error   t value      Pr(>|t|)
## (Intercept)  -1810.05607 70.3035822 -25.74629 1.875900e-106
## speed_ground    42.09316  0.8590351  49.00052 5.312128e-241

## 
## Coefficients for Model 2 (distance ~ speed_air):

##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) -5568.45181 208.380651 -26.72250 6.126904e-67
## speed_air      80.74381   2.000504  40.36174 2.514701e-96

## 
## Coefficients for Model 3 (distance ~ speed_ground + speed_air):

##                 Estimate Std. Error     t value     Pr(>|t|)
## (Intercept)  -5576.43359  208.65984 -26.7249967 8.845047e-67
## speed_ground   -12.06878   13.28744  -0.9082846 3.648606e-01
## speed_air       92.88100   13.51181   6.8740600 8.410915e-11

## 
## Correlation between speed_ground and speed_air:

##              speed_air
## speed_ground  0.988969

## 
## Variance Inflation Factor (VIF) for Model 3:

## speed_ground    speed_air 
##     45.57813     45.57813

## 
## Correlation matrix for speed_ground, speed_air, and distance:

##              speed_ground speed_air  distance
## speed_ground    1.0000000 0.9889690 0.9317169
## speed_air       0.9889690 1.0000000 0.9452968
## distance        0.9317169 0.9452968 1.0000000

When comparing these three models, we observe that the coefficient for speed_ground undergoes significant changes in both magnitude and sign. Similarly, the coefficient for speed_air also changes, though to a lesser extent than speed_ground. Ideally, adding or removing covariates in a regression model should not cause substantial shifts in the values of regressor coefficients. When such drastic changes occur, it is a strong indication of multicollinearity among the predictors.

Let’s examine the collinearity between the two covariates. These variables are highly correlated, with a correlation coefficient of 0.989. To avoid multicollinearity, only one of these variables should be included in the final model. The variable with the higher Variance Inflation Factor (VIF) would typically be dropped.

In this case, both variables have the same VIF. Therefore, we can choose to drop the variable that is less correlated with the response variable. Since speed_ground is slightly less correlated with distance compared to speed_air (though the difference is minimal), we decide to exclude speed_ground from the model.
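The two VIFs are identical for a structural reason: with exactly two predictors, the R-squared from regressing one on the other is the squared correlation between them, so both share VIF = 1 / (1 - r^2). The sketch below applies this to the correlation reported above and then confirms the identity on simulated collinear data (x1 and x2 are hypothetical). It computes the VIF by hand and does not call car::vif().

```r
# Correlation between speed_ground and speed_air as reported in the output above.
r <- 0.988969

# With exactly two predictors, both share the same VIF = 1 / (1 - r^2).
vif_two_predictors <- 1 / (1 - r^2)
print(vif_two_predictors)

# Check the identity on simulated, strongly collinear predictors.
set.seed(11)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.15)

# VIF from the auxiliary regression of one predictor on the other.
aux_r2              <- summary(lm(x1 ~ x2))$r.squared
vif_from_regression <- 1 / (1 - aux_r2)

# VIF from the correlation alone.
vif_from_cor <- 1 / (1 - cor(x1, x2)^2)
print(c(vif_from_regression, vif_from_cor))
```

Since the VIF cannot break the tie between two predictors, a secondary criterion such as correlation with the response is needed, which is the choice made in the text.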

Variable Selection based on our ranking in Table 0.

Step 17

We will now fit multiple models, each model having one more covariate than the preceding model.

cov	rsqd
1	0.8936
2	0.9513
3	0.9751
4	0.9751
5	0.9752
6	0.9754
7	0.9754

Step 18

Let us plot the number of covariates against Adjusted R-Squared, a statistic that accounts for the number of predictors in the model by penalizing models with an excessive number of covariates.

R-squared never decreases when covariates are added to a multiple linear regression model, as the Step 17 table shows. Adjusted R-Squared is a more reliable measure for comparing models because it balances model complexity against goodness-of-fit.

cov	adj.rsqd
1	0.8930
2	0.9507
3	0.9747
4	0.9746
5	0.9745
6	0.9746
7	0.9745

Step 19

Finally, let us plot the number of covariates against another evaluation metric, AIC (Akaike Information Criterion). Among models fitted to the same data, a lower AIC value indicates a better model, as AIC balances goodness-of-fit against model complexity.

cov	aic.fig
1	2,773.274
2	2,622.261
3	2,492.778
4	2,494.348
5	2,495.985
6	2,496.322
7	2,498.048
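The three tables in Steps 17 to 19 come from the same sequence of nested models, so they can be produced in one pass. The following is a sketch on simulated data, assuming the predictors are added in a fixed ranked order; the coefficients and the toy data frame are invented and the resulting numbers will not match the report.

```r
set.seed(5)
n <- 196
# Simulated data; column names echo the report, values are made up.
toy <- data.frame(speed_air = rnorm(n, 104, 10),
                  aircraft  = rbinom(n, 1, 0.5),
                  height    = rnorm(n, 30, 10),
                  pitch     = rnorm(n, 4, 0.5))
toy$distance <- -6500 + 83 * toy$speed_air + 440 * toy$aircraft +
  14 * toy$height + rnorm(n, 0, 140)

# Predictors in ranked order; model k uses the first k of them.
ordered_predictors <- c("speed_air", "aircraft", "height", "pitch")

metrics <- do.call(rbind, lapply(seq_along(ordered_predictors), function(k) {
  fit <- lm(reformulate(ordered_predictors[1:k], response = "distance"),
            data = toy)
  s <- summary(fit)
  data.frame(cov = k, rsqd = s$r.squared,
             adj.rsqd = s$adj.r.squared, aic = AIC(fit))
}))
print(round(metrics, 4))
```

For the comparison to be valid, every model in the sequence must be fitted on the same rows; with missing values this means restricting the data to complete cases first.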

Step 20

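Adjusted R-Squared and AIC both identify Model 3 as the best predictive model for landing distance (LD): it achieves the highest Adjusted R-Squared (0.9747) and the lowest AIC (2,492.778) among the models reviewed. R-Squared, which cannot decrease as covariates are added, rises by only 0.0003 beyond Model 3 (from 0.9751 to 0.9754), so the extra covariates add essentially no explanatory power.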

Based on this analysis, the three variables selected for building a predictive model for LD are speed_air, aircraft, and height.

## 
## Call:
## lm(formula = distance ~ speed_air + aircraft + height, data = FAA_merge)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -304.14  -93.22   14.88   91.95  425.03 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6484.5380   110.1352  -58.88   <2e-16 ***
## speed_air      82.8340     0.9769   84.79   <2e-16 ***
## aircraft      439.0176    20.2013   21.73   <2e-16 ***
## height         14.2244     1.0500   13.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 137.7 on 192 degrees of freedom
##   (591 observations deleted due to missingness)
## Multiple R-squared:  0.9751, Adjusted R-squared:  0.9747 
## F-statistic:  2504 on 3 and 192 DF,  p-value: < 2.2e-16

Variable Selection based on Automatic Algorithm

Step 21

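We can use the stepAIC() function from the MASS package to perform stepwise variable selection automatically. As the trace below shows, the search starts from the full model and removes one variable at a time (backward elimination) until AIC no longer decreases.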

## Start:  AIC=1939.82
## distance ~ speed_air + aircraft + height + pitch + duration + 
##     no_pasg + speed_ground
## 
##                Df Sum of Sq      RSS    AIC
## - duration      1      3427  3593003 1938.0
## - speed_ground  1      5011  3594587 1938.1
## - pitch         1      9376  3598952 1938.3
## - no_pasg       1     30384  3619959 1939.5
## <none>                       3589575 1939.8
## - speed_air     1   3161032  6750608 2061.6
## - height        1   3410044  6999620 2068.7
## - aircraft      1   7904682 11494257 2165.9
## 
## Step:  AIC=1938.01
## distance ~ speed_air + aircraft + height + pitch + no_pasg + 
##     speed_ground
## 
##                Df Sum of Sq      RSS    AIC
## - speed_ground  1      6293  3599296 1936.3
## - pitch         1     10041  3603043 1936.6
## - no_pasg       1     32056  3625059 1937.8
## <none>                       3593003 1938.0
## - speed_air     1   3250783  6843786 2062.3
## - height        1   3434796  7027798 2067.5
## - aircraft      1   7901833 11494836 2163.9
## 
## Step:  AIC=1936.35
## distance ~ speed_air + aircraft + height + pitch + no_pasg
## 
##             Df Sum of Sq       RSS    AIC
## - pitch      1      8583   3607879 1934.8
## - no_pasg    1     32652   3631948 1936.1
## <none>                     3599296 1936.3
## - height     1   3469592   7068888 2066.7
## - aircraft   1   7897070  11496366 2162.0
## - speed_air  1 136189067 139788363 2651.6
## 
## Step:  AIC=1934.82
## distance ~ speed_air + aircraft + height + no_pasg
## 
##             Df Sum of Sq       RSS    AIC
## - no_pasg    1     32041   3639920 1934.5
## <none>                     3607879 1934.8
## - height     1   3476420   7084299 2065.1
## - aircraft   1   8871841  12479720 2176.1
## - speed_air  1 136312487 139920365 2649.8
## 
## Step:  AIC=1934.55
## distance ~ speed_air + aircraft + height
## 
##             Df Sum of Sq       RSS    AIC
## <none>                     3639920 1934.5
## - height     1   3479278   7119197 2064.0
## - aircraft   1   8953580  12593499 2175.8
## - speed_air  1 136291471 139931391 2647.8

## 
## Call:
## lm(formula = distance ~ speed_air + aircraft + height, data = FAA_merge)
## 
## Coefficients:
## (Intercept)    speed_air     aircraft       height  
##    -6484.54        82.83       439.02        14.22

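The stepAIC function confirms that the selected model remains unchanged. The stepwise selection process identifies the same model, with an AIC value of 1934.55. (This value is on a different scale from the AIC values in Step 19 because stepAIC omits constant terms; only differences between models matter.)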

distance ~ speed_air + aircraft + height

This result reaffirms the appropriateness of these variables (speed_air, aircraft, and height) for predicting landing distance.
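The AIC reported by the stepwise trace (1934.55) and the AIC listed for the same model in Step 19 (2,492.778) differ only by a constant. For an unweighted linear model, AIC() includes likelihood constants that extractAIC(), the function used by step() and MASS::stepAIC(), leaves out; the gap is n * (1 + log(2 * pi)) + 2. The sketch below checks this on simulated data (x and y are hypothetical), using n = 196 to match the number of complete cases.

```r
set.seed(9)
n <- 196
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
fit <- lm(y ~ x)

aic_full    <- AIC(fit)            # includes the likelihood constants
aic_stepaic <- extractAIC(fit)[2]  # scale used by step() and MASS::stepAIC()

# For an unweighted lm the gap depends only on n: n * (1 + log(2 * pi)) + 2.
gap_observed <- aic_full - aic_stepaic
gap_formula  <- n * (1 + log(2 * pi)) + 2
cat("observed gap:", gap_observed, "\n")
cat("formula gap: ", gap_formula, "\n")
```

With n = 196 the gap is about 558.2, which accounts for the difference between 1934.55 and 2,492.778 up to rounding. Model rankings are identical on either scale.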

FAA Project Part 1

Silpa Prakash Rao (M16141545)