Initial exploration of the data

library(readxl)

FAA1 <- read_excel("C:/Users/Center/Downloads/FAA1.xls")
FAA2 <- read_excel("C:/Users/Center/Downloads/FAA2.xls")
str(FAA1)
## tibble [800 x 8] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : chr [1:800] "boeing" "boeing" "boeing" "boeing" ...
##  $ duration    : num [1:800] 98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num [1:800] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:800] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:800] 109 103 NA NA NA ...
##  $ height      : num [1:800] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:800] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:800] 3370 2988 1145 1664 1050 ...
str(FAA2)
## tibble [150 x 7] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : chr [1:150] "boeing" "boeing" "boeing" "boeing" ...
##  $ no_pasg     : num [1:150] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:150] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:150] 109 103 NA NA NA ...
##  $ height      : num [1:150] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:150] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:150] 3370 2988 1145 1664 1050 ...
dim(FAA1)
## [1] 800   8
dim(FAA2)
## [1] 150   7
colnames(FAA1)
## [1] "aircraft"     "duration"     "no_pasg"      "speed_ground" "speed_air"   
## [6] "height"       "pitch"        "distance"
colnames(FAA2)
## [1] "aircraft"     "no_pasg"      "speed_ground" "speed_air"    "height"      
## [6] "pitch"        "distance"

FAA1 : 800 samples, 8 variables FAA2 : 150 samples. 7 variables Yes, the duration variable was missing in FAA2 dataset apart from the difference in the sample size.

FAA <- merge(FAA1,FAA2)
str(FAA)
## 'data.frame':    100 obs. of  8 variables:
##  $ aircraft    : chr  "boeing" "boeing" "boeing" "boeing" ...
##  $ no_pasg     : num  44 44 46 47 47 49 49 49 49 51 ...
##  $ speed_ground: num  72.5 95.1 39.8 90.4 95.3 ...
##  $ speed_air   : num  NA 96.1 NA NA 94.2 ...
##  $ height      : num  42.9 17.4 39.7 14.1 30.3 ...
##  $ pitch       : num  4.03 3.2 4.6 4.04 3.65 ...
##  $ distance    : num  1321 2184 1030 1593 2233 ...
##  $ duration    : num  124.9 83.5 142.2 146.4 156.8 ...

Yes, there are duplications. 100 entries were found that contain identical values. Final merged dataset can be obtained by following:

new_FAA <- merge(FAA1, FAA2, all = TRUE)
str(new_FAA)
## 'data.frame':    850 obs. of  8 variables:
##  $ aircraft    : chr  "airbus" "airbus" "airbus" "airbus" ...
##  $ no_pasg     : num  36 38 40 41 43 44 45 45 45 45 ...
##  $ speed_ground: num  47.5 85.2 80.6 97.6 82.5 ...
##  $ speed_air   : num  NA NA NA 97 NA ...
##  $ height      : num  14 37 28.6 38.4 30.1 ...
##  $ pitch       : num  4.3 4.12 3.62 3.53 4.09 ...
##  $ distance    : num  251 1257 1021 2168 1321 ...
##  $ duration    : num  172 188 93.5 123.3 109.2 ...
str(new_FAA)
## 'data.frame':    850 obs. of  8 variables:
##  $ aircraft    : chr  "airbus" "airbus" "airbus" "airbus" ...
##  $ no_pasg     : num  36 38 40 41 43 44 45 45 45 45 ...
##  $ speed_ground: num  47.5 85.2 80.6 97.6 82.5 ...
##  $ speed_air   : num  NA NA NA 97 NA ...
##  $ height      : num  14 37 28.6 38.4 30.1 ...
##  $ pitch       : num  4.3 4.12 3.62 3.53 4.09 ...
##  $ distance    : num  251 1257 1021 2168 1321 ...
##  $ duration    : num  172 188 93.5 123.3 109.2 ...
dim(new_FAA)
## [1] 850   8

There are 850 observations with 8 variables in the new FAA dataset. Following is the summary statistics for each variable.

summary(new_FAA)
##    aircraft            no_pasg      speed_ground      speed_air     
##  Length:850         Min.   :29.0   Min.   : 27.74   Min.   : 90.00  
##  Class :character   1st Qu.:55.0   1st Qu.: 65.90   1st Qu.: 96.25  
##  Mode  :character   Median :60.0   Median : 79.64   Median :101.15  
##                     Mean   :60.1   Mean   : 79.45   Mean   :103.80  
##                     3rd Qu.:65.0   3rd Qu.: 92.06   3rd Qu.:109.40  
##                     Max.   :87.0   Max.   :141.22   Max.   :141.72  
##                                                     NA's   :642     
##      height           pitch          distance          duration     
##  Min.   :-3.546   Min.   :2.284   Min.   :  34.08   Min.   : 14.76  
##  1st Qu.:23.314   1st Qu.:3.642   1st Qu.: 883.79   1st Qu.:119.49  
##  Median :30.093   Median :4.008   Median :1258.09   Median :153.95  
##  Mean   :30.144   Mean   :4.009   Mean   :1526.02   Mean   :154.01  
##  3rd Qu.:36.993   3rd Qu.:4.377   3rd Qu.:1936.95   3rd Qu.:188.91  
##  Max.   :59.946   Max.   :5.927   Max.   :6533.05   Max.   :305.62  
##                                                     NA's   :50

Ans. There are crucial observations based out of the summary statistics generated from the dataset.

  1. The duration column has a minimum of 14.76 minutes with 50 missing values. A normal flight duration should be greater than 40 minutes.
  2. Faulty data present in the sample since the height attribute has a minimum value of -3.546 which is abnormal.
  3. The maximum landing distance is 6533.05 feet which is beyond the usual 6000 feet.
  4. There are potential outliers in the sample in distance
  5. There are 642 missing values present in speed_air and 27.74 as the minimum speed_ground which is less than the specified 30 MPH. Max values are beyond the 140 MPH threshold for both variables as well.

Data Cleaning and further exploration

Following are the conditions to identify “abnormal landing”:

  1. duration < 40
  2. Speed_ground < 30 or > 140
  3. Speed_air < 30 or > 140
  4. Height < 6
  5. distance > 6000
# criteria 1
a <- which(new_FAA$duration < 40)
b <- which(new_FAA$speed_ground < 30)
c <- which(new_FAA$speed_ground > 140)
d <- which(new_FAA$speed_air < 30)
e <- which(new_FAA$speed_air > 140)
f <- which(new_FAA$height < 6)
g <- which(new_FAA$distance > 6000)

new_FAA <- new_FAA[-c(a,b,c,d,e,f,g), ]
dim(new_FAA)
## [1] 831   8

Total rows removed: 950 - 831 = 19 rows

str(new_FAA)
## 'data.frame':    831 obs. of  8 variables:
##  $ aircraft    : chr  "airbus" "airbus" "airbus" "airbus" ...
##  $ no_pasg     : num  36 38 40 41 43 44 45 45 45 45 ...
##  $ speed_ground: num  47.5 85.2 80.6 97.6 82.5 ...
##  $ speed_air   : num  NA NA NA 97 NA ...
##  $ height      : num  14 37 28.6 38.4 30.1 ...
##  $ pitch       : num  4.3 4.12 3.62 3.53 4.09 ...
##  $ distance    : num  251 1257 1021 2168 1321 ...
##  $ duration    : num  172 188 93.5 123.3 109.2 ...
dim(new_FAA)
## [1] 831   8
summary(new_FAA)
##    aircraft            no_pasg       speed_ground      speed_air     
##  Length:831         Min.   :29.00   Min.   : 33.57   Min.   : 90.00  
##  Class :character   1st Qu.:55.00   1st Qu.: 66.20   1st Qu.: 96.23  
##  Mode  :character   Median :60.00   Median : 79.79   Median :101.12  
##                     Mean   :60.06   Mean   : 79.54   Mean   :103.48  
##                     3rd Qu.:65.00   3rd Qu.: 91.91   3rd Qu.:109.36  
##                     Max.   :87.00   Max.   :132.78   Max.   :132.91  
##                                                      NA's   :628     
##      height           pitch          distance          duration     
##  Min.   : 6.228   Min.   :2.284   Min.   :  41.72   Min.   : 41.95  
##  1st Qu.:23.530   1st Qu.:3.640   1st Qu.: 893.28   1st Qu.:119.63  
##  Median :30.167   Median :4.001   Median :1262.15   Median :154.28  
##  Mean   :30.458   Mean   :4.005   Mean   :1522.48   Mean   :154.78  
##  3rd Qu.:37.004   3rd Qu.:4.370   3rd Qu.:1936.63   3rd Qu.:189.66  
##  Max.   :59.946   Max.   :5.927   Max.   :5381.96   Max.   :305.62  
##                                                     NA's   :50
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.0.5
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
hist.data.frame(new_FAA)

Following are the findings:

  1. 831 observations with 8 variables
  2. Min of landing distance is 41.72 and Max is 5381.96
  3. 628 missing values in speed_air variable
  4. Height of the flight within the defined limits
  5. Right skew observed in the distributions of speed_air and distance variables

Initial analysis for identifying important factors that impact the response variable distance

# impute mean value for NAs in duration
new_FAA$duration[is.na(new_FAA$duration)] <- mean(new_FAA$duration[!is.na(new_FAA$duration)])

# conduct regression imputation for speed_air since 75% data is missing
encode <- function(column)
{
  encoder = length(column)
  encoder[which(!is.na(column))] = 1
  encoder[which(is.na(column))] = 0
  return(encoder)
}

new_FAA$impute_location <- encode(new_FAA$speed_air)
imputed_reg <- lm(new_FAA$speed_air ~ new_FAA$speed_ground)

for (i in 1:nrow(new_FAA))
{
  if (new_FAA$impute_location[i] == 0)
  {
    new_FAA$speed_air[i] = 2.93 +0.98*new_FAA$speed_ground[i]
  }
}
# calculate correlation factor between distance and aircraft type
aircraft_corr <- cor(new_FAA$distance, as.numeric(as.factor(new_FAA$aircraft)))

# calculate pairwise correlation between distance and rest all other variables
# omit NAs for the purpose of correlation
cor_duration <- cor(new_FAA$distance, new_FAA$duration)
cor_speedground <- cor(new_FAA$distance, new_FAA$speed_ground)
cor_pasg <- cor(new_FAA$distance, new_FAA$no_pasg)
cor_speedair <- cor(new_FAA$distance, new_FAA$speed_air)
cor_height <- cor(new_FAA$distance, new_FAA$height)
cor_pitch <- cor(new_FAA$distance, new_FAA$pitch)

corr_size <- c(aircraft_corr, cor_duration, cor_speedground, cor_pasg, cor_speedair, cor_height, cor_pitch)
direction <- ifelse(corr_size > 0, "Positive", "Negative")
names <- c("aircraft","duration","speed_ground","no_pasg","speed_air","height","pitch")
Table1 <- data.frame(names, corr_size, direction)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
Table1 <- arrange(Table1, desc(abs(corr_size)))
Table1
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(data = new_FAA)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Yes, the values obtained are in consistence with the previous step.

Step 12. Have you included the airplane make as a possible factor in Steps 10-11? You can code this character variable as 0/1.

Yes, the airplane variable was converted into factor and then into numeric in Step 10.

new_FAA$aircraft <- as.factor(new_FAA$aircraft)
levels(new_FAA$aircraft) <- c(1, 0)
head(new_FAA)

Regression using a single factor each time

aircraft <- lm(new_FAA$distance ~ new_FAA$aircraft)
duration <- lm(new_FAA$distance ~ new_FAA$duration)
no_pasg <- lm(new_FAA$distance ~ new_FAA$no_pasg)
speed_ground <- lm(new_FAA$distance ~ new_FAA$speed_ground)
speed_air <- lm(new_FAA$distance ~ new_FAA$speed_air)
height <- lm(new_FAA$distance ~ new_FAA$height)
pitch <- lm(new_FAA$distance ~ new_FAA$pitch)

coeff_aircraft <- summary(aircraft)$coefficients[2,4]
coeff_duration <- summary(duration)$coefficients[2,4]
coeff_no_pasg <- summary(no_pasg)$coefficients[2,4]
coeff_speed_ground <- summary(speed_ground)$coefficients[2,4]
coeff_speed_air <- summary(speed_air)$coefficients[2,4]
coeff_height <- summary(height)$coefficients[2,4]
coeff_pitch <- summary(pitch)$coefficients[2,4]

names <- c("aircraft", "duration","no_pasg","speed_ground","speed_air","height","pitch")
coefficient_size <- c(coeff_aircraft, coeff_duration, coeff_no_pasg, coeff_speed_ground, coeff_speed_air, coeff_height, coeff_pitch)
direction <- ifelse(coefficient_size > 0, "Positive", "Negative")

Table2 <- data.frame(names, coefficient_size, direction)
names(Table2) <- c('names of variables', 'size_pvalue', 'direction of the regressioncoeff')
Table2 <- Table2[order(Table2$size_pvalue), , drop = FALSE]
Table2
normalize <- function(column){
  return ((na.omit(column) - mean(column, na.rm = T)) / sd(na.omit(column)))
}

aircraft_lm <- lm(new_FAA$distance ~ normalize(as.numeric(new_FAA$aircraft)))
duration_lm <- lm(new_FAA$distance ~ normalize(new_FAA$duration))
no_pasg_lm <- lm(new_FAA$distance ~ normalize(new_FAA$no_pasg))
speed_ground_lm <- lm(new_FAA$distance ~ normalize(new_FAA$speed_ground))
speed_air_lm <- lm(new_FAA$distance ~ normalize(new_FAA$speed_air))
height_lm <- lm(new_FAA$distance ~ normalize(new_FAA$height))
pitch_lm <- lm(new_FAA$distance ~ normalize(new_FAA$pitch))

coeff_aircraft <- summary(aircraft_lm)$coefficients[2,1]
coeff_duration <- summary(duration_lm)$coefficients[2,1]
coeff_no_pasg <- summary(no_pasg_lm)$coefficients[2,1]
coeff_speed_ground <- summary(speed_ground_lm)$coefficients[2,1]
coeff_speed_air <- summary(speed_air_lm)$coefficients[2,1]
coeff_height <- summary(height_lm)$coefficients[2,1]
coeff_pitch <- summary(pitch_lm)$coefficients[2,1]

names <- c("aircraft", "duration","no_pasg","speed_ground","speed_air","height","pitch")
coefficient_size <- c(coeff_aircraft, coeff_duration, coeff_no_pasg, coeff_speed_ground, coeff_speed_air, coeff_height, coeff_pitch)
direction <- ifelse(coefficient_size > 0, "Positive", "Negative")

Table3 <- data.frame(names, coefficient_size, direction)
names(Table3) <- c('names of variables', 'coefficient_value', 'direction of the regressioncoeff')
Table3 <- arrange(Table3, desc(Table3$coefficient_value))
Table3
Table0 <- data.frame(Table1[1], Table2[1], Table3[1])
Table0 <- Table0[1]
Table0

The results are consistent across the three tables and the variables are arranged based on output from the three tables.

Check collinearity

Do you observe any significance change and sign change? Check the correlation between Speed_ground and Speed_air. You may want to keep one of them in the model selection. Which one would you pick? Why?

model3 <- lm(new_FAA$distance ~ new_FAA$speed_ground + new_FAA$speed_air, data = new_FAA)

summary(speed_ground)
## 
## Call:
## lm(formula = new_FAA$distance ~ new_FAA$speed_ground)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -897.09 -319.16  -72.09  210.83 1798.88 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1773.9407    67.8388  -26.15   <2e-16 ***
## new_FAA$speed_ground    41.4422     0.8302   49.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 448.1 on 829 degrees of freedom
## Multiple R-squared:  0.7504, Adjusted R-squared:  0.7501 
## F-statistic:  2492 on 1 and 829 DF,  p-value: < 2.2e-16
summary(speed_air)
## 
## Call:
## lm(formula = new_FAA$distance ~ new_FAA$speed_air)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -903.06 -320.50  -70.28  215.92 1817.21 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1930.1608    71.2487  -27.09   <2e-16 ***
## new_FAA$speed_air    42.7893     0.8616   49.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 449.8 on 829 degrees of freedom
## Multiple R-squared:  0.7485, Adjusted R-squared:  0.7481 
## F-statistic:  2467 on 1 and 829 DF,  p-value: < 2.2e-16
summary(model3)
## 
## Call:
## lm(formula = new_FAA$distance ~ new_FAA$speed_ground + new_FAA$speed_air, 
##     data = new_FAA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -896.00 -317.85  -72.66  207.47 1796.14 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1741.513    102.261 -17.030   <2e-16 ***
## new_FAA$speed_ground    49.644     19.365   2.564   0.0105 *  
## new_FAA$speed_air       -8.488     20.020  -0.424   0.6717    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 448.3 on 828 degrees of freedom
## Multiple R-squared:  0.7504, Adjusted R-squared:  0.7498 
## F-statistic:  1245 on 2 and 828 DF,  p-value: < 2.2e-16
cor(new_FAA$speed_air, new_FAA$speed_ground)
## [1] 0.9990797

The Model 3: LD ~ Speed_ground + Speed_air has the sign of the speed_air change from positive to negative. We get a positive correlation between the speed_ground and speed_air to establish a strong positive linear association between the variables, therefore collinearity exists in this situation.

Based on the Adjusted R-squared, I will keep the Model 1: LD ~ Speed_ground.

Variable selection based on our ranking in Table 0.

Calculate the R-squared for each model. Plot these R-squared values versus the number of variables p. What patterns do you observe?

Table0
model1 <- lm(new_FAA$distance ~ new_FAA$speed_ground)
model2 <- lm(new_FAA$distance ~ new_FAA$speed_ground + new_FAA$aircraft)
model3 <- lm(new_FAA$distance ~ new_FAA$speed_ground + new_FAA$aircraft + new_FAA$height)
model4 <- lm(new_FAA$distance ~ new_FAA$speed_ground + new_FAA$aircraft + new_FAA$height + new_FAA$pitch)
model5 <- lm(new_FAA$distance ~ new_FAA$speed_ground + new_FAA$aircraft + new_FAA$height + new_FAA$pitch + new_FAA$duration)
model6 <- lm(new_FAA$distance ~ new_FAA$speed_ground + new_FAA$aircraft + new_FAA$height + new_FAA$pitch + new_FAA$duration + new_FAA$no_pasg)

models<-c("model 1","model 2","model 3","model 4","model 5","model 6")
rsq_val <- c(summary(model1)$r.squared, summary(model2)$r.squared, summary(model3)$r.squared, summary(model4)$r.squared, summary(model5)$r.squared, summary(model6)$r.squared)
variable_count <-c(1,2,3,4,5,6)
compare <- data.frame(models, rsq_val, variable_count)
compare
#plot
ggplot(compare, aes(x = variable_count, y = rsq_val)) + 
  geom_point() +
    labs(x = "Number of Variables", y = "R-squared Value")

As we add more variables into the model, the R-squared value increases.

adj_rsq_val <- c(summary(model1)$adj.r.squared, summary(model2)$adj.r.squared, summary(model3)$adj.r.squared, summary(model4)$adj.r.squared, summary(model5)$adj.r.squared, summary(model6)$adj.r.squared)
variable_count <-c(1,2,3,4,5,6)
compare <- data.frame(models, adj_rsq_val, variable_count)
compare
#plot
ggplot(compare, aes(x = variable_count, y = adj_rsq_val)) + 
  geom_point() +
    labs(x = "Number of Variables", y = "Adjusted R-squared Value")

The adjusted R-squared parameter remains comparable after addition of 4 variables and beyond. The four variables are speed_ground,aircraft,height and pitch.

aic1 <- AIC(model1)
aic2 <- AIC(model2)
aic3 <- AIC(model3)
aic4 <- AIC(model4)
aic5 <- AIC(model5)
aic6 <- AIC(model6)

aic <- c(aic1, aic2, aic3, aic4, aic5, aic6)
variable_count <-c(1,2,3,4,5,6)
compare <- data.frame(models, aic, variable_count)
compare <- arrange(compare, -desc(aic))
compare
#plot
ggplot(compare, aes(x = variable_count, y = aic)) + 
  geom_point() +
    labs(x = "Number of Variables", y = "AIC Value")

The Model 4 is chosen since it produces the lowest AIC value with 12095.05. The four variables are speed_ground, aircraft, height and pitch.

Steps 18 and 19 result in similar result output. We will select speed_ground, aircraft, height and pitch as variables to build a predictive model for LD.

Variable selection based on automate algorithm. Step 21. Use the R function “StepAIC” to perform forward variable selection. Compare the result with that in Step 19.

selection <- lm(new_FAA$distance~1)
step(selection, direction = "forward", scope = formula(model6))
## Start:  AIC=11299.8
## new_FAA$distance ~ 1
## 
##                        Df Sum of Sq       RSS   AIC
## + new_FAA$speed_ground  1 500382567 166457762 10148
## + new_FAA$aircraft      1  37818390 629021939 11253
## + new_FAA$height        1   6590108 660250221 11294
## + new_FAA$pitch         1   5050617 661789712 11296
## + new_FAA$duration      1   1685114 665155215 11300
## <none>                              666840329 11300
## + new_FAA$no_pasg       1    210253 666630076 11302
## 
## Step:  AIC=10148.53
## new_FAA$distance ~ new_FAA$speed_ground
## 
##                    Df Sum of Sq       RSS     AIC
## + new_FAA$aircraft  1  49848656 116609106  9854.8
## + new_FAA$height    1  14916377 151541385 10072.5
## + new_FAA$pitch     1   9765095 156692668 10100.3
## <none>                          166457762 10148.5
## + new_FAA$no_pasg   1    207528 166250234 10149.5
## + new_FAA$duration  1     51669 166406094 10150.3
## 
## Step:  AIC=9854.77
## new_FAA$distance ~ new_FAA$speed_ground + new_FAA$aircraft
## 
##                    Df Sum of Sq       RSS    AIC
## + new_FAA$height    1  15848830 100760276 9735.4
## + new_FAA$pitch     1    455453 116153653 9853.5
## <none>                          116609106 9854.8
## + new_FAA$no_pasg   1     87171 116521935 9856.1
## + new_FAA$duration  1      8445 116600661 9856.7
## 
## Step:  AIC=9735.37
## new_FAA$distance ~ new_FAA$speed_ground + new_FAA$aircraft + 
##     new_FAA$height
## 
##                    Df Sum of Sq       RSS    AIC
## + new_FAA$pitch     1    315259 100445017 9734.8
## <none>                          100760276 9735.4
## + new_FAA$no_pasg   1    232003 100528273 9735.5
## + new_FAA$duration  1      3976 100756300 9737.3
## 
## Step:  AIC=9734.77
## new_FAA$distance ~ new_FAA$speed_ground + new_FAA$aircraft + 
##     new_FAA$height + new_FAA$pitch
## 
##                    Df Sum of Sq       RSS    AIC
## <none>                          100445017 9734.8
## + new_FAA$no_pasg   1    225608 100219409 9734.9
## + new_FAA$duration  1      6696 100438321 9736.7
## 
## Call:
## lm(formula = new_FAA$distance ~ new_FAA$speed_ground + new_FAA$aircraft + 
##     new_FAA$height + new_FAA$pitch)
## 
## Coefficients:
##          (Intercept)  new_FAA$speed_ground     new_FAA$aircraft0  
##             -2664.32                 42.43                481.27  
##       new_FAA$height         new_FAA$pitch  
##                14.09                 39.61

The Step forward AIC algorithm suggests to use all the variables 9734.77. The variables picked in Step 21 are consistent with the ones found in Step 18-20.