Initial exploration of the data
library(readxl)
FAA1 <- read_excel("C:/Users/Center/Downloads/FAA1.xls")
FAA2 <- read_excel("C:/Users/Center/Downloads/FAA2.xls")
str(FAA1)
## tibble [800 x 8] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:800] "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num [1:800] 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num [1:800] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:800] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:800] 109 103 NA NA NA ...
## $ height : num [1:800] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:800] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:800] 3370 2988 1145 1664 1050 ...
str(FAA2)
## tibble [150 x 7] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:150] "boeing" "boeing" "boeing" "boeing" ...
## $ no_pasg : num [1:150] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:150] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:150] 109 103 NA NA NA ...
## $ height : num [1:150] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:150] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:150] 3370 2988 1145 1664 1050 ...
dim(FAA1)
## [1] 800 8
dim(FAA2)
## [1] 150 7
colnames(FAA1)
## [1] "aircraft" "duration" "no_pasg" "speed_ground" "speed_air"
## [6] "height" "pitch" "distance"
colnames(FAA2)
## [1] "aircraft" "no_pasg" "speed_ground" "speed_air" "height"
## [6] "pitch" "distance"
FAA1 : 800 samples, 8 variables FAA2 : 150 samples. 7 variables Yes, the duration variable was missing in FAA2 dataset apart from the difference in the sample size.
FAA <- merge(FAA1,FAA2)
str(FAA)
## 'data.frame': 100 obs. of 8 variables:
## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
## $ no_pasg : num 44 44 46 47 47 49 49 49 49 51 ...
## $ speed_ground: num 72.5 95.1 39.8 90.4 95.3 ...
## $ speed_air : num NA 96.1 NA NA 94.2 ...
## $ height : num 42.9 17.4 39.7 14.1 30.3 ...
## $ pitch : num 4.03 3.2 4.6 4.04 3.65 ...
## $ distance : num 1321 2184 1030 1593 2233 ...
## $ duration : num 124.9 83.5 142.2 146.4 156.8 ...
Yes, there are duplications. 100 entries were found that contain identical values. Final merged dataset can be obtained by following:
new_FAA <- merge(FAA1, FAA2, all = TRUE)
str(new_FAA)
## 'data.frame': 850 obs. of 8 variables:
## $ aircraft : chr "airbus" "airbus" "airbus" "airbus" ...
## $ no_pasg : num 36 38 40 41 43 44 45 45 45 45 ...
## $ speed_ground: num 47.5 85.2 80.6 97.6 82.5 ...
## $ speed_air : num NA NA NA 97 NA ...
## $ height : num 14 37 28.6 38.4 30.1 ...
## $ pitch : num 4.3 4.12 3.62 3.53 4.09 ...
## $ distance : num 251 1257 1021 2168 1321 ...
## $ duration : num 172 188 93.5 123.3 109.2 ...
str(new_FAA)
## 'data.frame': 850 obs. of 8 variables:
## $ aircraft : chr "airbus" "airbus" "airbus" "airbus" ...
## $ no_pasg : num 36 38 40 41 43 44 45 45 45 45 ...
## $ speed_ground: num 47.5 85.2 80.6 97.6 82.5 ...
## $ speed_air : num NA NA NA 97 NA ...
## $ height : num 14 37 28.6 38.4 30.1 ...
## $ pitch : num 4.3 4.12 3.62 3.53 4.09 ...
## $ distance : num 251 1257 1021 2168 1321 ...
## $ duration : num 172 188 93.5 123.3 109.2 ...
dim(new_FAA)
## [1] 850 8
There are 850 observations with 8 variables in the new FAA dataset. Following is the summary statistics for each variable.
summary(new_FAA)
## aircraft no_pasg speed_ground speed_air
## Length:850 Min. :29.0 Min. : 27.74 Min. : 90.00
## Class :character 1st Qu.:55.0 1st Qu.: 65.90 1st Qu.: 96.25
## Mode :character Median :60.0 Median : 79.64 Median :101.15
## Mean :60.1 Mean : 79.45 Mean :103.80
## 3rd Qu.:65.0 3rd Qu.: 92.06 3rd Qu.:109.40
## Max. :87.0 Max. :141.22 Max. :141.72
## NA's :642
## height pitch distance duration
## Min. :-3.546 Min. :2.284 Min. : 34.08 Min. : 14.76
## 1st Qu.:23.314 1st Qu.:3.642 1st Qu.: 883.79 1st Qu.:119.49
## Median :30.093 Median :4.008 Median :1258.09 Median :153.95
## Mean :30.144 Mean :4.009 Mean :1526.02 Mean :154.01
## 3rd Qu.:36.993 3rd Qu.:4.377 3rd Qu.:1936.95 3rd Qu.:188.91
## Max. :59.946 Max. :5.927 Max. :6533.05 Max. :305.62
## NA's :50
Ans. There are crucial observations based out of the summary statistics generated from the dataset.
duration column has a minimum of 14.76 minutes with 50 missing values. A normal flight duration should be greater than 40 minutes.height attribute has a minimum value of -3.546 which is abnormal.distance is 6533.05 feet which is beyond the usual 6000 feet.distancespeed_air and 27.74 as the minimum speed_ground which is less than the specified 30 MPH. Max values are beyond the 140 MPH threshold for both variables as well.Data Cleaning and further exploration
Following are the conditions to identify “abnormal landing”:
# criteria 1
a <- which(new_FAA$duration < 40)
b <- which(new_FAA$speed_ground < 30)
c <- which(new_FAA$speed_ground > 140)
d <- which(new_FAA$speed_air < 30)
e <- which(new_FAA$speed_air > 140)
f <- which(new_FAA$height < 6)
g <- which(new_FAA$distance > 6000)
new_FAA <- new_FAA[-c(a,b,c,d,e,f,g), ]
dim(new_FAA)
## [1] 831 8
Total rows removed: 950 - 831 = 19 rows
str(new_FAA)
## 'data.frame': 831 obs. of 8 variables:
## $ aircraft : chr "airbus" "airbus" "airbus" "airbus" ...
## $ no_pasg : num 36 38 40 41 43 44 45 45 45 45 ...
## $ speed_ground: num 47.5 85.2 80.6 97.6 82.5 ...
## $ speed_air : num NA NA NA 97 NA ...
## $ height : num 14 37 28.6 38.4 30.1 ...
## $ pitch : num 4.3 4.12 3.62 3.53 4.09 ...
## $ distance : num 251 1257 1021 2168 1321 ...
## $ duration : num 172 188 93.5 123.3 109.2 ...
dim(new_FAA)
## [1] 831 8
summary(new_FAA)
## aircraft no_pasg speed_ground speed_air
## Length:831 Min. :29.00 Min. : 33.57 Min. : 90.00
## Class :character 1st Qu.:55.00 1st Qu.: 66.20 1st Qu.: 96.23
## Mode :character Median :60.00 Median : 79.79 Median :101.12
## Mean :60.06 Mean : 79.54 Mean :103.48
## 3rd Qu.:65.00 3rd Qu.: 91.91 3rd Qu.:109.36
## Max. :87.00 Max. :132.78 Max. :132.91
## NA's :628
## height pitch distance duration
## Min. : 6.228 Min. :2.284 Min. : 41.72 Min. : 41.95
## 1st Qu.:23.530 1st Qu.:3.640 1st Qu.: 893.28 1st Qu.:119.63
## Median :30.167 Median :4.001 Median :1262.15 Median :154.28
## Mean :30.458 Mean :4.005 Mean :1522.48 Mean :154.78
## 3rd Qu.:37.004 3rd Qu.:4.370 3rd Qu.:1936.63 3rd Qu.:189.66
## Max. :59.946 Max. :5.927 Max. :5381.96 Max. :305.62
## NA's :50
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.0.5
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
hist.data.frame(new_FAA)
Following are the findings:
speed_air variablespeed_air and distance variablesInitial analysis for identifying important factors that impact the response variable distance
# impute mean value for NAs in duration
new_FAA$duration[is.na(new_FAA$duration)] <- mean(new_FAA$duration[!is.na(new_FAA$duration)])
# conduct regression imputation for speed_air since 75% data is missing
encode <- function(column)
{
encoder = length(column)
encoder[which(!is.na(column))] = 1
encoder[which(is.na(column))] = 0
return(encoder)
}
new_FAA$impute_location <- encode(new_FAA$speed_air)
imputed_reg <- lm(new_FAA$speed_air ~ new_FAA$speed_ground)
for (i in 1:nrow(new_FAA))
{
if (new_FAA$impute_location[i] == 0)
{
new_FAA$speed_air[i] = 2.93 +0.98*new_FAA$speed_ground[i]
}
}
# calculate correlation factor between distance and aircraft type
aircraft_corr <- cor(new_FAA$distance, as.numeric(as.factor(new_FAA$aircraft)))
# calculate pairwise correlation between distance and rest all other variables
# omit NAs for the purpose of correlation
cor_duration <- cor(new_FAA$distance, new_FAA$duration)
cor_speedground <- cor(new_FAA$distance, new_FAA$speed_ground)
cor_pasg <- cor(new_FAA$distance, new_FAA$no_pasg)
cor_speedair <- cor(new_FAA$distance, new_FAA$speed_air)
cor_height <- cor(new_FAA$distance, new_FAA$height)
cor_pitch <- cor(new_FAA$distance, new_FAA$pitch)
corr_size <- c(aircraft_corr, cor_duration, cor_speedground, cor_pasg, cor_speedair, cor_height, cor_pitch)
direction <- ifelse(corr_size > 0, "Positive", "Negative")
names <- c("aircraft","duration","speed_ground","no_pasg","speed_air","height","pitch")
Table1 <- data.frame(names, corr_size, direction)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Table1 <- arrange(Table1, desc(abs(corr_size)))
Table1
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(data = new_FAA)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Yes, the values obtained are in consistence with the previous step.
Step 12. Have you included the airplane make as a possible factor in Steps 10-11? You can code this character variable as 0/1.
Yes, the airplane variable was converted into factor and then into numeric in Step 10.
new_FAA$aircraft <- as.factor(new_FAA$aircraft)
levels(new_FAA$aircraft) <- c(1, 0)
head(new_FAA)
Regression using a single factor each time
aircraft <- lm(new_FAA$distance ~ new_FAA$aircraft)
duration <- lm(new_FAA$distance ~ new_FAA$duration)
no_pasg <- lm(new_FAA$distance ~ new_FAA$no_pasg)
speed_ground <- lm(new_FAA$distance ~ new_FAA$speed_ground)
speed_air <- lm(new_FAA$distance ~ new_FAA$speed_air)
height <- lm(new_FAA$distance ~ new_FAA$height)
pitch <- lm(new_FAA$distance ~ new_FAA$pitch)
coeff_aircraft <- summary(aircraft)$coefficients[2,4]
coeff_duration <- summary(duration)$coefficients[2,4]
coeff_no_pasg <- summary(no_pasg)$coefficients[2,4]
coeff_speed_ground <- summary(speed_ground)$coefficients[2,4]
coeff_speed_air <- summary(speed_air)$coefficients[2,4]
coeff_height <- summary(height)$coefficients[2,4]
coeff_pitch <- summary(pitch)$coefficients[2,4]
names <- c("aircraft", "duration","no_pasg","speed_ground","speed_air","height","pitch")
coefficient_size <- c(coeff_aircraft, coeff_duration, coeff_no_pasg, coeff_speed_ground, coeff_speed_air, coeff_height, coeff_pitch)
direction <- ifelse(coefficient_size > 0, "Positive", "Negative")
Table2 <- data.frame(names, coefficient_size, direction)
names(Table2) <- c('names of variables', 'size_pvalue', 'direction of the regressioncoeff')
Table2 <- Table2[order(Table2$size_pvalue), , drop = FALSE]
Table2
normalize <- function(column){
return ((na.omit(column) - mean(column, na.rm = T)) / sd(na.omit(column)))
}
aircraft_lm <- lm(new_FAA$distance ~ normalize(as.numeric(new_FAA$aircraft)))
duration_lm <- lm(new_FAA$distance ~ normalize(new_FAA$duration))
no_pasg_lm <- lm(new_FAA$distance ~ normalize(new_FAA$no_pasg))
speed_ground_lm <- lm(new_FAA$distance ~ normalize(new_FAA$speed_ground))
speed_air_lm <- lm(new_FAA$distance ~ normalize(new_FAA$speed_air))
height_lm <- lm(new_FAA$distance ~ normalize(new_FAA$height))
pitch_lm <- lm(new_FAA$distance ~ normalize(new_FAA$pitch))
coeff_aircraft <- summary(aircraft_lm)$coefficients[2,1]
coeff_duration <- summary(duration_lm)$coefficients[2,1]
coeff_no_pasg <- summary(no_pasg_lm)$coefficients[2,1]
coeff_speed_ground <- summary(speed_ground_lm)$coefficients[2,1]
coeff_speed_air <- summary(speed_air_lm)$coefficients[2,1]
coeff_height <- summary(height_lm)$coefficients[2,1]
coeff_pitch <- summary(pitch_lm)$coefficients[2,1]
names <- c("aircraft", "duration","no_pasg","speed_ground","speed_air","height","pitch")
coefficient_size <- c(coeff_aircraft, coeff_duration, coeff_no_pasg, coeff_speed_ground, coeff_speed_air, coeff_height, coeff_pitch)
direction <- ifelse(coefficient_size > 0, "Positive", "Negative")
Table3 <- data.frame(names, coefficient_size, direction)
names(Table3) <- c('names of variables', 'coefficient_value', 'direction of the regressioncoeff')
Table3 <- arrange(Table3, desc(Table3$coefficient_value))
Table3
Table0 <- data.frame(Table1[1], Table2[1], Table3[1])
Table0 <- Table0[1]
Table0
The results are consistent across the three tables and the variables are arranged based on output from the three tables.
Check collinearity
Do you observe any significance change and sign change? Check the correlation between Speed_ground and Speed_air. You may want to keep one of them in the model selection. Which one would you pick? Why?
model3 <- lm(new_FAA$distance ~ new_FAA$speed_ground + new_FAA$speed_air, data = new_FAA)
summary(speed_ground)
##
## Call:
## lm(formula = new_FAA$distance ~ new_FAA$speed_ground)
##
## Residuals:
## Min 1Q Median 3Q Max
## -897.09 -319.16 -72.09 210.83 1798.88
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1773.9407 67.8388 -26.15 <2e-16 ***
## new_FAA$speed_ground 41.4422 0.8302 49.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 448.1 on 829 degrees of freedom
## Multiple R-squared: 0.7504, Adjusted R-squared: 0.7501
## F-statistic: 2492 on 1 and 829 DF, p-value: < 2.2e-16
summary(speed_air)
##
## Call:
## lm(formula = new_FAA$distance ~ new_FAA$speed_air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -903.06 -320.50 -70.28 215.92 1817.21
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1930.1608 71.2487 -27.09 <2e-16 ***
## new_FAA$speed_air 42.7893 0.8616 49.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 449.8 on 829 degrees of freedom
## Multiple R-squared: 0.7485, Adjusted R-squared: 0.7481
## F-statistic: 2467 on 1 and 829 DF, p-value: < 2.2e-16
summary(model3)
##
## Call:
## lm(formula = new_FAA$distance ~ new_FAA$speed_ground + new_FAA$speed_air,
## data = new_FAA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -896.00 -317.85 -72.66 207.47 1796.14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1741.513 102.261 -17.030 <2e-16 ***
## new_FAA$speed_ground 49.644 19.365 2.564 0.0105 *
## new_FAA$speed_air -8.488 20.020 -0.424 0.6717
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 448.3 on 828 degrees of freedom
## Multiple R-squared: 0.7504, Adjusted R-squared: 0.7498
## F-statistic: 1245 on 2 and 828 DF, p-value: < 2.2e-16
cor(new_FAA$speed_air, new_FAA$speed_ground)
## [1] 0.9990797
The Model 3: LD ~ Speed_ground + Speed_air has the sign of the speed_air change from positive to negative. We get a positive correlation between the speed_ground and speed_air to establish a strong positive linear association between the variables, therefore collinearity exists in this situation.
Based on the Adjusted R-squared, I will keep the Model 1: LD ~ Speed_ground.
Variable selection based on our ranking in Table 0.
Calculate the R-squared for each model. Plot these R-squared values versus the number of variables p. What patterns do you observe?
Table0
model1 <- lm(new_FAA$distance ~ new_FAA$speed_ground)
model2 <- lm(new_FAA$distance ~ new_FAA$speed_ground + new_FAA$aircraft)
model3 <- lm(new_FAA$distance ~ new_FAA$speed_ground + new_FAA$aircraft + new_FAA$height)
model4 <- lm(new_FAA$distance ~ new_FAA$speed_ground + new_FAA$aircraft + new_FAA$height + new_FAA$pitch)
model5 <- lm(new_FAA$distance ~ new_FAA$speed_ground + new_FAA$aircraft + new_FAA$height + new_FAA$pitch + new_FAA$duration)
model6 <- lm(new_FAA$distance ~ new_FAA$speed_ground + new_FAA$aircraft + new_FAA$height + new_FAA$pitch + new_FAA$duration + new_FAA$no_pasg)
models<-c("model 1","model 2","model 3","model 4","model 5","model 6")
rsq_val <- c(summary(model1)$r.squared, summary(model2)$r.squared, summary(model3)$r.squared, summary(model4)$r.squared, summary(model5)$r.squared, summary(model6)$r.squared)
variable_count <-c(1,2,3,4,5,6)
compare <- data.frame(models, rsq_val, variable_count)
compare
#plot
ggplot(compare, aes(x = variable_count, y = rsq_val)) +
geom_point() +
labs(x = "Number of Variables", y = "R-squared Value")
As we add more variables into the model, the R-squared value increases.
adj_rsq_val <- c(summary(model1)$adj.r.squared, summary(model2)$adj.r.squared, summary(model3)$adj.r.squared, summary(model4)$adj.r.squared, summary(model5)$adj.r.squared, summary(model6)$adj.r.squared)
variable_count <-c(1,2,3,4,5,6)
compare <- data.frame(models, adj_rsq_val, variable_count)
compare
#plot
ggplot(compare, aes(x = variable_count, y = adj_rsq_val)) +
geom_point() +
labs(x = "Number of Variables", y = "Adjusted R-squared Value")
The adjusted R-squared parameter remains comparable after addition of 4 variables and beyond. The four variables are speed_ground,aircraft,height and pitch.
aic1 <- AIC(model1)
aic2 <- AIC(model2)
aic3 <- AIC(model3)
aic4 <- AIC(model4)
aic5 <- AIC(model5)
aic6 <- AIC(model6)
aic <- c(aic1, aic2, aic3, aic4, aic5, aic6)
variable_count <-c(1,2,3,4,5,6)
compare <- data.frame(models, aic, variable_count)
compare <- arrange(compare, -desc(aic))
compare
#plot
ggplot(compare, aes(x = variable_count, y = aic)) +
geom_point() +
labs(x = "Number of Variables", y = "AIC Value")
The Model 4 is chosen since it produces the lowest AIC value with 12095.05. The four variables are speed_ground, aircraft, height and pitch.
Steps 18 and 19 result in similar result output. We will select speed_ground, aircraft, height and pitch as variables to build a predictive model for LD.
Variable selection based on automate algorithm. Step 21. Use the R function “StepAIC” to perform forward variable selection. Compare the result with that in Step 19.
selection <- lm(new_FAA$distance~1)
step(selection, direction = "forward", scope = formula(model6))
## Start: AIC=11299.8
## new_FAA$distance ~ 1
##
## Df Sum of Sq RSS AIC
## + new_FAA$speed_ground 1 500382567 166457762 10148
## + new_FAA$aircraft 1 37818390 629021939 11253
## + new_FAA$height 1 6590108 660250221 11294
## + new_FAA$pitch 1 5050617 661789712 11296
## + new_FAA$duration 1 1685114 665155215 11300
## <none> 666840329 11300
## + new_FAA$no_pasg 1 210253 666630076 11302
##
## Step: AIC=10148.53
## new_FAA$distance ~ new_FAA$speed_ground
##
## Df Sum of Sq RSS AIC
## + new_FAA$aircraft 1 49848656 116609106 9854.8
## + new_FAA$height 1 14916377 151541385 10072.5
## + new_FAA$pitch 1 9765095 156692668 10100.3
## <none> 166457762 10148.5
## + new_FAA$no_pasg 1 207528 166250234 10149.5
## + new_FAA$duration 1 51669 166406094 10150.3
##
## Step: AIC=9854.77
## new_FAA$distance ~ new_FAA$speed_ground + new_FAA$aircraft
##
## Df Sum of Sq RSS AIC
## + new_FAA$height 1 15848830 100760276 9735.4
## + new_FAA$pitch 1 455453 116153653 9853.5
## <none> 116609106 9854.8
## + new_FAA$no_pasg 1 87171 116521935 9856.1
## + new_FAA$duration 1 8445 116600661 9856.7
##
## Step: AIC=9735.37
## new_FAA$distance ~ new_FAA$speed_ground + new_FAA$aircraft +
## new_FAA$height
##
## Df Sum of Sq RSS AIC
## + new_FAA$pitch 1 315259 100445017 9734.8
## <none> 100760276 9735.4
## + new_FAA$no_pasg 1 232003 100528273 9735.5
## + new_FAA$duration 1 3976 100756300 9737.3
##
## Step: AIC=9734.77
## new_FAA$distance ~ new_FAA$speed_ground + new_FAA$aircraft +
## new_FAA$height + new_FAA$pitch
##
## Df Sum of Sq RSS AIC
## <none> 100445017 9734.8
## + new_FAA$no_pasg 1 225608 100219409 9734.9
## + new_FAA$duration 1 6696 100438321 9736.7
##
## Call:
## lm(formula = new_FAA$distance ~ new_FAA$speed_ground + new_FAA$aircraft +
## new_FAA$height + new_FAA$pitch)
##
## Coefficients:
## (Intercept) new_FAA$speed_ground new_FAA$aircraft0
## -2664.32 42.43 481.27
## new_FAA$height new_FAA$pitch
## 14.09 39.61
The Step forward AIC algorithm suggests to use all the variables 9734.77. The variables picked in Step 21 are consistent with the ones found in Step 18-20.