Problem
The goal here is to reduce the risk of landing overrun.
Approach
We will study the factors that impact the landing distance of a commercial flight.
About the data
Landing data (landing distance and other parameters) from 950 commercial flights. This is not a real data set; it was simulated from statistical models. See the two Excel files ‘FAA-1.xls’ (800 flights) and ‘FAA-2.xls’ (150 flights).
Variable dictionary:
Initial Exploration of the data
## Part 1. Practice of modeling the landing distance using linear regression ##
################## Initial exploration of the data ##########################
################## Install and load packages #################################
library(plyr)   # load plyr before dplyr so that dplyr's verbs are not masked
library(dplyr)
library(readxl)
library(stats)
library(tidyverse)
detach("package:tidyverse", unload = TRUE)
library(ggplot2)
library(grid)
library(pdp)
library(purrr)
# Step 1
FAA1 <- read_excel("FAA1.xls")
FAA2 <- read_excel("FAA2.xls")
Structure of Data in file FAA1
str(FAA1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 800 obs. of 8 variables:
## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
Structure of Data in file FAA2
str(FAA2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 150 obs. of 7 variables:
## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
File FAA1 has 800 observations of 8 variables, while file FAA2 has 150 observations of 7 variables. The aircraft column is of character type; all other columns are numeric. Duration is not available in FAA2.
# Step 3
FAA_Merged <- rbind.fill(FAA1, FAA2)
FAA_Unique <- distinct(FAA_Merged, aircraft, no_pasg,
speed_ground, height, pitch,
distance, .keep_all = TRUE)
Total number of records in the merged data: 950
Total number of records after removing duplicates: 850
After merging, the combined data has 950 records. Of these, 100 were duplicates and were removed, so the resulting data frame has 850 records.
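The merge-and-deduplicate step can be illustrated on a toy example. The sketch below uses base R only instead of `rbind.fill()` and `distinct()`; the two small data frames and their values are invented, and `key_cols` is a hypothetical name for the columns that both files share.

```r
# Toy stand-ins for FAA1 (has duration) and FAA2 (no duration); values are invented.
faa1_toy <- data.frame(
  aircraft = c("boeing", "airbus", "boeing"),
  duration = c(98.5, 125.7, 112.0),
  distance = c(3370, 2988, 1145),
  stringsAsFactors = FALSE
)
faa2_toy <- data.frame(
  aircraft = c("boeing", "airbus"),
  distance = c(3370, 900),  # the first row repeats a flight from faa1_toy
  stringsAsFactors = FALSE
)

# Add the column that the second file lacks, filled with NA
# (this mimics what rbind.fill() does for missing columns).
faa2_toy$duration <- NA_real_
merged_toy <- rbind(faa1_toy, faa2_toy[, names(faa1_toy)])

# Flag duplicates using only the columns present in both files;
# the first occurrence (the one that has a duration) is kept.
key_cols <- c("aircraft", "distance")
unique_toy <- merged_toy[!duplicated(merged_toy[, key_cols]), ]

nrow(merged_toy)  # rows before de-duplication
nrow(unique_toy)  # rows after de-duplication
```

Comparing on the shared columns only is what lets a row from the second file match a row from the first even though its duration is missing.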
Structure of combined & unique data set (duplicates removed)
str(FAA_Unique)
## 'data.frame': 850 obs. of 8 variables:
## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
Summary of the combined & unique data set (duplicates removed)
summary(FAA_Unique)
## aircraft duration no_pasg speed_ground
## Length:850 Min. : 14.76 Min. :29.0 Min. : 27.74
## Class :character 1st Qu.:119.49 1st Qu.:55.0 1st Qu.: 65.90
## Mode :character Median :153.95 Median :60.0 Median : 79.64
## Mean :154.01 Mean :60.1 Mean : 79.45
## 3rd Qu.:188.91 3rd Qu.:65.0 3rd Qu.: 92.06
## Max. :305.62 Max. :87.0 Max. :141.22
## NA's :50
## speed_air height pitch distance
## Min. : 90.00 Min. :-3.546 Min. :2.284 Min. : 34.08
## 1st Qu.: 96.25 1st Qu.:23.314 1st Qu.:3.642 1st Qu.: 883.79
## Median :101.15 Median :30.093 Median :4.008 Median :1258.09
## Mean :103.80 Mean :30.144 Mean :4.009 Mean :1526.02
## 3rd Qu.:109.40 3rd Qu.:36.993 3rd Qu.:4.377 3rd Qu.:1936.95
## Max. :141.72 Max. :59.946 Max. :5.927 Max. :6533.05
## NA's :642
Histograms of all numeric variables present in the data
########### Data Cleaning & Further Exploration #####################
# Step 6
# histograms
hist(FAA_Unique$duration)
hist(FAA_Unique$no_pasg)
hist(FAA_Unique$speed_ground)
hist(FAA_Unique$speed_air)
hist(FAA_Unique$height)
hist(FAA_Unique$pitch)
hist(FAA_Unique$distance)
Boxplot of all variables present in the data
# Boxplots
boxplot(FAA_Unique[,2:7])
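# Remove abnormal observations; rows with a missing value in a variable are kept.
# The range checks use the vectorised `&`. The scalar `&&` tests a single value
# only, so it would not filter row by row and row counts can differ from such a run.
FAA_Cleaned <- FAA_Unique %>%
  filter(duration > 40 | is.na(duration)) %>%
  filter((speed_ground >= 30 & speed_ground <= 140) | is.na(speed_ground)) %>%
  filter((speed_air >= 30 & speed_air <= 140) | is.na(speed_air)) %>%
  filter(height >= 6 | is.na(height)) %>%
  filter(distance < 6000 | is.na(distance))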
After applying these filters, 17 abnormal records were removed, leaving 833 records.
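The "in range or missing" rule used by the filters can be checked on a small vector. This is a minimal base-R sketch with invented speeds, not the FAA data. It shows why the element-wise operators `&` and `|` are needed: `&&` looks at a single value, and recent R versions raise an error when it is given a longer vector.

```r
# Invented speeds: one too low, one too high, one missing.
speed <- c(107.9, 25.0, 150.0, NA, 85.8)

# Vectorised test: in range OR missing. `&` and `|` work element by element.
keep <- (speed >= 30 & speed <= 140) | is.na(speed)

keep         # one logical value per observation
speed[keep]  # observations that survive the filter, including the NA
```

The same logical vector can be passed to `filter()` or used for base-R row indexing.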
Histograms of all numeric variables present in the data after removing outliers
# Step 8
hist(FAA_Cleaned$duration)
hist(FAA_Cleaned$no_pasg)
hist(FAA_Cleaned$speed_ground)
hist(FAA_Cleaned$speed_air)
hist(FAA_Cleaned$height)
hist(FAA_Cleaned$pitch)
hist(FAA_Cleaned$distance)
Findings from the cleaned data set
Initial analysis for identifying important factors
# Step 10
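# Encode aircraft make as a number so it can be used in cor() and lm().
FAA_Cleaned$aircraft <- as.factor(FAA_Cleaned$aircraft)
levels(FAA_Cleaned$aircraft)[1] <- 0
levels(FAA_Cleaned$aircraft)[2] <- 1
# Note: as.numeric() on a factor returns the internal level codes, so the
# resulting values are 1 and 2 (in alphabetical order of make), not 0 and 1.
# Slopes are unaffected by this; only intercepts shift.
FAA_Cleaned$aircraft <- as.numeric(FAA_Cleaned$aircraft)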
cor_duration <- round(cor(FAA_Cleaned$distance, FAA_Cleaned$duration, use = "complete.obs"),2)
cor_no_pasg <- round(cor(FAA_Cleaned$distance, FAA_Cleaned$no_pasg, use = "complete.obs"),2)
cor_speed_ground <- round(cor(FAA_Cleaned$distance, FAA_Cleaned$speed_ground, use = "complete.obs"),2)
cor_speed_air <- round(cor(FAA_Cleaned$distance, FAA_Cleaned$speed_air, use = "complete.obs"),2)
cor_height <- round(cor(FAA_Cleaned$distance, FAA_Cleaned$height, use = "complete.obs"),2)
cor_pitch <- round(cor(FAA_Cleaned$distance, FAA_Cleaned$pitch, use = "complete.obs"),2)
Correlation of distance with:
There is a high correlation of landing distance with speed_ground and speed_air.
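The six `cor()` calls follow one pattern, so they can be written as a single loop. The sketch below runs on simulated data, not the FAA data; the data frame `toy`, its coefficients and the noise level are assumptions chosen only to make the example self-contained.

```r
set.seed(1)
n <- 200
# Simulated stand-in for the cleaned data (values are invented).
toy <- data.frame(
  speed_ground = rnorm(n, 80, 19),
  height       = rnorm(n, 30, 10),
  no_pasg      = round(rnorm(n, 60, 7))
)
toy$distance <- 40 * toy$speed_ground + 14 * toy$height + rnorm(n, 0, 300)
toy$height[1:10] <- NA  # a few missing values, as in the real data

# Correlation of distance with every other column, dropping incomplete pairs.
predictors <- setdiff(names(toy), "distance")
cors <- sapply(predictors, function(v) {
  round(cor(toy$distance, toy[[v]], use = "complete.obs"), 2)
})

# Rank predictors by the absolute size of their correlation with distance.
sort(abs(cors), decreasing = TRUE)
```

Sorting by absolute value puts strong negative correlations next to strong positive ones, which is what matters when screening predictors.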
Scatter plots of all variables with landing distance
# Step 11
p1 <- ggplot(FAA_Cleaned, aes( x = duration, y = distance)) + geom_point()
p2 <- ggplot(FAA_Cleaned, aes( x = no_pasg, y = distance)) + geom_point()
p3 <- ggplot(FAA_Cleaned, aes( x = speed_ground, y = distance)) + geom_point()
p4 <- ggplot(FAA_Cleaned, aes( x = speed_air, y = distance)) + geom_point()
p5 <- ggplot(FAA_Cleaned, aes( x = height, y = distance)) + geom_point()
p6 <- ggplot(FAA_Cleaned, aes( x = pitch, y = distance)) + geom_point()
p7 <- ggplot(FAA_Cleaned, aes( x = aircraft, y = distance)) + geom_point()
grid.arrange(p1,p2,p3, p4, p5, p6, p7, nrow = 2, ncol = 4)
The scatter plots are consistent with the correlations from the previous step: landing distance is strongly related to speed_ground and speed_air.
We are going to regress landing distance on each of the predictor variables to understand their individual effects on the response variable.
model_1 <- lm(distance ~ aircraft, FAA_Cleaned)
model_2 <- lm(distance ~ duration, FAA_Cleaned)
model_3 <- lm(distance ~ no_pasg, FAA_Cleaned)
model_4 <- lm(distance ~ speed_ground, FAA_Cleaned)
model_5 <- lm(distance ~ speed_air, FAA_Cleaned)
model_6 <- lm(distance ~ height, FAA_Cleaned)
model_7 <- lm(distance ~ pitch, FAA_Cleaned)
Summary of linear regression model with only aircraft make as predictor
summary(model_1)
##
## Call:
## lm(formula = distance ~ aircraft, data = FAA_Cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1281.6 -631.4 -230.4 388.2 3633.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 898.48 93.67 9.592 < 2e-16 ***
## aircraft 424.83 60.45 7.028 4.38e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 870.5 on 831 degrees of freedom
## Multiple R-squared: 0.0561, Adjusted R-squared: 0.05496
## F-statistic: 49.39 on 1 and 831 DF, p-value: 4.377e-12
Summary of linear regression model with only duration as predictor
summary(model_2)
##
## Call:
## lm(formula = distance ~ duration, data = FAA_Cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1463.7 -614.1 -273.7 408.9 3848.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1690.9392 108.3535 15.606 <2e-16 ***
## duration -0.9727 0.6681 -1.456 0.146
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 903 on 781 degrees of freedom
## (50 observations deleted due to missingness)
## Multiple R-squared: 0.002707, Adjusted R-squared: 0.00143
## F-statistic: 2.12 on 1 and 781 DF, p-value: 0.1458
Summary of linear regression model with only number of passengers as predictor
summary(model_3)
##
## Call:
## lm(formula = distance ~ no_pasg, data = FAA_Cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1445.0 -621.7 -270.1 415.3 3884.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1645.696 250.610 6.567 9.05e-11 ***
## no_pasg -2.065 4.142 -0.499 0.618
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 895.8 on 831 degrees of freedom
## Multiple R-squared: 0.000299, Adjusted R-squared: -0.000904
## F-statistic: 0.2486 on 1 and 831 DF, p-value: 0.6182
Summary of linear regression model with only speed ground as predictor
summary(model_4)
##
## Call:
## lm(formula = distance ~ speed_ground, data = FAA_Cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -904.18 -319.13 -75.69 213.51 1912.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1720.6284 68.3579 -25.17 <2e-16 ***
## speed_ground 40.8252 0.8374 48.75 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 456 on 831 degrees of freedom
## Multiple R-squared: 0.7409, Adjusted R-squared: 0.7406
## F-statistic: 2377 on 1 and 831 DF, p-value: < 2.2e-16
Summary of linear regression model with only speed air as predictor
summary(model_5)
##
## Call:
## lm(formula = distance ~ speed_air, data = FAA_Cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -776.21 -196.39 8.72 209.17 624.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5455.709 207.547 -26.29 <2e-16 ***
## speed_air 79.532 1.997 39.83 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.3 on 201 degrees of freedom
## (630 observations deleted due to missingness)
## Multiple R-squared: 0.8875, Adjusted R-squared: 0.887
## F-statistic: 1586 on 1 and 201 DF, p-value: < 2.2e-16
Summary of linear regression model with only height as predictor
summary(model_6)
##
## Call:
## lm(formula = distance ~ height, data = FAA_Cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1337.2 -605.8 -253.2 388.7 3933.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1243.121 101.054 12.302 < 2e-16 ***
## height 9.151 3.161 2.895 0.00389 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 891.5 on 831 degrees of freedom
## Multiple R-squared: 0.009988, Adjusted R-squared: 0.008796
## F-statistic: 8.383 on 1 and 831 DF, p-value: 0.003886
Summary of linear regression model with only pitch as predictor
summary(model_7)
##
## Call:
## lm(formula = distance ~ pitch, data = FAA_Cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1337.2 -643.6 -240.3 402.7 3840.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 933.2 237.6 3.928 9.28e-05 ***
## pitch 146.9 58.8 2.498 0.0127 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 892.6 on 831 degrees of freedom
## Multiple R-squared: 0.007453, Adjusted R-squared: 0.006259
## F-statistic: 6.24 on 1 and 831 DF, p-value: 0.01268
Overall Summary
S.No <- seq(1,7)
Variable_Name <- c(names(FAA_Cleaned)[1:7])
# Collect the slope's p-value (column 4) and estimate (column 1) from each model.
# The argument name is `scientific`; a misspelled name is silently ignored by format().
models <- list(model_1, model_2, model_3, model_4, model_5, model_6, model_7)
Size_of_p_value <- sapply(models, function(m)
  format(summary(m)$coefficients[2, 4], scientific = TRUE))
Value_of_regression_coefficient <- sapply(models, function(m)
  format(summary(m)$coefficients[2, 1], scientific = TRUE))
Regression_Model <- data.frame(S.No, Variable_Name, Size_of_p_value, Value_of_regression_coefficient)
library(kableExtra)
Regression_Model %>%
kable() %>%
kable_styling()
| S.No | Variable_Name | Size_of_p_value | Value_of_regression_coefficient |
|---|---|---|---|
| 1 | aircraft | 4.377142e-12 | 4.24835e+02 |
| 2 | duration | 0.1457886 | -0.972696 |
| 3 | no_pasg | 0.6182093 | -2.065074 |
| 4 | speed_ground | 5.951812e-246 | 40.82515 |
| 5 | speed_air | 2.500461e-97 | 79.5321 |
| 6 | height | 0.003885891 | 9.151432 |
| 7 | pitch | 0.01268068 | 146.8924 |
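A table like the one above can also be built in one pass by fitting each single-predictor model inside a loop. This is a sketch on simulated data, not the FAA data; `toy` and its generating coefficients are assumptions, and the column names simply mirror the ones used in this report.

```r
set.seed(2)
n <- 150
# Simulated stand-in for the cleaned data (values are invented).
toy <- data.frame(
  speed_ground = rnorm(n, 80, 19),
  height       = rnorm(n, 30, 10),
  no_pasg      = round(rnorm(n, 60, 7))
)
toy$distance <- 40 * toy$speed_ground + 14 * toy$height + rnorm(n, 0, 300)

predictors <- c("speed_ground", "height", "no_pasg")

# Fit distance ~ <one predictor> and keep the slope's estimate and p-value.
rows <- lapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "distance"), data = toy)
  coefs <- summary(fit)$coefficients
  data.frame(
    Variable_Name = v,
    Size_of_p_value = coefs[2, 4],
    Value_of_regression_coefficient = coefs[2, 1]
  )
})
single_predictor_table <- do.call(rbind, rows)

# Most significant predictors first.
single_predictor_table[order(single_predictor_table$Size_of_p_value), ]
```

Keeping the p-values numeric until the final display makes it possible to sort the table by significance.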
Standardizing predictor variables (Scaling)
After standardizing and taking all variables in the linear regression model, we get the following result:
FAA_Cleaned_2 <- as_tibble(scale(FAA_Cleaned[,2:7]))
FAA_Cleaned_3 <- cbind(FAA_Cleaned_2, FAA_Cleaned[,1], FAA_Cleaned[,8])
names(FAA_Cleaned_3)[7] <- "aircraft"
names(FAA_Cleaned_3)[8] <- "distance"
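# Regress distance on all standardized predictors plus aircraft make.
model_all_var_std <- lm(distance ~ ., data = FAA_Cleaned_3)
# Avoid naming an object `c`, which would mask base::c().
std_summary <- summary(model_all_var_std)
std_summary[["coefficients"]]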
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2166.713387 157.536639 13.7537109 3.402695e-30
## duration 6.169085 9.857646 0.6258172 5.321979e-01
## no_pasg -14.855202 10.332304 -1.4377434 1.521782e-01
## speed_ground -66.953619 121.131116 -0.5527367 5.811038e-01
## speed_air 832.908267 63.500684 13.1165244 2.714575e-28
## height 133.725373 10.155421 13.1678814 1.907468e-28
## pitch -7.099116 9.792516 -0.7249532 4.693869e-01
## aircraft 437.942766 21.262116 20.5973272 5.616468e-50
The p-values show that only speed_air, height and aircraft make are significant. speed_ground is not significant once speed_air is in the model.
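Standardizing a predictor changes the scale of its coefficient but not the fit. The sketch below shows this on simulated data; `x`, `y` and their parameters are invented stand-ins for a raw predictor and the landing distance.

```r
set.seed(3)
n <- 100
x <- rnorm(n, mean = 80, sd = 19)          # stand-in for a raw predictor
y <- -1700 + 40 * x + rnorm(n, sd = 300)   # stand-in for landing distance

raw_fit <- lm(y ~ x)
z <- as.numeric(scale(x))                  # (x - mean(x)) / sd(x)
std_fit <- lm(y ~ z)

# The standardized slope equals the raw slope times sd(x),
# and the intercept becomes the mean of the response.
c(raw_slope_times_sd = unname(coef(raw_fit)["x"]) * sd(x),
  std_slope          = unname(coef(std_fit)["z"]),
  std_intercept      = unname(coef(std_fit)[1]),
  mean_y             = mean(y))
```

This relationship explains why the speed_ground coefficient is 40.8252 in the raw single-predictor model but 770.76 in the standardized one, while the t value, R-squared and residual standard error are identical in both outputs.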
Model_LD_1 <- lm(FAA_Cleaned_3$distance ~ FAA_Cleaned_3$speed_ground)
Model_LD_2 <- lm(FAA_Cleaned_3$distance ~ FAA_Cleaned_3$speed_air)
Model_LD_3 <- lm(FAA_Cleaned_3$distance ~ FAA_Cleaned_3$speed_ground
+ FAA_Cleaned_3$speed_air)
Linear model with only speed ground as the predictor variable
summary(Model_LD_1)
##
## Call:
## lm(formula = FAA_Cleaned_3$distance ~ FAA_Cleaned_3$speed_ground)
##
## Residuals:
## Min 1Q Median 3Q Max
## -904.18 -319.13 -75.69 213.51 1912.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1521.71 15.80 96.31 <2e-16 ***
## FAA_Cleaned_3$speed_ground 770.76 15.81 48.75 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 456 on 831 degrees of freedom
## Multiple R-squared: 0.7409, Adjusted R-squared: 0.7406
## F-statistic: 2377 on 1 and 831 DF, p-value: < 2.2e-16
Linear model with only speed air as the predictor variable
summary(Model_LD_2)
##
## Call:
## lm(formula = FAA_Cleaned_3$distance ~ FAA_Cleaned_3$speed_air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -776.21 -196.39 8.72 209.17 624.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2774.67 19.39 143.07 <2e-16 ***
## FAA_Cleaned_3$speed_air 774.35 19.44 39.83 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.3 on 201 degrees of freedom
## (630 observations deleted due to missingness)
## Multiple R-squared: 0.8875, Adjusted R-squared: 0.887
## F-statistic: 1586 on 1 and 201 DF, p-value: < 2.2e-16
Linear model with both speed ground and speed air as the predictor variables
summary(Model_LD_3)
##
## Call:
## lm(formula = FAA_Cleaned_3$distance ~ FAA_Cleaned_3$speed_ground +
## FAA_Cleaned_3$speed_air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -819.74 -202.02 3.52 211.25 636.25
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3119.5 304.9 10.231 < 2e-16 ***
## FAA_Cleaned_3$speed_ground -271.4 239.5 -1.133 0.258
## FAA_Cleaned_3$speed_air 914.8 125.5 7.291 6.99e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 200 degrees of freedom
## (630 observations deleted due to missingness)
## Multiple R-squared: 0.8883, Adjusted R-squared: 0.8871
## F-statistic: 795 on 2 and 200 DF, p-value: < 2.2e-16
Once both speeds are in the model, the coefficient of speed_ground turns negative and is no longer significant (p = 0.258), while speed_air stays significant. This is a symptom of multicollinearity: speed_air and speed_ground are highly correlated, so the model cannot separate their effects. I will keep speed_ground and drop speed_air, because speed_air is missing for most flights (630 of the 833 cleaned records).
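One common way to quantify this kind of overlap is the variance inflation factor (VIF), which this report does not compute. The sketch below is an illustration on simulated data in which one speed is nearly a copy of the other; all variable values and coefficients are invented.

```r
set.seed(4)
n <- 200
speed_air    <- rnorm(n, 104, 10)
speed_ground <- speed_air + rnorm(n, 0, 1.5)  # nearly a copy of speed_air
distance     <- -5400 + 80 * speed_air + rnorm(n, 0, 280)

# VIF for a predictor = 1 / (1 - R^2), where R^2 comes from regressing
# that predictor on the other predictors in the model.
r2  <- summary(lm(speed_ground ~ speed_air))$r.squared
vif <- 1 / (1 - r2)

# Compare the standard error of speed_ground with and without speed_air.
both  <- summary(lm(distance ~ speed_ground + speed_air))$coefficients
alone <- summary(lm(distance ~ speed_ground))$coefficients

vif
both[, "Std. Error"]
alone[, "Std. Error"]
```

A large VIF goes together with an inflated standard error, which is why a coefficient can lose significance or even change sign when a near-duplicate predictor is added.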
Plot of R-squared vs number of variables
# Step 17
Model_New_LD_1 <- lm(distance ~ aircraft, data = FAA_Cleaned_3)
Model_New_LD_2 <- lm(distance ~ aircraft + speed_ground,
data = FAA_Cleaned_3)
Model_New_LD_3 <- lm(distance ~ aircraft + speed_ground + height,
data = FAA_Cleaned_3)
Model_New_LD_4 <- lm(distance ~ aircraft + speed_ground + height
+ no_pasg, data = FAA_Cleaned_3)
Model_New_LD_5 <- lm(distance ~ aircraft + speed_ground + height
+ no_pasg + pitch, data = FAA_Cleaned_3)
Model_New_LD_6 <- lm(distance ~ aircraft + speed_ground + height
+ no_pasg + pitch + duration, data = FAA_Cleaned_3)
r_sqr <- c()
r_sqr[1] <- summary(Model_New_LD_1)$r.squared
r_sqr[2] <- summary(Model_New_LD_2)$r.squared
r_sqr[3] <- summary(Model_New_LD_3)$r.squared
r_sqr[4] <- summary(Model_New_LD_4)$r.squared
r_sqr[5] <- summary(Model_New_LD_5)$r.squared
r_sqr[6] <- summary(Model_New_LD_6)$r.squared
par(mfrow = c(1,1))
plot(1:6, r_sqr, main = "R squared vs number of variables", type = "b")
Once aircraft and speed_ground are in the model, adding further predictors improves R-squared only slightly.
Plot of adjusted R-squared vs number of variables
adj_r_sqr <- c()
adj_r_sqr[1] <- summary(Model_New_LD_1)$adj.r.squared
adj_r_sqr[2] <- summary(Model_New_LD_2)$adj.r.squared
adj_r_sqr[3] <- summary(Model_New_LD_3)$adj.r.squared
adj_r_sqr[4] <- summary(Model_New_LD_4)$adj.r.squared
adj_r_sqr[5] <- summary(Model_New_LD_5)$adj.r.squared
adj_r_sqr[6] <- summary(Model_New_LD_6)$adj.r.squared
par(mfrow = c(1,1))
plot(1:6, adj_r_sqr, main = "Adjusted R squared vs number of variables", type = "b")
There is not much change in the graph compared to the last step where we used r-squared.
Plot of AIC values vs number of variables
# Step 19
aic <- c()
aic[1] <- AIC(Model_New_LD_1)
aic[2] <- AIC(Model_New_LD_2)
aic[3] <- AIC(Model_New_LD_3)
aic[4] <- AIC(Model_New_LD_4)
aic[5] <- AIC(Model_New_LD_5)
aic[6] <- AIC(Model_New_LD_6)
par(mfrow = c(1,1))
plot(1:6, aic, main = "AIC vs number of variables", type = "b")
Looking at the previous three graphs, I am selecting speed_ground, aircraft make and height to build a predictive model for landing distance.
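The three comparisons (R-squared, adjusted R-squared and AIC) can be collected in one table by looping over nested models. This is a sketch on simulated data; `toy`, `order_added` and the generating coefficients are assumptions made for the example, with aircraft coded as 1 or 2 and the other predictors on a standardized scale.

```r
set.seed(5)
n <- 300
# Simulated stand-in for the standardized data (values are invented).
toy <- data.frame(
  aircraft     = sample(1:2, n, replace = TRUE),
  speed_ground = rnorm(n),
  height       = rnorm(n),
  no_pasg      = rnorm(n)   # pure noise in this simulation
)
toy$distance <- 800 + 500 * toy$aircraft + 790 * toy$speed_ground +
  135 * toy$height + rnorm(n, 0, 350)

# Add predictors one at a time, in this order, and record the fit metrics.
order_added <- c("aircraft", "speed_ground", "height", "no_pasg")
metrics <- do.call(rbind, lapply(seq_along(order_added), function(k) {
  fit <- lm(reformulate(order_added[1:k], response = "distance"), data = toy)
  s <- summary(fit)
  data.frame(n_vars = k, r_sqr = s$r.squared,
             adj_r_sqr = s$adj.r.squared, aic = AIC(fit))
}))
metrics
```

R-squared can only go up as predictors are added, so adjusted R-squared and AIC, which penalize model size, are the more useful columns when deciding where to stop.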
Linear regression model based on automated stepwise forward selection method
Here, we define a null model (which has no predictors) and a full model (which has all the predictors).
library(MASS)  # provides stepAIC(); the step() function used below is in stats
null_model <- lm(distance ~ 1, data = FAA_Cleaned[,-5])
full_model <- lm(distance ~ ., data = FAA_Cleaned[,-5])
forward <- step(null_model,
scope = list( lower = null_model, upper = full_model),
direction = "forward" )
## Start: AIC=11325.29
## distance ~ 1
##
## Df Sum of Sq RSS AIC
## + speed_ground 1 474544306 163979281 9597.4
## + aircraft 1 33387572 605136015 10619.8
## + height 1 6947258 631576329 10653.3
## + pitch 1 2946920 635576667 10658.2
## + duration 1 1728559 636795028 10659.7
## <none> 638523587 10659.8
## + no_pasg 1 170040 638353546 10661.6
##
## Step: AIC=10202.16
## distance ~ speed_ground
##
## Df Sum of Sq RSS AIC
## + aircraft 1 48531001 115448279 9324.6
## + height 1 13348111 150631170 9532.9
## + pitch 1 8648983 155330297 9557.0
## <none> 163979281 9597.4
## + no_pasg 1 263028 163716252 9598.2
## + duration 1 36384 163942897 9599.2
##
## Step: AIC=9910.06
## distance ~ speed_ground + aircraft
##
## Df Sum of Sq RSS AIC
## + height 1 14355255 101093024 9222.7
## <none> 115448279 9324.6
## + pitch 1 207422 115240857 9325.2
## + no_pasg 1 94868 115353412 9326.0
## + duration 1 16936 115431343 9326.5
##
## Step: AIC=9801.03
## distance ~ speed_ground + aircraft + height
##
## Df Sum of Sq RSS AIC
## <none> 101093024 9222.7
## + no_pasg 1 205566 100887458 9223.1
## + pitch 1 90919 101002105 9224.0
## + duration 1 10794 101082231 9224.6
Forward selection with step() starts from the null model and, at every step, adds the predictor that lowers AIC the most. It stops at speed_ground, aircraft and height, the model with the minimum AIC.
speed_air and speed_ground are highly correlated, and speed_air has many missing values. We therefore left speed_air out of the final model, since its effect is largely captured by speed_ground.
After taking into consideration all the methods applied, we will select aircraft make, speed_ground and height for predicting landing distance.
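Forward selection can be reproduced on simulated data to see how the procedure behaves when the true predictors are known. The sketch below is an illustration only; `toy` and its coefficients are invented, the data has no missing values, and `trace = 0` just silences the step-by-step printout.

```r
set.seed(6)
n <- 300
# Simulated stand-in with complete cases (values are invented).
toy <- data.frame(
  aircraft     = sample(1:2, n, replace = TRUE),
  speed_ground = rnorm(n),
  height       = rnorm(n),
  pitch        = rnorm(n)   # pure noise in this simulation
)
toy$distance <- 800 + 500 * toy$aircraft + 790 * toy$speed_ground +
  135 * toy$height + rnorm(n, 0, 350)

null_toy <- lm(distance ~ 1, data = toy)
full_toy <- lm(distance ~ ., data = toy)

# Start from the intercept-only model and add one predictor per step
# for as long as AIC keeps decreasing.
forward_toy <- step(null_toy,
                    scope = list(lower = formula(null_toy),
                                 upper = formula(full_toy)),
                    direction = "forward", trace = 0)

attr(terms(forward_toy), "term.labels")  # predictors in the selected model
```

Because every candidate model must be fitted to the same rows, it is safest to run `step()` on data without missing values in the candidate predictors.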
The final model to predict landing distance is:
Model_New_LD_3
##
## Call:
## lm(formula = distance ~ aircraft + speed_ground + height, data = FAA_Cleaned_3)
##
## Coefficients:
## (Intercept) aircraft speed_ground height
## 783.1 503.5 789.7 135.2
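As a worked example, the printed coefficients can be turned into a small prediction function. This is a sketch under two assumptions: speed_ground and height are supplied as z-scores (the model was fitted on standardized predictors), and aircraft takes the values 1 or 2, the codes that `as.numeric()` returns for a two-level factor. The helper `predict_distance` is hypothetical and not part of the original analysis.

```r
# Coefficients as printed for Model_New_LD_3.
final_coef <- c(intercept = 783.1, aircraft = 503.5,
                speed_ground = 789.7, height = 135.2)

# Hypothetical helper: aircraft_code is 1 or 2; the other inputs are z-scores.
predict_distance <- function(aircraft_code, speed_ground_z, height_z) {
  unname(final_coef["intercept"] +
           final_coef["aircraft"] * aircraft_code +
           final_coef["speed_ground"] * speed_ground_z +
           final_coef["height"] * height_z)
}

predict_distance(1, 0, 0)  # aircraft code 1 at average ground speed and height
predict_distance(2, 1, 0)  # aircraft code 2, ground speed one SD above average
```

In practice `predict(Model_New_LD_3, newdata = ...)` does the same arithmetic using the unrounded coefficients.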