Introduction

Problem
The goal here is to reduce the risk of landing overrun.

Approach
We will study the factors that impact the landing distance of a commercial flight.

About the data
Landing data (landing distance and other parameters) from 950 commercial flights. This is not a real data set; it was simulated from statistical models. The data come in two Excel files, ‘FAA-1.xls’ (800 flights) and ‘FAA-2.xls’ (150 flights).

Variable dictionary:

  • Aircraft: The make of an aircraft (Boeing or Airbus).
  • Duration (in minutes): Flight duration between taking off and landing. The duration of a normal flight should always be greater than 40 minutes.
  • No_pasg: The number of passengers in a flight.
  • Speed_ground (in miles per hour): The ground speed of an aircraft when passing over the threshold of the runway. If its value is less than 30 mph or greater than 140 mph, the landing is considered abnormal.
  • Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of the runway. If its value is less than 30 mph or greater than 140 mph, the landing is considered abnormal.
  • Height (in meters): The height of an aircraft when it is passing over the threshold of the runway. The landing aircraft is required to be at least 6 meters high at the threshold of the runway.
  • Pitch (in degrees): Pitch angle of an aircraft when it is passing over the threshold of the runway.
  • Distance (in feet): The landing distance of an aircraft. More specifically, it refers to the distance between the threshold of the runway and the point where the aircraft can be fully stopped. The length of the airport runway is typically less than 6000 feet.

Exploratory Data Analysis

Initial Exploration of the data

## Part 1. Practice of modeling the landing distance using linear regression ##

##################  Initial exploration of the data  ##########################


################## Install and load packages  #################################
library(plyr)   # load plyr before dplyr so that dplyr's verbs are not masked
library(dplyr)
library(readxl)
library(stats)
library(tidyverse)
detach("package:tidyverse", unload = TRUE)
library(ggplot2)
library(grid)
library(pdp)
library(purrr)

# Step 1

FAA1 <- read_excel("FAA1.xls")
FAA2 <- read_excel("FAA2.xls")

Structure of Data in file FAA1

str(FAA1)
## Classes 'tbl_df', 'tbl' and 'data.frame':    800 obs. of  8 variables:
##  $ aircraft    : chr  "boeing" "boeing" "boeing" "boeing" ...
##  $ duration    : num  98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...


Structure of Data in file FAA2

str(FAA2)
## Classes 'tbl_df', 'tbl' and 'data.frame':    150 obs. of  7 variables:
##  $ aircraft    : chr  "boeing" "boeing" "boeing" "boeing" ...
##  $ no_pasg     : num  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...

FAA1 has 800 observations of 8 variables, while FAA2 has 150 observations of 7 variables. Apart from aircraft, which is a character column, every column is numeric. The duration column is not available in FAA2.

# Step 3

FAA_Merged <- rbind.fill(FAA1, FAA2)
FAA_Unique <- distinct(FAA_Merged, aircraft, no_pasg,
                       speed_ground, height, pitch,
                       distance, .keep_all = TRUE)

Total number of records in the merged data: 950
Total number of records after removing duplicates: 850

After merging, the combined data has 950 records. Of these, 100 are duplicates (rows that agree on aircraft, no_pasg, speed_ground, height, pitch and distance) and were removed, so the resulting data frame has 850 records.
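The merge-then-deduplicate step can be sketched on a toy example. The data frames `a` and `b` and their values are invented for illustration, and `dplyr::bind_rows()` is swapped in for `plyr::rbind.fill()` so that only one package is needed; both fill a column that is absent from one input with NA.

```r
library(dplyr)

# Toy stand-ins for FAA1 (has duration) and FAA2 (no duration); values are made up.
a <- data.frame(aircraft = c("boeing", "airbus"),
                duration = c(98.5, 125.7),
                distance = c(3370, 2988))
b <- data.frame(aircraft = c("boeing", "airbus"),
                distance = c(3370, 1145))

# Stack the two tables; rows from `b` get NA in the missing `duration` column.
merged <- bind_rows(a, b)

# Treat rows as duplicates when they agree on the listed key columns.
# .keep_all = TRUE keeps the first occurrence together with its `duration` value.
unique_rows <- distinct(merged, aircraft, distance, .keep_all = TRUE)

n_merged <- nrow(merged)
n_unique <- nrow(unique_rows)
```

Because the first occurrence is kept, a flight that appears in both files retains the copy that carries a duration value.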


Structure of combined & unique data set (duplicates removed)

str(FAA_Unique)
## 'data.frame':    850 obs. of  8 variables:
##  $ aircraft    : chr  "boeing" "boeing" "boeing" "boeing" ...
##  $ duration    : num  98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...


Summary of the combined & unique data set (duplicates removed)

summary(FAA_Unique)
##    aircraft            duration         no_pasg      speed_ground   
##  Length:850         Min.   : 14.76   Min.   :29.0   Min.   : 27.74  
##  Class :character   1st Qu.:119.49   1st Qu.:55.0   1st Qu.: 65.90  
##  Mode  :character   Median :153.95   Median :60.0   Median : 79.64  
##                     Mean   :154.01   Mean   :60.1   Mean   : 79.45  
##                     3rd Qu.:188.91   3rd Qu.:65.0   3rd Qu.: 92.06  
##                     Max.   :305.62   Max.   :87.0   Max.   :141.22  
##                     NA's   :50                                      
##    speed_air          height           pitch          distance      
##  Min.   : 90.00   Min.   :-3.546   Min.   :2.284   Min.   :  34.08  
##  1st Qu.: 96.25   1st Qu.:23.314   1st Qu.:3.642   1st Qu.: 883.79  
##  Median :101.15   Median :30.093   Median :4.008   Median :1258.09  
##  Mean   :103.80   Mean   :30.144   Mean   :4.009   Mean   :1526.02  
##  3rd Qu.:109.40   3rd Qu.:36.993   3rd Qu.:4.377   3rd Qu.:1936.95  
##  Max.   :141.72   Max.   :59.946   Max.   :5.927   Max.   :6533.05  
##  NA's   :642
  • The sample size is 850 with 8 variables. The summary of each variable is given in the output above.
  • There were 100 duplicates in the merged data, which have been removed.
  • Speed_air has 642 NA’s out of 850 records.
  • The column duration is not present in the FAA2 file, which accounts for the 50 NA’s in duration.

Data Cleaning & Further Exploration

Histograms of all numeric variables present in the data

########### Data Cleaning & Further Exploration #####################

# Step 6

# histograms

hist(FAA_Unique$duration)

hist(FAA_Unique$no_pasg)

hist(FAA_Unique$speed_ground)

hist(FAA_Unique$speed_air)

hist(FAA_Unique$height)

hist(FAA_Unique$pitch)

hist(FAA_Unique$distance)


Boxplot of all variables present in the data

# Boxplots

boxplot(FAA_Unique[,2:7])

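# Step 7: remove abnormal observations

# Use the element-wise `&` inside filter(); `&&` is not vectorized and does not
# test the condition row by row (it is an error for vectors in R >= 4.3).
FAA_Cleaned <- FAA_Unique %>% 
  filter(duration > 40 | is.na(duration)) %>%
  filter((speed_ground >= 30 & speed_ground <= 140) | is.na(speed_ground)) %>% 
  filter((speed_air >= 30 & speed_air <= 140) | is.na(speed_air)) %>% 
  filter(height >= 6 | is.na(height)) %>% 
  filter(distance < 6000 | is.na(distance))

After applying all the filters for abnormal values, 17 records were removed, leaving 833 records. These counts come from the run reported here, in which the speed conditions were written with `&&`; a rerun with `&` may drop a few more rows.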
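To see how many rows each rule removes, the rules can be evaluated one at a time before filtering. This is a minimal sketch in base R on an invented data frame `flights`; the thresholds are the ones from the variable dictionary, and a missing value is not treated as abnormal.

```r
# Toy data; the values are invented so that each rule is triggered once.
flights <- data.frame(
  duration     = c(120, 30, NA, 150, 100),
  speed_ground = c(80, 90, 145, 70, 85),
  height       = c(30, 25, 28, 4, 35),
  distance     = c(1500, 1200, 5000, 900, 6500)
)

# One logical vector per rule; TRUE marks an abnormal value.
# The !is.na() guard means a missing value is never flagged.
abnormal <- with(flights, list(
  duration     = !is.na(duration) & duration <= 40,
  speed_ground = !is.na(speed_ground) & (speed_ground < 30 | speed_ground > 140),
  height       = !is.na(height) & height < 6,
  distance     = !is.na(distance) & distance >= 6000
))

# Number of rows flagged by each rule.
flag_counts <- sapply(abnormal, sum)

# A row is kept only if no rule flags it.
keep <- !Reduce(`|`, abnormal)
cleaned <- flights[keep, ]
```

Keeping the per-rule counts makes it easy to report which threshold is responsible for most of the removed rows.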

Histograms of all numeric variables present in the data after removing outliers

# Step 8

hist(FAA_Cleaned$duration)

hist(FAA_Cleaned$no_pasg)

hist(FAA_Cleaned$speed_ground)

hist(FAA_Cleaned$speed_air)

hist(FAA_Cleaned$height)

hist(FAA_Cleaned$pitch)

hist(FAA_Cleaned$distance)

Findings from the cleaned data set

  • speed_air and speed_ground are collinear.
  • speed_air is recorded only for speeds of 90 mph and above; it is missing (NA) for 630 of the 833 records.
  • All variables other than distance and speed_air are approximately normally distributed.
  • speed_air is right-skewed (its mean exceeds its median and it has a long upper tail), as can be seen from the histogram.
  • 100 duplicates were found and removed.
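A numeric skewness estimate is one way to back up what the histograms suggest. The sketch below uses a hypothetical helper, `sample_skewness()`, and simulated vectors rather than the FAA data; a positive value indicates a long right tail and a negative value a long left tail.

```r
# Moment-based sample skewness (hypothetical helper, not part of the analysis above).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]                       # ignore missing values
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5   # third central moment over sd^3
}

set.seed(1)
right_tailed <- 90 + rexp(200, rate = 0.1)          # simulated, bounded below at 90
symmetric    <- rnorm(200, mean = 30, sd = 10)      # simulated, roughly bell-shaped

skew_right <- sample_skewness(right_tailed)
skew_sym   <- sample_skewness(symmetric)
```

A variable that is cut off from below, as speed_air is at 90, will typically show a clearly positive value.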

Important Factors Affecting Landing Distance

Initial analysis for identifying important factors

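# Step 10

# Recode aircraft as a number. Note that as.numeric() on a factor returns the
# internal level codes, so the result is 1 (first level) and 2 (second level),
# not the 0/1 labels assigned below.
FAA_Cleaned$aircraft <- as.factor(FAA_Cleaned$aircraft)
levels(FAA_Cleaned$aircraft)[1] <- 0
levels(FAA_Cleaned$aircraft)[2] <- 1
FAA_Cleaned$aircraft <- as.numeric(FAA_Cleaned$aircraft)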


cor_duration <- round(cor(FAA_Cleaned$distance, FAA_Cleaned$duration, use = "complete.obs"),2)
cor_no_pasg <- round(cor(FAA_Cleaned$distance, FAA_Cleaned$no_pasg, use = "complete.obs"),2)
cor_speed_ground <- round(cor(FAA_Cleaned$distance, FAA_Cleaned$speed_ground, use = "complete.obs"),2)
cor_speed_air <- round(cor(FAA_Cleaned$distance, FAA_Cleaned$speed_air, use = "complete.obs"),2)
cor_height <- round(cor(FAA_Cleaned$distance, FAA_Cleaned$height, use = "complete.obs"),2)
cor_pitch <- round(cor(FAA_Cleaned$distance, FAA_Cleaned$pitch, use = "complete.obs"),2)

Correlation of distance with:

  • Duration: -0.05
  • Number of passengers: -0.02
  • Speed_ground: 0.86
  • Speed_air: 0.94
  • Height: 0.1
  • Pitch: 0.09

There is a high correlation of landing distance with speed_ground and speed_air.
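The six separate `cor()` calls can also be collapsed into one. The following is a sketch on simulated data (the data frame `toy` and its coefficients are made up), showing that `cor()` accepts a data frame of predictors together with a response vector.

```r
set.seed(42)
n <- 100
toy <- data.frame(
  speed_ground = rnorm(n, 80, 19),
  height       = rnorm(n, 30, 10),
  no_pasg      = round(rnorm(n, 60, 7))
)
# Simulated response driven mostly by speed_ground (made-up coefficients).
toy$distance <- -1700 + 40 * toy$speed_ground + 9 * toy$height + rnorm(n, 0, 300)

predictors <- setdiff(names(toy), "distance")

# cor() returns a one-column matrix here; [, 1] turns it into a named vector.
# Pairwise deletion handles NAs that are scattered across columns.
cor_with_distance <- round(
  cor(toy[, predictors], toy$distance, use = "pairwise.complete.obs")[, 1], 2)

# Sort by absolute strength so the dominant predictor comes first.
cor_sorted <- cor_with_distance[order(-abs(cor_with_distance))]
```

Sorting by absolute value puts strong negative correlations next to strong positive ones, which is usually what matters when screening predictors.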

Scatter plots of all variables with landing distance

# Step 11

p1 <- ggplot(FAA_Cleaned, aes( x = duration, y = distance)) + geom_point()
p2 <- ggplot(FAA_Cleaned, aes( x = no_pasg, y = distance)) + geom_point()
p3 <- ggplot(FAA_Cleaned, aes( x = speed_ground, y = distance)) + geom_point()
p4 <- ggplot(FAA_Cleaned, aes( x = speed_air, y = distance)) + geom_point()
p5 <- ggplot(FAA_Cleaned, aes( x = height, y = distance)) + geom_point()
p6 <- ggplot(FAA_Cleaned, aes( x = pitch, y = distance)) + geom_point()
p7 <- ggplot(FAA_Cleaned, aes( x = aircraft, y = distance)) + geom_point()

grid.arrange(p1,p2,p3, p4, p5, p6, p7, nrow = 2, ncol = 4)

The scatter plots are consistent with the correlations computed in the previous step: landing distance is strongly correlated with speed_ground and speed_air, and shows no clear relationship with the other variables.

Linear Regression

We are going to regress landing distance on each of the predictor variables to understand their individual effects on the response variable.

model_1 <- lm(distance ~ aircraft, FAA_Cleaned)
model_2 <- lm(distance ~ duration, FAA_Cleaned)
model_3 <- lm(distance ~ no_pasg, FAA_Cleaned)
model_4 <- lm(distance ~ speed_ground, FAA_Cleaned)
model_5 <- lm(distance ~ speed_air, FAA_Cleaned)
model_6 <- lm(distance ~ height, FAA_Cleaned)
model_7 <- lm(distance ~ pitch, FAA_Cleaned)
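The same set of single-predictor fits can be produced in a loop. This sketch runs on simulated data (`toy` and its coefficients are invented) and uses `reformulate()` to build each formula from a column name.

```r
set.seed(7)
n <- 120
toy <- data.frame(
  speed_ground = rnorm(n, 80, 19),
  height       = rnorm(n, 30, 10),
  pitch        = rnorm(n, 4, 0.5)
)
toy$distance <- -1700 + 40 * toy$speed_ground + 9 * toy$height + rnorm(n, 0, 300)

predictors <- c("speed_ground", "height", "pitch")

# Fit distance ~ <predictor> for each name.
fits <- lapply(predictors, function(p) {
  lm(reformulate(p, response = "distance"), data = toy)
})
names(fits) <- predictors

# Row 2 of the coefficient table belongs to the predictor (row 1 is the intercept).
single_var_table <- data.frame(
  variable    = predictors,
  coefficient = sapply(fits, function(m) summary(m)$coefficients[2, "Estimate"]),
  p_value     = sapply(fits, function(m) summary(m)$coefficients[2, "Pr(>|t|)"]),
  row.names   = NULL
)
```

Keeping the fits in a named list also makes it straightforward to build a summary table of coefficients and p-values afterwards.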


Summary of linear regression model with only aircraft make as predictor

summary(model_1)
## 
## Call:
## lm(formula = distance ~ aircraft, data = FAA_Cleaned)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1281.6  -631.4  -230.4   388.2  3633.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   898.48      93.67   9.592  < 2e-16 ***
## aircraft      424.83      60.45   7.028 4.38e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 870.5 on 831 degrees of freedom
## Multiple R-squared:  0.0561, Adjusted R-squared:  0.05496 
## F-statistic: 49.39 on 1 and 831 DF,  p-value: 4.377e-12


Summary of linear regression model with only duration as predictor

summary(model_2)
## 
## Call:
## lm(formula = distance ~ duration, data = FAA_Cleaned)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1463.7  -614.1  -273.7   408.9  3848.5 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1690.9392   108.3535  15.606   <2e-16 ***
## duration      -0.9727     0.6681  -1.456    0.146    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 903 on 781 degrees of freedom
##   (50 observations deleted due to missingness)
## Multiple R-squared:  0.002707,   Adjusted R-squared:  0.00143 
## F-statistic:  2.12 on 1 and 781 DF,  p-value: 0.1458


Summary of linear regression model with only number of passengers as predictor

summary(model_3)
## 
## Call:
## lm(formula = distance ~ no_pasg, data = FAA_Cleaned)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1445.0  -621.7  -270.1   415.3  3884.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1645.696    250.610   6.567 9.05e-11 ***
## no_pasg       -2.065      4.142  -0.499    0.618    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 895.8 on 831 degrees of freedom
## Multiple R-squared:  0.000299,   Adjusted R-squared:  -0.000904 
## F-statistic: 0.2486 on 1 and 831 DF,  p-value: 0.6182


Summary of linear regression model with only speed ground as predictor

summary(model_4)
## 
## Call:
## lm(formula = distance ~ speed_ground, data = FAA_Cleaned)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -904.18 -319.13  -75.69  213.51 1912.03 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1720.6284    68.3579  -25.17   <2e-16 ***
## speed_ground    40.8252     0.8374   48.75   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 456 on 831 degrees of freedom
## Multiple R-squared:  0.7409, Adjusted R-squared:  0.7406 
## F-statistic:  2377 on 1 and 831 DF,  p-value: < 2.2e-16


Summary of linear regression model with only speed air as predictor

summary(model_5)
## 
## Call:
## lm(formula = distance ~ speed_air, data = FAA_Cleaned)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -776.21 -196.39    8.72  209.17  624.34 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5455.709    207.547  -26.29   <2e-16 ***
## speed_air      79.532      1.997   39.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.3 on 201 degrees of freedom
##   (630 observations deleted due to missingness)
## Multiple R-squared:  0.8875, Adjusted R-squared:  0.887 
## F-statistic:  1586 on 1 and 201 DF,  p-value: < 2.2e-16


Summary of linear regression model with only height as predictor

summary(model_6)
## 
## Call:
## lm(formula = distance ~ height, data = FAA_Cleaned)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1337.2  -605.8  -253.2   388.7  3933.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1243.121    101.054  12.302  < 2e-16 ***
## height         9.151      3.161   2.895  0.00389 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 891.5 on 831 degrees of freedom
## Multiple R-squared:  0.009988,   Adjusted R-squared:  0.008796 
## F-statistic: 8.383 on 1 and 831 DF,  p-value: 0.003886


Summary of linear regression model with only pitch as predictor

summary(model_7)
## 
## Call:
## lm(formula = distance ~ pitch, data = FAA_Cleaned)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1337.2  -643.6  -240.3   402.7  3840.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    933.2      237.6   3.928 9.28e-05 ***
## pitch          146.9       58.8   2.498   0.0127 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 892.6 on 831 degrees of freedom
## Multiple R-squared:  0.007453,   Adjusted R-squared:  0.006259 
## F-statistic:  6.24 on 1 and 831 DF,  p-value: 0.01268


Overall Summary

models <- list(model_1, model_2, model_3, model_4, model_5, model_6, model_7)

S.No <- seq_along(models)

Variable_Name <- names(FAA_Cleaned)[1:7]

# Row 2 of each coefficient table is the predictor;
# column 4 holds the p-value and column 1 the estimate.
Size_of_p_value <- sapply(models, function(m)
  format(summary(m)$coefficients[2, 4], scientific = TRUE))

Value_of_regression_coefficient <- sapply(models, function(m)
  format(summary(m)$coefficients[2, 1], scientific = TRUE))

Regression_Model <- data.frame(S.No, Variable_Name, Size_of_p_value, Value_of_regression_coefficient)

library(kableExtra)
Regression_Model %>% 
  kable() %>% 
  kable_styling()

S.No  Variable_Name  Size_of_p_value  Value_of_regression_coefficient
1     aircraft       4.377142e-12     4.24835e+02
2     duration       1.457886e-01     -9.72696e-01
3     no_pasg        6.182093e-01     -2.065074e+00
4     speed_ground   5.951812e-246    4.082515e+01
5     speed_air      2.500461e-97     7.95321e+01
6     height         3.885891e-03     9.151432e+00
7     pitch          1.268068e-02     1.468924e+02

Standardizing predictor variables (Scaling)

After standardizing and taking all variables in the linear regression model, we get the following result:

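# Standardize the six numeric predictors (columns 2 to 7), then re-attach
# aircraft (column 1) and distance (column 8) unscaled.
FAA_Cleaned_2 <- as_tibble(scale(FAA_Cleaned[,2:7]))
FAA_Cleaned_3 <- cbind(FAA_Cleaned_2, FAA_Cleaned[,1], FAA_Cleaned[,8])
names(FAA_Cleaned_3)[7] <- "aircraft"
names(FAA_Cleaned_3)[8] <- "distance"

model_all_var_std <- lm(FAA_Cleaned_3$distance ~ ., FAA_Cleaned_3)
# Avoid naming the summary `c`, which would shadow base::c.
model_all_var_std_summary <- summary(model_all_var_std)
model_all_var_std_summary[["coefficients"]]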
##                 Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)  2166.713387 157.536639 13.7537109 3.402695e-30
## duration        6.169085   9.857646  0.6258172 5.321979e-01
## no_pasg       -14.855202  10.332304 -1.4377434 1.521782e-01
## speed_ground  -66.953619 121.131116 -0.5527367 5.811038e-01
## speed_air     832.908267  63.500684 13.1165244 2.714575e-28
## height        133.725373  10.155421 13.1678814 1.907468e-28
## pitch          -7.099116   9.792516 -0.7249532 4.693869e-01
## aircraft      437.942766  21.262116 20.5973272 5.616468e-50

We can see from the p-values that speed_air, height and aircraft make are significant.
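Standardizing a predictor changes the scale of its coefficient but not the fit. As a small worked check on simulated data (the variables `x` and `y` and their coefficients are made up), the standardized slope equals the raw slope multiplied by the predictor's standard deviation, and the intercept becomes the mean of the response.

```r
set.seed(3)
n <- 200
x <- rnorm(n, mean = 80, sd = 19)        # stand-in for a predictor such as speed_ground
y <- -1700 + 40 * x + rnorm(n, 0, 300)   # made-up response

raw_fit <- lm(y ~ x)

# scale() centers to mean 0 and rescales to standard deviation 1.
x_std   <- as.numeric(scale(x))
std_fit <- lm(y ~ x_std)

raw_slope     <- unname(coef(raw_fit)["x"])
std_slope     <- unname(coef(std_fit)["x_std"])
std_intercept <- unname(coef(std_fit)["(Intercept)"])

# Identities that hold for simple linear regression:
#   std_slope     == raw_slope * sd(x)
#   std_intercept == mean(y)
slope_ratio <- std_slope / raw_slope     # recovers sd(x)
```

This is why standardized coefficients can be compared with each other: each one is the change in distance per standard deviation of its predictor.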

Checking Collinearity

Model_LD_1 <- lm(FAA_Cleaned_3$distance ~ FAA_Cleaned_3$speed_ground)
  Model_LD_2 <- lm(FAA_Cleaned_3$distance ~ FAA_Cleaned_3$speed_air)
  Model_LD_3 <- lm(FAA_Cleaned_3$distance ~ FAA_Cleaned_3$speed_ground 
                                            + FAA_Cleaned_3$speed_air)


Linear model with only speed ground as the predictor variable

summary(Model_LD_1)
## 
## Call:
## lm(formula = FAA_Cleaned_3$distance ~ FAA_Cleaned_3$speed_ground)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -904.18 -319.13  -75.69  213.51 1912.03 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 1521.71      15.80   96.31   <2e-16 ***
## FAA_Cleaned_3$speed_ground   770.76      15.81   48.75   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 456 on 831 degrees of freedom
## Multiple R-squared:  0.7409, Adjusted R-squared:  0.7406 
## F-statistic:  2377 on 1 and 831 DF,  p-value: < 2.2e-16


Linear model with only speed air as the predictor variable

summary(Model_LD_2)
## 
## Call:
## lm(formula = FAA_Cleaned_3$distance ~ FAA_Cleaned_3$speed_air)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -776.21 -196.39    8.72  209.17  624.34 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              2774.67      19.39  143.07   <2e-16 ***
## FAA_Cleaned_3$speed_air   774.35      19.44   39.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.3 on 201 degrees of freedom
##   (630 observations deleted due to missingness)
## Multiple R-squared:  0.8875, Adjusted R-squared:  0.887 
## F-statistic:  1586 on 1 and 201 DF,  p-value: < 2.2e-16


Linear model with both speed ground and speed air as the predictor variable

summary(Model_LD_3)
## 
## Call:
## lm(formula = FAA_Cleaned_3$distance ~ FAA_Cleaned_3$speed_ground + 
##     FAA_Cleaned_3$speed_air)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -819.74 -202.02    3.52  211.25  636.25 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  3119.5      304.9  10.231  < 2e-16 ***
## FAA_Cleaned_3$speed_ground   -271.4      239.5  -1.133    0.258    
## FAA_Cleaned_3$speed_air       914.8      125.5   7.291 6.99e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 200 degrees of freedom
##   (630 observations deleted due to missingness)
## Multiple R-squared:  0.8883, Adjusted R-squared:  0.8871 
## F-statistic:   795 on 2 and 200 DF,  p-value: < 2.2e-16

When both speeds enter the model together, the coefficient of speed_ground flips from positive to negative and is no longer significant (p = 0.258), which is a typical symptom of collinearity between the two predictors. I will keep speed_ground and drop speed_air, because speed_air is missing for most flights (630 of the 833 records) and the two variables are highly correlated.
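The sign flip can be quantified with a variance inflation factor (VIF). The sketch below uses simulated speeds, not the FAA data, and the noise levels are assumptions; with only two predictors, the VIF of either one is 1 / (1 - r^2), where r is their correlation.

```r
set.seed(11)
n <- 200
speed_ground <- rnorm(n, 100, 10)
# Simulated air speed that tracks ground speed closely (made-up noise level).
speed_air <- speed_ground + rnorm(n, 0, 1.5)
distance  <- -5000 + 80 * speed_air + rnorm(n, 0, 280)

# Pairwise correlation between the two predictors.
r <- cor(speed_ground, speed_air)

# With two predictors, the VIF of either one is 1 / (1 - r^2).
vif_two <- 1 / (1 - r^2)

# Same quantity from the auxiliary regression of one predictor on the other.
aux_r2  <- summary(lm(speed_ground ~ speed_air))$r.squared
vif_aux <- 1 / (1 - aux_r2)

# The standard error of speed_ground inflates once speed_air joins the model.
se_alone <- summary(lm(distance ~ speed_ground))$coefficients[2, "Std. Error"]
se_joint <- summary(lm(distance ~ speed_ground + speed_air))$coefficients[2, "Std. Error"]
```

A large VIF means the coefficient is estimated very imprecisely when both variables are present, so its sign and significance become unstable.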

Variable Selection

Plot of R-squared vs number of variables

# Step 17
  
  Model_New_LD_1 <- lm(distance ~ aircraft, data = FAA_Cleaned_3)
  Model_New_LD_2 <- lm(distance ~ aircraft + speed_ground, 
                       data = FAA_Cleaned_3)
  Model_New_LD_3 <- lm(distance ~ aircraft + speed_ground + height,
                       data = FAA_Cleaned_3)
  Model_New_LD_4 <- lm(distance ~ aircraft + speed_ground + height 
                       + no_pasg, data = FAA_Cleaned_3)
  Model_New_LD_5 <- lm(distance ~ aircraft + speed_ground + height 
                       + no_pasg + pitch, data = FAA_Cleaned_3)
  Model_New_LD_6 <- lm(distance ~ aircraft + speed_ground + height 
                       + no_pasg + pitch + duration, data = FAA_Cleaned_3)
  r_sqr <- c()
  r_sqr[1] <- summary(Model_New_LD_1)$r.squared
  r_sqr[2] <- summary(Model_New_LD_2)$r.squared
  r_sqr[3] <- summary(Model_New_LD_3)$r.squared
  r_sqr[4] <- summary(Model_New_LD_4)$r.squared
  r_sqr[5] <- summary(Model_New_LD_5)$r.squared
  r_sqr[6] <- summary(Model_New_LD_6)$r.squared
  
  par(mfrow = c(1,1))
  plot(1:6, r_sqr, main = "R squared vs number of variables", type = "b")


We can see that once aircraft and speed_ground are in the model, adding more predictors improves R-squared only slightly.


Plot of adjusted R-squared vs number of variables

adj_r_sqr <- c()
  adj_r_sqr[1] <- summary(Model_New_LD_1)$adj.r.squared
  adj_r_sqr[2] <- summary(Model_New_LD_2)$adj.r.squared
  adj_r_sqr[3] <- summary(Model_New_LD_3)$adj.r.squared
  adj_r_sqr[4] <- summary(Model_New_LD_4)$adj.r.squared
  adj_r_sqr[5] <- summary(Model_New_LD_5)$adj.r.squared
  adj_r_sqr[6] <- summary(Model_New_LD_6)$adj.r.squared
  
  par(mfrow = c(1,1))
  plot(1:6, adj_r_sqr, main = "Adjusted R squared vs number of variables", type = "b")


The adjusted R-squared plot looks much the same as the R-squared plot from the previous step.


Plot of AIC values vs number of variables

# Step 19
  
  aic <- c()
  aic[1] <- AIC(Model_New_LD_1)
  aic[2] <- AIC(Model_New_LD_2)
  aic[3] <- AIC(Model_New_LD_3)
  aic[4] <- AIC(Model_New_LD_4)
  aic[5] <- AIC(Model_New_LD_5)
  aic[6] <- AIC(Model_New_LD_6)
  
  par(mfrow = c(1,1))
  plot(1:6, aic, main = "AIC vs number of variables", type = "b")

Looking at the previous three graphs, I am selecting speed_ground, aircraft make and height to build a predictive model for landing distance.
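The three criteria can be collected in one table instead of three separate vectors. This is a sketch on simulated data (`toy` and its coefficients are invented). AIC values are only comparable between models fitted to the same rows, so the toy data contains no missing values.

```r
set.seed(5)
n <- 150
toy <- data.frame(
  aircraft     = rbinom(n, 1, 0.5),
  speed_ground = rnorm(n, 80, 19),
  height       = rnorm(n, 30, 10),
  no_pasg      = round(rnorm(n, 60, 7))
)
toy$distance <- -2500 + 480 * toy$aircraft + 42 * toy$speed_ground +
  14 * toy$height + rnorm(n, 0, 350)

# Predictors in the order they are added, one at a time.
add_order <- c("aircraft", "speed_ground", "height", "no_pasg")

# Model k uses the first k predictors.
nested <- lapply(seq_along(add_order), function(k) {
  lm(reformulate(add_order[1:k], response = "distance"), data = toy)
})

criteria <- data.frame(
  n_vars    = seq_along(add_order),
  r_sqr     = sapply(nested, function(m) summary(m)$r.squared),
  adj_r_sqr = sapply(nested, function(m) summary(m)$adj.r.squared),
  aic       = sapply(nested, AIC)
)
```

R-squared can never decrease as predictors are added, whereas adjusted R-squared and AIC penalize extra terms, so they are the more informative guides for choosing the model size.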

Linear regression model based on automated stepwise forward selection method

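Here, we define a null model (which has no predictors) and a full model (which has all the predictors except speed_air).

library(MASS)
# Column 5 is speed_air; it is dropped because most of its values are missing.
null_model <- lm(distance ~ 1, data = FAA_Cleaned[,-5])
full_model <- lm(distance ~ ., data = FAA_Cleaned[,-5])
# step() is in the stats package; direction = "forward" only ever adds terms.
forward <- step(null_model,
                scope = list(lower = null_model, upper = full_model),
                direction = "forward")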
## Start:  AIC=11325.29
## distance ~ 1
## 
##                Df Sum of Sq       RSS     AIC
## + speed_ground  1 474544306 163979281  9597.4
## + aircraft      1  33387572 605136015 10619.8
## + height        1   6947258 631576329 10653.3
## + pitch         1   2946920 635576667 10658.2
## + duration      1   1728559 636795028 10659.7
## <none>                      638523587 10659.8
## + no_pasg       1    170040 638353546 10661.6
## 
## Step:  AIC=10202.16
## distance ~ speed_ground
## 
##            Df Sum of Sq       RSS    AIC
## + aircraft  1  48531001 115448279 9324.6
## + height    1  13348111 150631170 9532.9
## + pitch     1   8648983 155330297 9557.0
## <none>                  163979281 9597.4
## + no_pasg   1    263028 163716252 9598.2
## + duration  1     36384 163942897 9599.2
## 
## Step:  AIC=9910.06
## distance ~ speed_ground + aircraft
## 
##            Df Sum of Sq       RSS    AIC
## + height    1  14355255 101093024 9222.7
## <none>                  115448279 9324.6
## + pitch     1    207422 115240857 9325.2
## + no_pasg   1     94868 115353412 9326.0
## + duration  1     16936 115431343 9326.5
## 
## Step:  AIC=9801.03
## distance ~ speed_ground + aircraft + height
## 
##            Df Sum of Sq       RSS    AIC
## <none>                  101093024 9222.7
## + no_pasg   1    205566 100887458 9223.1
## + pitch     1     90919 101002105 9224.0
## + duration  1     10794 101082231 9224.6

Forward selection with step() starts from the null model and, at each step, adds the predictor that lowers AIC the most. It stops at speed_ground, aircraft and height, the model with the lowest AIC.
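Forward selection can be reproduced quietly and the chosen terms extracted programmatically. The sketch below runs on simulated data (`toy` and its coefficients are made up) and passes `trace = 0` to `step()` to suppress the step-by-step printout.

```r
set.seed(9)
n <- 150
toy <- data.frame(
  speed_ground = rnorm(n, 80, 19),
  height       = rnorm(n, 30, 10),
  no_pasg      = round(rnorm(n, 60, 7))
)
toy$distance <- -2500 + 42 * toy$speed_ground + 14 * toy$height + rnorm(n, 0, 350)

null_fit <- lm(distance ~ 1, data = toy)
full_fit <- lm(distance ~ ., data = toy)

# Forward search: start from the intercept-only model and, at each step,
# add the term that lowers AIC the most; stop when no addition helps.
forward_fit <- step(null_fit,
                    scope = list(lower = formula(null_fit), upper = formula(full_fit)),
                    direction = "forward",
                    trace = 0)

# Terms in the order in which they were added.
selected <- attr(terms(forward_fit), "term.labels")
```

The order of `selected` mirrors the order of the steps, so the strongest predictor appears first.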

Executive Summary

  • Speed air and speed ground were highly correlated, and speed air had a lot of missing values. So, we dropped speed air from the final model, since speed ground captures most of the same information.

  • After taking into consideration all the methods applied, we will select aircraft make, speed_ground and height for predicting landing distance.

  • The final model to predict landing distance is shown below. It is fitted on FAA_Cleaned_3, so the speed_ground and height coefficients are per standard deviation of the predictor, while aircraft is not standardized.

Model_New_LD_3
## 
## Call:
## lm(formula = distance ~ aircraft + speed_ground + height, data = FAA_Cleaned_3)
## 
## Coefficients:
##  (Intercept)      aircraft  speed_ground        height  
##        783.1         503.5         789.7         135.2
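Because the final model is fitted on standardized predictors, a new landing has to be transformed with the centers and scales of the training data before calling `predict()`. The following is a sketch under stated assumptions: the data frame `train`, its coefficients and the new observation `new_raw` are all invented, and aircraft uses a 1/2 numeric coding.

```r
set.seed(21)
n <- 150
train <- data.frame(
  aircraft     = sample(1:2, n, replace = TRUE),   # assumed 1/2 coding
  speed_ground = rnorm(n, 80, 19),
  height       = rnorm(n, 30, 10)
)
train$distance <- -3000 + 480 * train$aircraft + 42 * train$speed_ground +
  14 * train$height + rnorm(n, 0, 350)

# Standardize the numeric predictors and remember the centers and scales.
scaled  <- scale(train[, c("speed_ground", "height")])
centers <- attr(scaled, "scaled:center")
scales  <- attr(scaled, "scaled:scale")

train_std <- data.frame(aircraft = train$aircraft, scaled, distance = train$distance)
fit <- lm(distance ~ aircraft + speed_ground + height, data = train_std)

# A new (hypothetical) landing, transformed with the training centers and scales.
new_raw <- data.frame(aircraft = 2, speed_ground = 110, height = 35)
new_std <- data.frame(
  aircraft     = new_raw$aircraft,
  speed_ground = unname((new_raw$speed_ground - centers["speed_ground"]) / scales["speed_ground"]),
  height       = unname((new_raw$height - centers["height"]) / scales["height"])
)

predicted_distance <- unname(predict(fit, newdata = new_std))
```

The predicted distance can then be compared with the available runway length to judge the overrun risk for that landing.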