Flight Landing Distance Model

Nitin Mittapally

Background

Data set: FAA Flight landing

Motivation: To reduce the risk of landing overrun

Goal: To study the factors and their impact on the landing distance for commercial flights

Variable Dictionary

Aircraft: The make of an aircraft (Boeing or Airbus).

Duration (in minutes): Flight duration between taking off and landing. The duration of a normal flight should always be greater than 40 minutes.

No_pasg: The number of passengers in a flight.

Speed_ground (in miles per hour): The ground speed of an aircraft when passing over the threshold of the runway. If its value is less than 30 mph or greater than 140 mph, the landing is considered abnormal.

Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of the runway. If its value is less than 30 mph or greater than 140 mph, the landing is considered abnormal.

Height (in meters): The height of an aircraft when it is passing over the threshold of the runway. The landing aircraft is required to be at least 6 meters high at the threshold of the runway.

Pitch (in degrees): Pitch angle of an aircraft when it is passing over the threshold of the runway.

Distance (in feet): The landing distance of an aircraft. More specifically, it refers to the distance between the threshold of the runway and the point where the aircraft can be fully stopped. The length of the airport runway is typically less than 6000 feet.

Reading Data

library(readxl)     # read_xls()
library(tidyverse)  # dplyr, tidyr, ggplot2
faa1 = read_xls("./data/FAA1.xls")
faa2 = read_xls("./data/FAA2.xls")
str(faa1)
## tibble[,8] [800 x 8] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : chr [1:800] "boeing" "boeing" "boeing" "boeing" ...
##  $ duration    : num [1:800] 98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num [1:800] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:800] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:800] 109 103 NA NA NA ...
##  $ height      : num [1:800] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:800] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:800] 3370 2988 1145 1664 1050 ...
str(faa2)
## tibble[,7] [150 x 7] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : chr [1:150] "boeing" "boeing" "boeing" "boeing" ...
##  $ no_pasg     : num [1:150] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:150] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:150] 109 103 NA NA NA ...
##  $ height      : num [1:150] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:150] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:150] 3370 2988 1145 1664 1050 ...

The second dataset has one fewer feature: it lacks duration. Dataset 1 has 800 observations with 8 features; dataset 2 has 150 observations with 7 features.

Merging 2 data sets

faa2$duration = rep(NA, 150)
data = rbind(faa1, faa2)
features = colnames(data)
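As an alternative to adding the missing column by hand, `dplyr::bind_rows()` matches columns by name and fills any column absent from one input with `NA`. A minimal sketch on two made-up data frames (`a` and `b` are illustrative stand-ins for `faa1` and `faa2`, not the real data):

```r
library(dplyr)

# Toy stand-ins: `b` lacks the `duration` column, like faa2
a <- data.frame(aircraft = c("boeing", "airbus"),
                duration = c(98.5, 125.7),
                distance = c(3370, 2988))
b <- data.frame(aircraft = "boeing", distance = 1145)

combined <- bind_rows(a, b)   # `duration` is NA for the row that came from `b`
print(combined)
```

This avoids hard-coding the row count (150) when creating the placeholder column.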

Checking for the duplicate records in the data.

data = data %>% distinct(aircraft, no_pasg, speed_ground, speed_air, height, pitch, distance, .keep_all = TRUE)
data$aircraft = as.factor(data$aircraft)
glimpse(data)
## Rows: 850
## Columns: 8
## $ aircraft     <fct> boeing, boeing, boeing, boeing, boeing, boeing, boeing, b~
## $ duration     <dbl> 98.47909, 125.73330, 112.01700, 196.82569, 90.09538, 137.~
## $ no_pasg      <dbl> 53, 69, 61, 56, 70, 55, 54, 57, 61, 56, 61, 54, 54, 58, 6~
## $ speed_ground <dbl> 107.91568, 101.65559, 71.05196, 85.81333, 59.88853, 75.01~
## $ speed_air    <dbl> 109.32838, 102.85141, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ height       <dbl> 27.41892, 27.80472, 18.58939, 30.74460, 32.39769, 41.2149~
## $ pitch        <dbl> 4.043515, 4.117432, 4.434043, 3.884236, 4.026096, 4.20385~
## $ distance     <dbl> 3369.8364, 2987.8039, 1144.9224, 1664.2182, 1050.2645, 16~

100 duplicate records were found and removed. After removing duplicates, the combined data set has 850 observations.
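The number of duplicates can be counted directly before dropping them. The sketch below uses base R's `duplicated()` on every column except `duration`, since that column is absent from the second file; the data frame `toy` is made up for illustration.

```r
# Toy data: rows 1 and 3 agree on every column except `duration`
toy <- data.frame(
  aircraft = c("boeing", "airbus", "boeing"),
  duration = c(98.5, 125.7, NA),
  distance = c(3370, 2988, 3370)
)

key_cols <- setdiff(names(toy), "duration")   # compare on all columns but duration
n_dup <- sum(duplicated(toy[, key_cols]))     # rows that repeat an earlier row
print(n_dup)
```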

summary(data)
##    aircraft      duration         no_pasg      speed_ground      speed_air     
##  airbus:450   Min.   : 14.76   Min.   :29.0   Min.   : 27.74   Min.   : 90.00  
##  boeing:400   1st Qu.:119.49   1st Qu.:55.0   1st Qu.: 65.90   1st Qu.: 96.25  
##               Median :153.95   Median :60.0   Median : 79.64   Median :101.15  
##               Mean   :154.01   Mean   :60.1   Mean   : 79.45   Mean   :103.80  
##               3rd Qu.:188.91   3rd Qu.:65.0   3rd Qu.: 92.06   3rd Qu.:109.40  
##               Max.   :305.62   Max.   :87.0   Max.   :141.22   Max.   :141.72  
##               NA's   :50                                       NA's   :642     
##      height           pitch          distance      
##  Min.   :-3.546   Min.   :2.284   Min.   :  34.08  
##  1st Qu.:23.314   1st Qu.:3.642   1st Qu.: 883.79  
##  Median :30.093   Median :4.008   Median :1258.09  
##  Mean   :30.144   Mean   :4.009   Mean   :1526.02  
##  3rd Qu.:36.993   3rd Qu.:4.377   3rd Qu.:1936.95  
##  Max.   :59.946   Max.   :5.927   Max.   :6533.05  
## 

Observations on Dataset

  • The two datasets have different numbers of features (8 vs. 7; the second lacks duration)
  • The dataset is roughly balanced across the two aircraft makes (450 Airbus, 400 Boeing)
  • The combined data has 850 observations with 7 potential features
  • speed_air is missing for about 75% of records (642 of 850), which limits its usefulness
  • Some abnormal values are present and need to be cleaned

Understanding data

From the data description, every variable has a permissible range. If a record has any value outside this range, then it is considered as an abnormal observation. We’ll be removing all the abnormal observations from the data.

data = data %>% filter(
            (duration >= 40 | is.na(duration)) &
            (speed_ground >= 30 & speed_ground <= 140 | is.na(speed_ground)) &
            (height >= 6 | is.na(height)) &
            (distance <= 6000| is.na(distance))
          )
summary(data)
##    aircraft      duration         no_pasg       speed_ground   
##  airbus:444   Min.   : 41.95   Min.   :29.00   Min.   : 33.57  
##  boeing:387   1st Qu.:119.63   1st Qu.:55.00   1st Qu.: 66.20  
##               Median :154.28   Median :60.00   Median : 79.79  
##               Mean   :154.78   Mean   :60.06   Mean   : 79.54  
##               3rd Qu.:189.66   3rd Qu.:65.00   3rd Qu.: 91.91  
##               Max.   :305.62   Max.   :87.00   Max.   :132.78  
##               NA's   :50                                       
##    speed_air          height           pitch          distance      
##  Min.   : 90.00   Min.   : 6.228   Min.   :2.284   Min.   :  41.72  
##  1st Qu.: 96.23   1st Qu.:23.530   1st Qu.:3.640   1st Qu.: 893.28  
##  Median :101.12   Median :30.167   Median :4.001   Median :1262.15  
##  Mean   :103.48   Mean   :30.458   Mean   :4.005   Mean   :1522.48  
##  3rd Qu.:109.36   3rd Qu.:37.004   3rd Qu.:4.370   3rd Qu.:1936.63  
##  Max.   :132.91   Max.   :59.946   Max.   :5.927   Max.   :5381.96  
##  NA's   :628

19 abnormal observations were removed from the data, leaving 831.
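To see which rule accounts for the removed rows, each range check can be turned into a logical flag and tallied. This is a sketch on made-up values (`toy` and the flag names are hypothetical); as in the filter above, a missing value is not treated as abnormal.

```r
# Toy data with one violation per rule (values are made up)
toy <- data.frame(
  duration     = c(120, 30, NA, 150, 160),
  speed_ground = c(80, 90, 150, 70, 85),
  height       = c(30, 25, 20, 3, 35),
  distance     = c(1500, 1200, 1800, 900, 6500)
)

# TRUE marks a violation; NA counts as "not abnormal"
flags <- with(toy, data.frame(
  duration_lt_40   = !is.na(duration) & duration < 40,
  speed_out_range  = !is.na(speed_ground) & (speed_ground < 30 | speed_ground > 140),
  height_lt_6      = !is.na(height) & height < 6,
  distance_gt_6000 = !is.na(distance) & distance > 6000
))

per_rule   <- colSums(flags)              # violations per rule
n_abnormal <- sum(rowSums(flags) > 0)     # rows failing at least one rule
print(per_rule)
print(n_abnormal)
```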

Checking for missing data records.

colSums(is.na(data[,2:8]))
##     duration      no_pasg speed_ground    speed_air       height        pitch 
##           50            0            0          628            0            0 
##     distance 
##            0

There are 628 missing values in speed_air and 50 in duration. Since about 75% of speed_air is missing, it is better to leave that column out of the model. For duration, we impute the missing values with the median.

data = data %>% mutate(duration=ifelse(is.na(duration), median(duration, na.rm=TRUE), duration))
colSums(is.na(data[,2:7]))
##     duration      no_pasg speed_ground    speed_air       height        pitch 
##            0            0            0          628            0            0
str(data)
## tibble[,8] [831 x 8] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : Factor w/ 2 levels "airbus","boeing": 2 2 2 2 2 2 2 2 2 2 ...
##  $ duration    : num [1:831] 98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num [1:831] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:831] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:831] 109 103 NA NA NA ...
##  $ height      : num [1:831] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:831] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:831] 3370 2988 1145 1664 1050 ...
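The share of missing values per column, and one side effect of median imputation, can be checked in a few lines. The sketch below uses made-up numbers (`toy` is not the landing data): the imputed column keeps its median but its spread shrinks, which is worth keeping in mind when only a small fraction of a column is imputed, as with duration here.

```r
toy <- data.frame(
  duration  = c(100, NA, 140, 180),
  speed_air = c(109, NA, NA, NA)
)

missing_share <- colMeans(is.na(toy))   # fraction of NA per column
print(missing_share)

# Median imputation, as done for `duration` above
imputed <- ifelse(is.na(toy$duration), median(toy$duration, na.rm = TRUE), toy$duration)
print(c(sd_before = sd(toy$duration, na.rm = TRUE), sd_after = sd(imputed)))
```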

Distributions of features and landing distance.

data[,-1] %>%
  gather() %>%
  ggplot(aes(value)) +                            
  facet_wrap(~ key, scales = "free") +
  geom_histogram(aes(y = stat(density)), bins = 10) +
  geom_density(col = "red", size = 1)

Observations on cleaned dataset

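  • We have 831 observations and 6 potential features (ignoring speed_air)
  • The 50 missing values in the duration column were imputed with the median
  • Landing distance appears right-skewed rather than normally distributed: its mean (1522) exceeds its median (1262), and its maximum (5382) lies far above the third quartile (1937)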
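A numeric check can complement the histograms. For a roughly symmetric distribution the mean and median are close and the sample skewness is near zero; in the cleaned summary above, distance has mean 1522.48 and median 1262.15, which hints at a right tail. The helper `skewness()` below is a hypothetical moment-based implementation, shown on simulated data rather than the landing data.

```r
# Moment-based sample skewness (hypothetical helper, not part of the original analysis)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(1)
sym   <- rnorm(1000)   # symmetric
right <- rexp(1000)    # long right tail

print(c(symmetric = skewness(sym), right_skewed = skewness(right)))
```

Applied to the real data, the same function could be called as `skewness(data$distance)`.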

Correlation Analysis of the data.

We see a very strong correlation between speed_ground and landing distance. The remaining features show little correlation with landing distance. This is consistent with the correlation table.
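A correlation table of this kind can be produced with base R's `cor()`. The sketch below is only an illustration: it uses the built-in `mtcars` data as a stand-in, with `mpg` playing the role of `distance`, and `use = "pairwise.complete.obs"` so that columns with missing values (such as speed_air) would not break the computation.

```r
# mtcars (built into R) stands in for the landing data; `mpg` plays the role of `distance`
num <- mtcars[, c("mpg", "wt", "hp", "qsec")]

# Pairwise Pearson correlations, ignoring NAs pair by pair
cors <- cor(num, use = "pairwise.complete.obs")

# Correlation of every predictor with the response, strongest first
with_response <- cors[-1, "mpg"]
print(round(with_response[order(-abs(with_response))], 3))
```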

SLR Model with all variables and landing distance

cols = colnames(data)
table2 = data.frame(matrix(ncol = 4, nrow = 0))
colnames(table2) = c("variable","coef", "P_Val", "Direction")
for(i in c(1:7)){
  fm = paste("distance ~", cols[i])
  model = lm(fm, data= data)
  modelSummary = summary(model)
  direction = ifelse(modelSummary$coefficients[2,1] > 0, "Positive", "Negative")
  pval =  modelSummary$coefficients[2,4]
  coef_val = modelSummary$coefficients[2,1]
  table2[nrow(table2)+1,] =   c(cols[i], abs(coef_val), pval, direction)
}
print(table2)
##       variable              coef                 P_Val Direction
## 1     aircraft  427.666335349148  3.52619444130319e-12  Positive
## 2     duration 0.957383575177885     0.149327704730143  Negative
## 3      no_pasg  2.12458597966199     0.609252001778225  Negative
## 4 speed_ground  41.4421888175468 4.76637093869623e-252  Positive
## 5    speed_air  79.5320952653839  2.50046087022476e-97  Positive
## 6       height  9.10656853901242   0.00412385985399162  Positive
## 7        pitch  148.141877548958    0.0120812424909119  Positive

Standardizing data

data_sd = data %>% mutate(across(where(is.numeric), ~ (. - mean(., na.rm = T))/ sd(., na.rm = T)) )
data_sd[,-1] %>%
  gather() %>%
  ggplot(aes(value)) +                            
  facet_wrap(~ key, scales = "free") +
  geom_histogram(aes(y = stat(density)), bins = 10) +
  geom_density(col = "red", size = 1)

All features are standardized to have zero mean and unit variance.

Modeling with the standardized data

cols = colnames(data)
table3 = data.frame(matrix(ncol = 4, nrow = 0))
colnames(table3) = c("variable", "coef", "pval", "Direction")
for(i in c(1:7)){
  fm = paste("distance ~", cols[i])
  model = lm(fm, data= data_sd)
  modelSummary = summary(model)
  coef_val =  modelSummary$coefficients[2,1]
  pval =  modelSummary$coefficients[2,4]
  modeldirection = ifelse(coef_val > 0.0, "Positive", "Negative")
  table3[nrow(table3)+1,] =   c( cols[i],  abs(coef_val), pval,  modeldirection)
}
print(table3)
##       variable               coef                  pval Direction
## 1     aircraft  0.477126109401757  3.52619444130326e-12  Positive
## 2     duration 0.0500633017014816     0.149327704730145  Negative
## 3      no_pasg 0.0177566313292344     0.609252001778232  Negative
## 4 speed_ground  0.866243834797758 4.76637093869894e-252  Positive
## 5    speed_air  0.863900011442723   2.5004608702254e-97  Positive
## 6       height 0.0994112051009979    0.0041238598539917  Positive
## 7        pitch 0.0870284580029236     0.012081242490912  Positive

Ranking the factors by the size of their effect on landing distance.

table0 = table3 %>% arrange(desc(as.numeric(coef)))  # coef is stored as character, so convert before sorting
print(table0)
##       variable               coef                  pval Direction
## 1 speed_ground  0.866243834797758 4.76637093869894e-252  Positive
## 2    speed_air  0.863900011442723   2.5004608702254e-97  Positive
## 3     aircraft  0.477126109401757  3.52619444130326e-12  Positive
## 4       height 0.0994112051009979    0.0041238598539917  Positive
## 5        pitch 0.0870284580029236     0.012081242490912  Positive
## 6     duration 0.0500633017014816     0.149327704730145  Negative
## 7      no_pasg 0.0177566313292344     0.609252001778232  Negative

The three tables give similar results, so we build table0 from table 3. The table above ranks the candidate features by the absolute size of their standardized coefficient, that is, by their potential for predicting landing distance.
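The standardized coefficients have a direct interpretation: in a simple regression where both response and predictor are standardized, the slope equals the Pearson correlation. For speed_ground, 0.866² ≈ 0.750, which matches the R-squared of 0.7504 reported for the `distance ~ speed_ground` fit in this report. A minimal check of that identity on the built-in `mtcars` data (a stand-in, not the landing data):

```r
# mtcars stands in for the landing data
z <- as.data.frame(scale(mtcars[, c("mpg", "wt")]))   # zero mean, unit sd

slope <- unname(coef(lm(mpg ~ wt, data = z))["wt"])
r     <- cor(mtcars$mpg, mtcars$wt)

print(c(slope = slope, r = r))   # the two agree up to floating-point error
```

The identity does not carry over to the aircraft row, because a factor is not standardized.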

Checking for Collinear features

model1 = lm(distance~ speed_air, data = data)
summary(model1)
## 
## Call:
## lm(formula = distance ~ speed_air, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -776.21 -196.39    8.72  209.17  624.34 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5455.709    207.547  -26.29   <2e-16 ***
## speed_air      79.532      1.997   39.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.3 on 201 degrees of freedom
##   (628 observations deleted due to missingness)
## Multiple R-squared:  0.8875, Adjusted R-squared:  0.887 
## F-statistic:  1586 on 1 and 201 DF,  p-value: < 2.2e-16
model2 = lm(distance~ speed_ground, data = data)
summary(model2)
## 
## Call:
## lm(formula = distance ~ speed_ground, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -897.09 -319.16  -72.09  210.83 1798.88 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1773.9407    67.8388  -26.15   <2e-16 ***
## speed_ground    41.4422     0.8302   49.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 448.1 on 829 degrees of freedom
## Multiple R-squared:  0.7504, Adjusted R-squared:  0.7501 
## F-statistic:  2492 on 1 and 829 DF,  p-value: < 2.2e-16
model3 = lm(distance~ speed_ground + speed_air, data = data)
summary(model3)
## 
## Call:
## lm(formula = distance ~ speed_ground + speed_air, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -819.74 -202.02    3.52  211.25  636.25 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -5462.28     207.48 -26.327  < 2e-16 ***
## speed_ground   -14.37      12.68  -1.133    0.258    
## speed_air       93.96      12.89   7.291 6.99e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 200 degrees of freedom
##   (628 observations deleted due to missingness)
## Multiple R-squared:  0.8883, Adjusted R-squared:  0.8871 
## F-statistic:   795 on 2 and 200 DF,  p-value: < 2.2e-16
cor.test(data$speed_ground, data$speed_air)
## 
##  Pearson's product-moment correlation
## 
## data:  data$speed_ground and data$speed_air
## t = 90.453, df = 201, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9841163 0.9908449
## sample estimates:
##       cor 
## 0.9879383

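The two speed variables are highly correlated (r = 0.988), and once speed_air is in the model the speed_ground coefficient is no longer significant (p = 0.258). We keep speed_ground and drop speed_air because speed_air is missing for 628 of the 831 observations.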
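The strength of the collinearity can also be expressed as a variance inflation factor. With two predictors, VIF = 1 / (1 − r²); for r = 0.9879 this gives roughly 41.7, well above the value of 10 that is often quoted as a rule-of-thumb warning level. The sketch below computes a VIF by hand on simulated data (`x1` and `x2` are made-up stand-ins for the two speed variables).

```r
set.seed(42)
n  <- 200
x1 <- rnorm(n, mean = 100, sd = 10)   # plays the role of speed_air
x2 <- x1 + rnorm(n, sd = 1.5)         # near-copy, like speed_ground

# VIF of x2 = 1 / (1 - R^2) from regressing x2 on the other predictor(s)
r2  <- summary(lm(x2 ~ x1))$r.squared
vif <- 1 / (1 - r2)
print(c(cor = cor(x1, x2), vif = vif))
```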

Variable selection

table0 = table0 %>%  filter(variable != "speed_air" )
features = table0$variable
models = list()
colSums(is.na(data_sd))
##     aircraft     duration      no_pasg speed_ground    speed_air       height 
##            0            0            0            0          628            0 
##        pitch     distance  logDistance         make 
##            0            0            0            0
modelStats = data.frame(matrix(ncol = 5, nrow = 0 ))
colnames(modelStats) = c("fm", "R-Sqr", "Adj. R-Sqr", "AIC", "BIC")
for(i in 1:length(features)){
  fm = paste("distance ~ ", paste(features[1:i], collapse = " + "))
  model = lm(fm, data = data_sd)
  aicVal =AIC(model)
  bicVal = BIC(model)
  sm = summary(model)
  print(paste("##################", fm, "###################"))
  print(sm)
  rsq = sm$r.squared
  arsq = sm$adj.r.squared
  modelStats[i,] = c(fm, rsq, arsq, aicVal, bicVal)
}
## [1] "################## distance ~  speed_ground ###################"
## 
## Call:
## lm(formula = fm, data = data_sd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.00084 -0.35607 -0.08043  0.23522  2.00692 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.931e-16  1.734e-02    0.00        1    
## speed_ground 8.662e-01  1.735e-02   49.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4999 on 829 degrees of freedom
## Multiple R-squared:  0.7504, Adjusted R-squared:  0.7501 
## F-statistic:  2492 on 1 and 829 DF,  p-value: < 2.2e-16
## 
## [1] "################## distance ~  speed_ground + aircraft ###################"
## 
## Call:
## lm(formula = fm, data = data_sd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.83595 -0.28591 -0.07563  0.16779  1.72017 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.25531    0.01988  -12.85   <2e-16 ***
## speed_ground    0.87731    0.01454   60.32   <2e-16 ***
## aircraftboeing  0.54823    0.02914   18.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4187 on 828 degrees of freedom
## Multiple R-squared:  0.8251, Adjusted R-squared:  0.8247 
## F-statistic:  1953 on 2 and 828 DF,  p-value: < 2.2e-16
## 
## [1] "################## distance ~  speed_ground + aircraft + height ###################"
## 
## Call:
## lm(formula = fm, data = data_sd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7943 -0.2530 -0.1006  0.1451  1.6421 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.25773    0.01849  -13.94   <2e-16 ***
## speed_ground    0.88632    0.01355   65.41   <2e-16 ***
## aircraftboeing  0.55341    0.02711   20.41   <2e-16 ***
## height          0.15444    0.01354   11.40   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3894 on 827 degrees of freedom
## Multiple R-squared:  0.8489, Adjusted R-squared:  0.8484 
## F-statistic:  1549 on 3 and 827 DF,  p-value: < 2.2e-16
## 
## [1] "################## distance ~  speed_ground + aircraft + height + pitch ###################"
## 
## Call:
## lm(formula = fm, data = data_sd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7997 -0.2500 -0.1040  0.1426  1.6745 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.25005    0.01908  -13.11   <2e-16 ***
## speed_ground    0.88686    0.01354   65.49   <2e-16 ***
## aircraftboeing  0.53693    0.02895   18.55   <2e-16 ***
## height          0.15382    0.01353   11.37   <2e-16 ***
## pitch           0.02327    0.01445    1.61    0.108    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.389 on 826 degrees of freedom
## Multiple R-squared:  0.8494, Adjusted R-squared:  0.8486 
## F-statistic:  1164 on 4 and 826 DF,  p-value: < 2.2e-16
## 
## [1] "################## distance ~  speed_ground + aircraft + height + pitch + duration ###################"
## 
## Call:
## lm(formula = fm, data = data_sd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7970 -0.2499 -0.1035  0.1443  1.6757 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.250143   0.019092 -13.102   <2e-16 ***
## speed_ground    0.887016   0.013567  65.381   <2e-16 ***
## aircraftboeing  0.537130   0.028982  18.533   <2e-16 ***
## height          0.153796   0.013542  11.357   <2e-16 ***
## pitch           0.023381   0.014467   1.616    0.106    
## duration        0.003177   0.013549   0.234    0.815    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3893 on 825 degrees of freedom
## Multiple R-squared:  0.8494, Adjusted R-squared:  0.8485 
## F-statistic: 930.5 on 5 and 825 DF,  p-value: < 2.2e-16
## 
## [1] "################## distance ~  speed_ground + aircraft + height + pitch + duration + no_pasg ###################"
## 
## Call:
## lm(formula = fm, data = data_sd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7636 -0.2503 -0.1016  0.1394  1.6886 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.249825   0.019084 -13.091   <2e-16 ***
## speed_ground    0.887007   0.013560  65.413   <2e-16 ***
## aircraftboeing  0.536446   0.028972  18.516   <2e-16 ***
## height          0.154665   0.013550  11.414   <2e-16 ***
## pitch           0.023123   0.014461   1.599    0.110    
## duration        0.002492   0.013551   0.184    0.854    
## no_pasg        -0.018327   0.013534  -1.354    0.176    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3891 on 824 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8486 
## F-statistic: 776.5 on 6 and 824 DF,  p-value: < 2.2e-16
modelStats = modelStats %>% mutate(across(c(2,3,4,5), as.double))

modelStats[,2:3] %>% gather() %>%  ggplot(aes(x = rep(seq_len(nrow(modelStats)), times = 2), y = value, group = key, col = key)) + 
  geom_line() + 
  geom_point() + 
  ggtitle("Variation in R-Sqr/ Adj.R-Sqr with addition of features") + 
  ylab("R-Sqr/ Adj.R-Sqr Value") + 
  xlab("No. of Features") + 
  theme( axis.text.x = element_blank())

modelStats[,4:5] %>% gather() %>%  ggplot(aes(x = rep(seq_len(nrow(modelStats)), times = 2), y = value, group = key, col = key)) + 
  geom_line() + 
  geom_point() + 
  ggtitle("Variation in AIC/ BIC with addition of features") + 
  ylab("AIC/BIC") + 
  xlab("No. of Features") + 
  theme( axis.text.x = element_blank())

The R-squared plot shows that both R-squared and adjusted R-squared flatten after 3 features (R-squared is 0.8489 with three features versus 0.8497 with all six). The AIC and BIC curves likewise show only marginal improvement beyond 3 features. Landing distance can therefore be predicted from speed_ground, aircraft make, and height. The estimated coefficients for this model, fitted on the unstandardized data, are:

print(summary(lm( distance ~  speed_ground + aircraft + height, data = data)))
## 
## Call:
## lm(formula = distance ~ speed_ground + aircraft + height, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -711.95 -226.73  -90.17  130.04 1471.84 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2512.2433    68.1974  -36.84   <2e-16 ***
## speed_ground      42.4024     0.6483   65.41   <2e-16 ***
## aircraftboeing   496.0452    24.2975   20.41   <2e-16 ***
## height            14.1478     1.2405   11.40   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 349.1 on 827 degrees of freedom
## Multiple R-squared:  0.8489, Adjusted R-squared:  0.8484 
## F-statistic:  1549 on 3 and 827 DF,  p-value: < 2.2e-16
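Whether a further feature is worth adding can also be checked with a partial F-test between nested models. For a single added term the F statistic is the square of that term's t statistic, so the test agrees with the coefficient table (for pitch above: t = 1.61, p = 0.108). A sketch with base R's `anova()` on the built-in `mtcars` data, used here only as a stand-in:

```r
# mtcars stands in for the landing data
small <- lm(mpg ~ wt + hp, data = mtcars)
large <- lm(mpg ~ wt + hp + qsec, data = mtcars)

cmp <- anova(small, large)      # partial F-test for the added term
print(cmp)

# For a single added term, F equals the squared t statistic of that term
t_qsec <- coef(summary(large))["qsec", "t value"]
print(c(F = cmp$F[2], t_squared = t_qsec^2))
```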

Model for Landing Distance

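\[ \text{Landing Distance} = -2512.24 + 42.40 \times \text{speed\_ground} + 496.05 \times \text{Boeing} + 14.15 \times \text{height} \]

Here Boeing is 1 for a Boeing aircraft and 0 for an Airbus, landing distance is in feet, speed_ground is in miles per hour, and height is in meters. The above equation can be used to predict the landing distance.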

Feature selection with the stepAIC function.

library(MASS)  # provides stepAIC()
d = data[,c(1:4, 6:8)]  # drop speed_air (column 5)
baseModel = lm(distance ~ 1, data = d)
topModel = lm(distance~., data = d)
AIC(baseModel)
## [1] 13660.08
AIC(topModel)
## [1] 12097.14
stepAIC(baseModel, direction = "forward", scope = list(upper = topModel, lower = baseModel))
## Start:  AIC=11299.8
## distance ~ 1
## 
##                Df Sum of Sq       RSS   AIC
## + speed_ground  1 500382567 166457762 10148
## + aircraft      1  37818390 629021939 11253
## + height        1   6590108 660250221 11294
## + pitch         1   5050617 661789712 11296
## + duration      1   1671325 665169005 11300
## <none>                      666840329 11300
## + no_pasg       1    210253 666630076 11302
## 
## Step:  AIC=10148.53
## distance ~ speed_ground
## 
##            Df Sum of Sq       RSS     AIC
## + aircraft  1  49848656 116609106  9854.8
## + height    1  14916377 151541385 10072.5
## + pitch     1   9765095 156692668 10100.3
## <none>                  166457762 10148.5
## + no_pasg   1    207528 166250234 10149.5
## + duration  1     49785 166407977 10150.3
## 
## Step:  AIC=9854.77
## distance ~ speed_ground + aircraft
## 
##            Df Sum of Sq       RSS    AIC
## + height    1  15848830 100760276 9735.4
## + pitch     1    455453 116153653 9853.5
## <none>                  116609106 9854.8
## + no_pasg   1     87171 116521935 9856.1
## + duration  1      8444 116600662 9856.7
## 
## Step:  AIC=9735.37
## distance ~ speed_ground + aircraft + height
## 
##            Df Sum of Sq       RSS    AIC
## + pitch     1    315259 100445017 9734.8
## <none>                  100760276 9735.4
## + no_pasg   1    232003 100528273 9735.5
## + duration  1      3971 100756305 9737.3
## 
## Step:  AIC=9734.77
## distance ~ speed_ground + aircraft + height + pitch
## 
##            Df Sum of Sq       RSS    AIC
## <none>                  100445017 9734.8
## + no_pasg   1    225608 100219409 9734.9
## + duration  1      6693 100438324 9736.7
## 
## Call:
## lm(formula = distance ~ speed_ground + aircraft + height + pitch, 
##     data = d)
## 
## Coefficients:
##    (Intercept)    speed_ground  aircraftboeing          height           pitch  
##       -2664.32           42.43          481.27           14.09           39.61

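stepAIC selects speed_ground, aircraft, height, and pitch. Adding pitch lowers the AIC only marginally (from 9735.37 to 9734.77), and its coefficient is not significant in the nested fit above (p = 0.108), so we keep the more parsimonious three-feature model with speed_ground, aircraft, and height.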
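The AIC values in the stepAIC trace (11299.8 at the start) differ from `AIC(baseModel)` (13660.08) because stepAIC relies on `extractAIC()`, which for a linear model computes n·log(RSS/n) + 2·edf and drops the constants of the Gaussian log-likelihood. The two scales differ by n·(1 + log 2π) + 2; with n = 831 that is about 2360.28, and 11299.80 + 2360.28 = 13660.08. Rankings of models fitted to the same data are unaffected. The sketch below verifies the relationship on the built-in `mtcars` data (a stand-in only).

```r
fit <- lm(mpg ~ wt, data = mtcars)   # mtcars stands in for the landing data
n   <- nobs(fit)
rss <- sum(residuals(fit)^2)
p   <- length(coef(fit))             # mean parameters: intercept + slope

step_aic <- n * log(rss / n) + 2 * p                 # scale used in the stepAIC trace
full_aic <- step_aic + n * (1 + log(2 * pi)) + 2     # adds constants and the variance parameter

print(c(extractAIC = extractAIC(fit)[2], by_hand = step_aic))
print(c(AIC = AIC(fit), by_hand = full_aic))
```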

Final Model

print(summary(lm( distance ~  speed_ground + aircraft + height, data = data)))
## 
## Call:
## lm(formula = distance ~ speed_ground + aircraft + height, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -711.95 -226.73  -90.17  130.04 1471.84 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2512.2433    68.1974  -36.84   <2e-16 ***
## speed_ground      42.4024     0.6483   65.41   <2e-16 ***
## aircraftboeing   496.0452    24.2975   20.41   <2e-16 ***
## height            14.1478     1.2405   11.40   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 349.1 on 827 degrees of freedom
## Multiple R-squared:  0.8489, Adjusted R-squared:  0.8484 
## F-statistic:  1549 on 3 and 827 DF,  p-value: < 2.2e-16

\[ \text{Landing Distance} = -2512.24 + 42.40 \times \text{speed\_ground} + 496.05 \times \text{Boeing} + 14.15 \times \text{height} \]
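As a worked example of using the fitted equation, the sketch below plugs in made-up inputs (100 mph ground speed, 30 m height). The coefficients are copied from the final model summary; `predict_distance()` is a hypothetical helper, not part of the original analysis. In practice `predict()` on the fitted `lm` object would do the same job.

```r
# Coefficients copied from the final model summary above
b <- c(intercept = -2512.2433, speed_ground = 42.4024, boeing = 496.0452, height = 14.1478)

# Hypothetical helper: predicted landing distance in feet
predict_distance <- function(speed_ground, height, is_boeing) {
  unname(b["intercept"] + b["speed_ground"] * speed_ground +
           b["boeing"] * as.numeric(is_boeing) + b["height"] * height)
}

# Made-up inputs: 100 mph ground speed, 30 m height over the threshold
print(predict_distance(100, 30, is_boeing = TRUE))    # about 2648.5 ft
print(predict_distance(100, 30, is_boeing = FALSE))   # about 2152.4 ft
```

The difference between the two predictions is the Boeing coefficient, about 496 ft, regardless of speed and height.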