Predicting flight landings using GLM Models

Introduction

Motivation: To reduce the risk of landing overrun.

Goal: To study what factors and how they would impact the landing distance of a commercial flight.

Packages Required

library(tidyverse)  #to visualize, transform, input, tidy and join data
library(dplyr)      #data wrangling
library(stringr)    #string related functions
library(kableExtra) #to create HTML Table
library(DT)         #to preview the data sets
library(lubridate)  #to apply the date functions
library(xlsx)       #to load excel files
library(ROCR)       #to use ROC curves
library(faraway)    #to use the ilogit function
library(nnet)       #to implement multinomial function

Initial Data Analysis

Data has following columns -

text_tbl <- data.frame (
  Variable = c("Aircraft", "Duration", "No_pasg", "speed_ground","speed_air","height", "pitch","distance"),
  Description = c(
    "make of an aircraft ",
    "Duration of flight",
    "no. of passengers ",
     "ground speed  ",
    "air speed ",
    "height ",
    "pitch angle ",
    "flight duration between take-off and landing "
  )
)

kable(text_tbl) %>%
  kable_styling(full_width = F) %>%
  column_spec(1, bold = T, border_right = T) %>%
  column_spec(2, width = "30em")
Variable Description
Aircraft make of an aircraft
Duration Duration of flight
No_pasg no. of passengers
speed_ground ground speed
speed_air air speed
height height
pitch pitch angle
distance flight duration between take-off and landing

Exploratory Data Analysis

I load the two datasets-

faa1 <- read.xlsx("FAA1.xls", sheetName = "FAA1")
faa2 <- read.xlsx("FAA2_2.xls", sheetName = "Sheet1")

Data has 800 observations and 8 in FAA1. Data has 150 observations and 7 in FAA2.

FAA2 doesn’t contain information about the duration of flights

Merge the two data sets. Are there any duplications?

faa <- bind_rows(faa1, faa2)

str(faa)
## 'data.frame':    950 obs. of  8 variables:
##  $ aircraft    : Factor w/ 2 levels "airbus","boeing": 2 2 2 2 2 2 2 2 2 2 ...
##  $ duration    : num  98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...
#add the duplicate removal code
faa %>% 
  select(-duration) %>% 
  duplicated() %>% 
  sum() 
## [1] 100

There are 100 duplicated in total, which I have removed.

check <- faa %>%  
 select(-duration) %>% 
  duplicated() %>% 
  which()

faa <- faa[-check,]

Key findings-

I observed that few of the variables have incorrect data, which may be because of the issue with data capture or wrong data entry. For example-

  1. height has negative value as the minimum value

  2. the minimum distance for an observation is 34 which is too small.

  3. air_speed is not captured for 75% of the data.

  4. The minium duration of flight is 15 minutes, which doesn’t seem right

  5. Data had duplicate records(100) after merging the two data-sets

Data Cleaning and further exploration

Removing abnormal values in the data set.

faa_check <- faa %>% 
  filter((duration > 40| is.na(duration)) & (speed_ground >= 30) & (speed_ground <= 140) &
           (height >= 6) & (distance < 6000)) 
dim(faa_check)
## [1] 831   8
faa <- faa_check

A total of 19 observations seem abnormal which we remove.

Data has 831 observations and 8

We observe that Duration is null for 50 observations, which we need to look at. We will replace the value with mean of the overall column

faa$duration_corrected <- NA
faa <-  transform(faa, duration_corrected = ifelse(is.na(faa$duration), mean(faa$duration, na.rm=TRUE), faa$duration))

Plotting histogram for all the variables.

#hist(faa$duration_impute, main = "Histogram of Duration", xlab = "Duration")
hist(faa$speed_ground, main = "Histogram of Ground Speed", xlab = "Ground Speed")

hist(faa$height, main = "Histogram of Height", xlab = "Height")

hist(faa$pitch, main = "Histogram of Pitch", xlab = "Pitch")

hist(faa$no_pasg, main = "Histogram of No. of Passengers", xlab = "No. of Passengers")

hist(faa$speed_air, main = "Histogram of Air Speed", xlab = "Air Speed")

hist(faa$distance, main = "Histogram of Landing Distance", xlab = "Landing Distance")

hist(faa$duration_corrected, main = "Histogram of Duration of flight", xlab = "Flight Duration in mins")

Key findings:

After cleaning the data, I observed that - 1. There were total 19 abnormal values in the data

  1. Duration has 50 NA values, which we corrected based on the mean of the overall sample

  2. Speed of the air is right-skewed whereas all the other variables seem to be noramlly distributed

  3. Min speed of air is 90 MPH

Initial analysis for identifying important factors that impact the

response variable “landing distance”

Pairwise Correlation

cor_duration <- cor(faa$distance, faa$duration_corrected)
cor_speed_ground <- cor(faa$distance,faa$speed_ground)
cor_height <- cor(faa$distance,faa$height)
cor_pitch <- cor(faa$distance,faa$pitch)
cor_no_pasg <- cor(faa$distance,faa$no_pasg)
cor_speed_air <- cor.test(faa$distance,faa$speed_air,method="pearson")$estimate
cor_aircraft <- cor(faa$distance,as.numeric(faa$aircraft ))

variable_names <- c("Duration","Ground Speed","Height","Pitch",
                    "No. of Passengers","Air Speed","Aircraft")
correlation <- c(cor_duration,cor_speed_ground,cor_height,cor_pitch,cor_no_pasg,cor_speed_air,cor_aircraft)

table_1 <- data.frame(variable_names,correlation)

table_1$direction <- ifelse(table_1$correlation > 0, "Positive","Negative")

table_1 <- table_1 %>% arrange(desc(correlation))

Show X-Y scatter plots

faa <- faa[-2]
GGally::ggpairs(
  data = faa
)

Checking collinearity

We see that both air_speed and ground_speed are closely related to the landing distance. Since they are highly correlated, when we regress both the variables together, we see that ground_speed becomes insignificant as the variability due to speed_ground is explained by speed_Air and hence it is not contributing in explaining variation any more.

cor.test(faa$speed_ground, faa$speed_air, method = "pearson")$estimate
##       cor 
## 0.9879383

We observe 98% correlation. I would like to keep speed_air in my method because since R^2 and adj. R^2 is more when speed_air is considered.(comparison of the two models- 1 and 2.) Thus, speed of air is a significant contributor according to me.

Variable selection and Linear Regression

R squared vs No. of variables
model0 <- lm(distance ~ 1,data=faa)
model1 <- lm(distance ~ speed_air,data=faa)
model2 <- lm(distance ~ speed_air + aircraft,data=faa)
model3 <- lm(distance ~ speed_air + aircraft + height ,data=faa)
model4 <- lm(distance ~ speed_air + aircraft + height + no_pasg  ,data=faa)
model5 <- lm(distance ~ speed_air + aircraft + height + no_pasg + duration_corrected ,data=faa)
model6 <- lm(distance ~ speed_air + aircraft + height + no_pasg + duration_corrected + pitch,data=faa)

model0.rsqr <- summary(model0)$r.squared
model1.rsqr <- summary(model1)$r.squared
model2.rsqr <- summary(model2)$r.squared
model3.rsqr <- summary(model3)$r.squared
model4.rsqr <- summary(model4)$r.squared
model5.rsqr <- summary(model5)$r.squared
model6.rsqr <- summary(model6)$r.squared

rsquare <- cbind(c(model0.rsqr, model1.rsqr, model2.rsqr, model3.rsqr, model4.rsqr, model5.rsqr, model6.rsqr), 0:6) 

 colnames(rsquare) <- c("rsquare","variables") 

 rsquare <-  as.data.frame(rsquare)
 
 rsquare %>% 
   ggplot(aes(x = variables, y = rsquare)) + 
   geom_line() + 
   xlab("no. of variables") +
   ylab("R-square") +
   theme_classic()

With increase in variables, R^2 also increases.

Adjusted R^2 vs No. of variables
model0.rsqr <- summary(model0)$adj.r.squared
model1.rsqr <- summary(model1)$adj.r.squared
model2.rsqr <- summary(model2)$adj.r.squared
model3.rsqr <- summary(model3)$adj.r.squared
model4.rsqr <- summary(model4)$adj.r.squared
model5.rsqr <- summary(model5)$adj.r.squared
model6.rsqr <- summary(model6)$adj.r.squared

rsquare <- cbind(c(model0.rsqr, model1.rsqr, model2.rsqr, model3.rsqr, model4.rsqr, model5.rsqr, model6.rsqr), 0:6) 

 colnames(rsquare) <- c("adj_rsquare","variables") 

 rsquare <-  as.data.frame(rsquare)
 
 rsquare %>% 
   ggplot(aes(x = variables, y = adj_rsquare)) + 
   geom_line() + 
   xlab("no. of variables") +
   ylab("Adjusted R-square") +
   theme_classic()

We see that Adjusted R^2 increases initially but then it slowly starts declining after 3 variables have been added.

Model with AIC values
model0_AIC <- AIC(model0)
model1_AIC <- AIC(model1)
model2_AIC <- AIC(model2)
model3_AIC <- AIC(model3)
model4_AIC <- AIC(model4)
model5_AIC <- AIC(model5)
model6_AIC <- AIC(model6)

AIC <- cbind(c(model0_AIC, model1_AIC, model2_AIC, model3_AIC, model4_AIC, model5_AIC, model6_AIC),0:6)

colnames(AIC) <- c("AIC","variables")

AIC <- as.data.frame(AIC)

AIC %>% 
  ggplot(aes(x = variables, y = AIC)) +
  geom_line() +
  xlab("no. of variables")+
  ylab("AIC") +
  theme_classic()

since smaller the AIC, better is the model. Hence, we see that it decreases. However, after addition of 3 variables, the decrease in AIC isn’t much and infact it starts going up.

Suitable model

According to me, the significant model is -

model4 <- lm(distance ~ speed_air + aircraft + height ,data=faa)
summary(model4)
## 
## Call:
## lm(formula = distance ~ speed_air + aircraft + height, data = faa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -300.74  -94.78   15.47   97.09  330.41 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -6390.376    109.839  -58.18   <2e-16 ***
## speed_air         82.148      0.976   84.17   <2e-16 ***
## aircraftboeing   427.442     19.173   22.29   <2e-16 ***
## height            13.702      1.007   13.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 134.2 on 199 degrees of freedom
##   (628 observations deleted due to missingness)
## Multiple R-squared:  0.9737, Adjusted R-squared:  0.9733 
## F-statistic:  2458 on 3 and 199 DF,  p-value: < 2.2e-16

The R^2 and Adjusted R^2 are also close and around 97% which is quite right.

Variable selection based on automate algorithm

faaNoNA <- na.exclude(faa)
#nrow(faaNoNA)
model01 <- lm(distance ~ 1,data=faaNoNA)
model11 <- lm(distance ~ speed_air,data=faaNoNA)
model21 <- lm(distance ~ speed_air + aircraft,data=faaNoNA)
model31 <- lm(distance ~ speed_air + aircraft + height ,data=faaNoNA)
model41 <- lm(distance ~ speed_air + aircraft + height + no_pasg  ,data=faaNoNA)
model51 <- lm(distance ~ speed_air + aircraft + height + no_pasg + duration_corrected ,data=faaNoNA)
model61 <- lm(distance ~ speed_air + aircraft + height + no_pasg + duration_corrected + pitch,data=faaNoNA)

MASS::stepAIC(model01,direction="forward",scope=list(upper=model61,lower=model01))
## Start:  AIC=2725.93
## distance ~ 1
## 
##                      Df Sum of Sq       RSS    AIC
## + speed_air           1 121121738  15346230 2284.3
## + aircraft            1   4417704 132050265 2721.2
## <none>                            136467968 2725.9
## + height              1    633675 135834293 2727.0
## + duration_corrected  1    353842 136114126 2727.4
## + pitch               1    244706 136223262 2727.6
## + no_pasg             1    217724 136250245 2727.6
## 
## Step:  AIC=2284.33
## distance ~ speed_air
## 
##                      Df Sum of Sq      RSS    AIC
## + aircraft            1   8424832  6921399 2124.7
## + height              1   2803377 12542854 2245.4
## + pitch               1    860427 14485804 2274.6
## + no_pasg             1    159095 15187136 2284.2
## <none>                            15346230 2284.3
## + duration_corrected  1     11938 15334292 2286.2
## 
## Step:  AIC=2124.7
## distance ~ speed_air + aircraft
## 
##                      Df Sum of Sq     RSS    AIC
## + height              1   3335147 3586252 1993.2
## <none>                            6921399 2124.7
## + duration_corrected  1     61095 6860304 2124.9
## + no_pasg             1     56195 6865204 2125.0
## + pitch               1      2584 6918815 2126.6
## 
## Step:  AIC=1993.22
## distance ~ speed_air + aircraft + height
## 
##                      Df Sum of Sq     RSS    AIC
## + no_pasg             1     53331 3532921 1992.2
## <none>                            3586252 1993.2
## + duration_corrected  1     12912 3573340 1994.5
## + pitch               1       174 3586078 1995.2
## 
## Step:  AIC=1992.18
## distance ~ speed_air + aircraft + height + no_pasg
## 
##                      Df Sum of Sq     RSS    AIC
## <none>                            3532921 1992.2
## + duration_corrected  1    9532.8 3523389 1993.6
## + pitch               1     333.8 3532588 1994.2
## 
## Call:
## lm(formula = distance ~ speed_air + aircraft + height + no_pasg, 
##     data = faaNoNA)
## 
## Coefficients:
##    (Intercept)       speed_air  aircraftboeing          height  
##      -6248.753          82.131         425.592          13.696  
##        no_pasg  
##         -2.316

I observe that number of passengers is also considered a signicant contributor by automatic selection of variables.

so the final model as obtained here is - model4 i.e.

Distance = - 6263.754 + 82.032( speed_air) + 432.074 (aircraft) + 13.776(height) - 2.041(no_pasg)

From this, we can also conclude that there is no single criteria for selecting a model. Also there is no “best model”. Since, we have chosen AIC as our selection criteria, we see no_pasg also included in our model. However, if I try to include this variable using P_value method, no_pasg will turn out to be non-significant.

GLM(Logistic, Probit and CLoglog Models)

Segregating long but non-risky landing distance from risky distance.

faa <- faa %>% 
  mutate(long.landing = as.factor(ifelse(distance > 2500, 1,0 )) , 
         risky.landing  = as.factor(ifelse(distance > 3000,1,0 )),
         aircraft = as.factor(aircraft))
faa$duration <- NULL
faa$distance <- NULL

Histogram to show distribution of “long.landing”

faa %>% 
  ggplot(aes(long.landing)) + 
  geom_bar()

only 12% aircrafts have long landing

round(prop.table(table(faa$long.landing)),2)
## 
##    0    1 
## 0.88 0.12
mdl_duration <- glm (faa$long.landing ~ faa$duration_corrected, family = "binomial")
mdl_speedgrnd <- glm (faa$long.landing ~ faa$speed_ground, family = "binomial")
mdl_height <- glm (faa$long.landing ~ faa$height, family = "binomial")
mdl_pitch <- glm (faa$long.landing ~ faa$pitch, family = "binomial")
mdl_nopasg <- glm (faa$long.landing ~ faa$no_pasg, family = "binomial")
mdl_speedair <- glm (faa$long.landing ~ faa$speed_air, family = "binomial")
mdl_aircraft <- glm (faa$long.landing ~ faa$aircraft, family = "binomial")


duration <- summary(mdl_duration)$coef[2,c(1,4)]
speed_ground <- summary(mdl_speedgrnd)$coef[2,c(1,4)]
height <- summary(mdl_height)$coef[2,c(1,4)]
pitch <- summary(mdl_pitch)$coef[2,c(1,4)]
no_pasg <- summary(mdl_nopasg)$coef[2,c(1,4)]
speed_air <- summary(mdl_speedair)$coef[2,c(1,4)]
aircraft_boeing <- summary(mdl_aircraft)$coef[2,c(1,4)]
aircraft_airbus <- summary(mdl_aircraft)$coef[1,c(1,4)]

coefficients <- c(duration[1], speed_ground[1], height[1], pitch[1], no_pasg[1],speed_air[1],aircraft_boeing[1],aircraft_airbus[1])
coefficients <- round(coefficients, digits = 3)

odds_ratio <- round(exp(coefficients), 3)

p_value <- c(duration[2], speed_ground[2], height[2], pitch[2], no_pasg[2],speed_air[2],aircraft_boeing[2],aircraft_airbus[2]) 
p_value <- round(p_value, digits = 3)

variable_names <- c("Duration","Ground Speed","Height","Pitch","No. of Passengers","Air Speed","Aircraft-Boeing", "Aircraft-Airbus")

table_2 <- data.frame(variable_names, coefficients,odds_ratio, p_value)
table_2$slope_direction <- ifelse(coefficients > 0 , "Positive", "Negative")
table_2 <- table_2 %>% 
  select(variable_names, coefficients, odds_ratio, p_value, slope_direction) %>% 
  arrange(p_value)

table_2
##      variable_names coefficients odds_ratio p_value slope_direction
## 1      Ground Speed        0.472      1.603   0.000        Positive
## 2         Air Speed        0.512      1.669   0.000        Positive
## 3   Aircraft-Boeing        0.864      2.373   0.000        Positive
## 4   Aircraft-Airbus       -2.428      0.088   0.000        Negative
## 5             Pitch        0.401      1.493   0.047        Positive
## 6            Height        0.009      1.009   0.422        Positive
## 7 No. of Passengers       -0.007      0.993   0.606        Negative
## 8          Duration       -0.001      0.999   0.626        Negative

we see speed_ground, speed_air, aircraft type, pitch and height appear to be positively correlated to long_landing.

Let’s visualize it -

check_plot <- function(x) {
  ggplot(aes(x = x, fill = long.landing), data = faa) +
    geom_histogram(position = 'dodge', aes(y = ..density..))
}

The probability of long landing increases with the increase in speed_ground

check_plot(faa$speed_ground)

Probability of long landing increases with increase in speed of air

check_plot(faa$speed_air)

Long landing isn’t affected by pitch of aircraft

check_plot(faa$pitch)

Long landing seem to be unaffected by height of aircraft

check_plot(faa$height)

I observed that speed of ground, aircraft and height are significant. Pitch is not significant like we observed in the previous table

full_model <- glm(long.landing ~ aircraft + 
                    no_pasg + speed_ground + height + 
                    pitch  + duration_corrected, family = "binomial",
                  data = faa)
summary(full_model)
## 
## Call:
## glm(formula = long.landing ~ aircraft + no_pasg + speed_ground + 
##     height + pitch + duration_corrected, family = "binomial", 
##     data = faa)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.16087  -0.00052   0.00000   0.00000   2.32238  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.215e+02  2.500e+01  -4.858 1.19e-06 ***
## aircraftboeing      5.192e+00  1.200e+00   4.328 1.51e-05 ***
## no_pasg             3.423e-03  5.461e-02   0.063 0.950023    
## speed_ground        1.033e+00  2.082e-01   4.960 7.03e-07 ***
## height              2.531e-01  7.253e-02   3.490 0.000483 ***
## pitch               1.484e+00  8.454e-01   1.755 0.079247 .  
## duration_corrected  5.287e-03  7.864e-03   0.672 0.501355    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 622.778  on 830  degrees of freedom
## Residual deviance:  52.746  on 824  degrees of freedom
## AIC: 66.746
## 
## Number of Fisher Scoring iterations: 12

Based on AIC model, results seem to be consistent with the table observed.

faa_clean <-  select(faa, -risky.landing , -speed_air)
model01 <- glm(long.landing ~ 1,data=faa_clean, family = "binomial")
model61 <- glm(long.landing ~ speed_ground + aircraft + height + no_pasg + duration_corrected + pitch,data=faa_clean, family = "binomial")
model_1 <- MASS::stepAIC(model01,direction="forward",scope=list(upper=model61,lower=model01))
## Start:  AIC=624.78
## long.landing ~ 1
## 
##                      Df Deviance    AIC
## + speed_ground        1   115.47 119.47
## + aircraft            1   606.55 610.55
## + pitch               1   618.79 622.79
## <none>                    622.78 624.78
## + height              1   622.13 626.13
## + no_pasg             1   622.51 626.51
## + duration_corrected  1   622.54 626.54
## 
## Step:  AIC=119.47
## long.landing ~ speed_ground
## 
##                      Df Deviance     AIC
## + aircraft            1   84.665  90.665
## + height              1  100.459 106.459
## + pitch               1  105.527 111.527
## <none>                   115.470 119.470
## + duration_corrected  1  115.378 121.378
## + no_pasg             1  115.468 121.468
## 
## Step:  AIC=90.66
## long.landing ~ speed_ground + aircraft
## 
##                      Df Deviance    AIC
## + height              1   57.047 65.047
## + pitch               1   81.309 89.309
## <none>                    84.665 90.665
## + duration_corrected  1   83.164 91.164
## + no_pasg             1   84.219 92.219
## 
## Step:  AIC=65.05
## long.landing ~ speed_ground + aircraft + height
## 
##                      Df Deviance    AIC
## + pitch               1   53.204 63.204
## <none>                    57.047 65.047
## + duration_corrected  1   56.288 66.288
## + no_pasg             1   57.031 67.031
## 
## Step:  AIC=63.2
## long.landing ~ speed_ground + aircraft + height + pitch
## 
##                      Df Deviance    AIC
## <none>                    53.204 63.204
## + duration_corrected  1   52.750 64.750
## + no_pasg             1   53.204 65.204
summary(model_1)
## 
## Call:
## glm(formula = long.landing ~ speed_ground + aircraft + height + 
##     pitch, family = "binomial", data = faa_clean)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.20284  -0.00054   0.00000   0.00000   2.35719  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -119.77598   24.41821  -4.905 9.33e-07 ***
## speed_ground      1.02266    0.20290   5.040 4.65e-07 ***
## aircraftboeing    5.13443    1.18091   4.348 1.37e-05 ***
## height            0.25795    0.06861   3.760  0.00017 ***
## pitch             1.53751    0.84109   1.828  0.06755 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 622.778  on 830  degrees of freedom
## Residual deviance:  53.204  on 826  degrees of freedom
## AIC: 63.204
## 
## Number of Fisher Scoring iterations: 12

We further use BIC measure to find “best subsets” model. Pitch is not chosen by this model. This may be because BIC penalizes strongly

model_2 <- MASS::stepAIC(model01,direction="forward",scope=list(upper=model61,lower=model01), k = log(nrow(faa_clean)))
## Start:  AIC=629.5
## long.landing ~ 1
## 
##                      Df Deviance    AIC
## + speed_ground        1   115.47 128.92
## + aircraft            1   606.55 620.00
## <none>                    622.78 629.50
## + pitch               1   618.79 632.24
## + height              1   622.13 635.58
## + no_pasg             1   622.51 635.96
## + duration_corrected  1   622.54 635.98
## 
## Step:  AIC=128.92
## long.landing ~ speed_ground
## 
##                      Df Deviance    AIC
## + aircraft            1   84.665 104.83
## + height              1  100.459 120.63
## + pitch               1  105.527 125.69
## <none>                   115.470 128.92
## + duration_corrected  1  115.378 135.54
## + no_pasg             1  115.468 135.64
## 
## Step:  AIC=104.83
## long.landing ~ speed_ground + aircraft
## 
##                      Df Deviance     AIC
## + height              1   57.047  83.937
## <none>                    84.665 104.832
## + pitch               1   81.309 108.200
## + duration_corrected  1   83.164 110.054
## + no_pasg             1   84.219 111.110
## 
## Step:  AIC=83.94
## long.landing ~ speed_ground + aircraft + height
## 
##                      Df Deviance    AIC
## <none>                    57.047 83.937
## + pitch               1   53.204 86.817
## + duration_corrected  1   56.288 89.901
## + no_pasg             1   57.031 90.644
summary(model_2)
## 
## Call:
## glm(formula = long.landing ~ speed_ground + aircraft + height, 
##     family = "binomial", data = faa_clean)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.43442  -0.00117   0.00000   0.00000   2.57435  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -102.95437   19.22882  -5.354 8.59e-08 ***
## speed_ground      0.92657    0.17242   5.374 7.70e-08 ***
## aircraftboeing    5.04813    1.11520   4.527 5.99e-06 ***
## height            0.23106    0.05959   3.877 0.000106 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 622.778  on 830  degrees of freedom
## Residual deviance:  57.047  on 827  degrees of freedom
## AIC: 65.047
## 
## Number of Fisher Scoring iterations: 11

The variables and their contribution in prediction of long landing is -

table_2
##      variable_names coefficients odds_ratio p_value slope_direction
## 1      Ground Speed        0.472      1.603   0.000        Positive
## 2         Air Speed        0.512      1.669   0.000        Positive
## 3   Aircraft-Boeing        0.864      2.373   0.000        Positive
## 4   Aircraft-Airbus       -2.428      0.088   0.000        Negative
## 5             Pitch        0.401      1.493   0.047        Positive
## 6            Height        0.009      1.009   0.422        Positive
## 7 No. of Passengers       -0.007      0.993   0.606        Negative
## 8          Duration       -0.001      0.999   0.626        Negative

The various plots that helped us understand the relationship between the variables better-

The probability of long landing increases with the increase in speed_ground

check_plot(faa$speed_ground)

Probability of long landing increases with increase in speed of air

check_plot(faa$speed_air)

Long landing isn’t affected by pitch of aircraft

check_plot(faa$pitch)

Long landing seem to be unaffected by height of aircraft

check_plot(faa$height)

Based on our analysis, our final model is: long.landing ~ speed_ground + aircraft + height

Risky Landing

Repeating all the steps for risky landing—

faa %>% 
  ggplot(aes(risky.landing)) + 
  geom_bar()

only 7% aircrafts have long landing

round(prop.table(table(faa$risky.landing)),2)
## 
##    0    1 
## 0.93 0.07

The speed of ground, air speed and aircraft make seems to be likely to affect the risky landing.

##      variable_names coefficients odds_ratio p_value slope_direction
## 1      Ground Speed        0.614      1.848   0.000        Positive
## 2         Air Speed        0.870      2.387   0.000        Positive
## 3   Aircraft-Boeing        1.002      2.724   0.000        Positive
## 4   Aircraft-Airbus       -3.108      0.045   0.000        Negative
## 5             Pitch        0.371      1.449   0.143        Positive
## 6 No. of Passengers       -0.025      0.975   0.154        Negative
## 7          Duration       -0.001      0.999   0.674        Negative
## 8            Height       -0.002      0.998   0.871        Negative

we see speed_ground, speed_air, aircraft type, pitch and height appear to be positively correlated to long_landing.

Let’s visualize it -

check_plot_risky <- function(x) {
  ggplot(aes(x = x, fill = risky.landing), data = faa) +
    geom_histogram(position = 'dodge', aes(y = ..density..))
}

The probability of riksy landing increases with the increase in speed_ground

check_plot_risky(faa$speed_ground)

Probability of risky landing increases with increase in speed of air

check_plot_risky(faa$speed_air)

Risky landing isn’t affected by pitch of aircraft

check_plot_risky(faa$pitch)

Risky landing seem to be unaffected by height of aircraft

check_plot_risky(faa$height)

I observed that speed of ground and make of aircraft are significant. Results seem to be consistent with our observations before

full_model <- glm(risky.landing ~ aircraft + 
                    no_pasg + speed_ground + height + 
                    pitch  + duration_corrected, family = "binomial",
                  data = faa)
summary(full_model)
## 
## Call:
## glm(formula = risky.landing ~ aircraft + no_pasg + speed_ground + 
##     height + pitch + duration_corrected, family = "binomial", 
##     data = faa)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.46375  -0.00009   0.00000   0.00000   1.85765  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.035e+02  2.801e+01  -3.697 0.000218 ***
## aircraftboeing      4.457e+00  1.547e+00   2.881 0.003970 ** 
## no_pasg            -8.620e-02  6.035e-02  -1.428 0.153201    
## speed_ground        9.488e-01  2.465e-01   3.848 0.000119 ***
## height              4.310e-02  4.624e-02   0.932 0.351382    
## pitch               6.139e-01  7.982e-01   0.769 0.441878    
## duration_corrected  7.952e-04  1.224e-02   0.065 0.948209    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 436.043  on 830  degrees of freedom
## Residual deviance:  36.473  on 824  degrees of freedom
## AIC: 50.473
## 
## Number of Fisher Scoring iterations: 12

Results seem to be inconsistent with our observation before. Based on AIC measure, number of passengers seems to be a significant variable whereas if we look at the p_value, it doesn’t support our hypothesis.

faa_clean2 <-  select(faa, -long.landing , -speed_air)
model_null2 <- glm(risky.landing ~ 1,data=faa_clean2, family = "binomial")
model_full2 <- glm(risky.landing ~ speed_ground + aircraft + height + no_pasg + duration_corrected + pitch,data=faa_clean2, family = "binomial")
model_3 <- MASS::stepAIC(model_null2,direction="forward",scope=list(upper=model_full2,lower=model_null2))
## Start:  AIC=438.04
## risky.landing ~ 1
## 
##                      Df Deviance    AIC
## + speed_ground        1    58.93  62.93
## + aircraft            1   422.74 426.74
## + pitch               1   433.89 437.89
## + no_pasg             1   434.00 438.00
## <none>                    436.04 438.04
## + duration_corrected  1   435.86 439.86
## + height              1   436.02 440.02
## 
## Step:  AIC=62.93
## risky.landing ~ speed_ground
## 
##                      Df Deviance    AIC
## + aircraft            1   40.097 46.097
## + pitch               1   53.079 59.079
## <none>                    58.931 62.931
## + no_pasg             1   58.318 64.318
## + height              1   58.667 64.667
## + duration_corrected  1   58.883 64.883
## 
## Step:  AIC=46.1
## risky.landing ~ speed_ground + aircraft
## 
##                      Df Deviance    AIC
## + no_pasg             1   37.707 45.707
## <none>                    40.097 46.097
## + height              1   39.402 47.402
## + duration_corrected  1   39.884 47.884
## + pitch               1   39.928 47.928
## 
## Step:  AIC=45.71
## risky.landing ~ speed_ground + aircraft + no_pasg
## 
##                      Df Deviance    AIC
## <none>                    37.707 45.707
## + height              1   37.099 47.099
## + pitch               1   37.449 47.449
## + duration_corrected  1   37.693 47.693
summary(model_3)
## 
## Call:
## glm(formula = risky.landing ~ speed_ground + aircraft + no_pasg, 
##     family = "binomial", data = faa_clean2)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.33913  -0.00009   0.00000   0.00000   1.87810  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -99.90780   25.57993  -3.906 9.39e-05 ***
## speed_ground     0.94963    0.23559   4.031 5.56e-05 ***
## aircraftboeing   4.64188    1.47520   3.147  0.00165 ** 
## no_pasg         -0.08462    0.05732  -1.476  0.13987    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 436.043  on 830  degrees of freedom
## Residual deviance:  37.707  on 827  degrees of freedom
## AIC: 45.707
## 
## Number of Fisher Scoring iterations: 12

Number of passengers, which seemed to be significant when considering AIC as the method for variable selection seem to be not significant when considering BIC. This may be because BIC penalizes strongly

model_4 <- MASS::stepAIC(model_null2,direction="forward",scope=list(upper=model_full2,lower=model_null2), k = log(nrow(faa_clean2)))
## Start:  AIC=442.77
## risky.landing ~ 1
## 
##                      Df Deviance    AIC
## + speed_ground        1    58.93  72.38
## + aircraft            1   422.74 436.18
## <none>                    436.04 442.77
## + pitch               1   433.89 447.34
## + no_pasg             1   434.00 447.45
## + duration_corrected  1   435.86 449.31
## + height              1   436.02 449.46
## 
## Step:  AIC=72.38
## risky.landing ~ speed_ground
## 
##                      Df Deviance    AIC
## + aircraft            1   40.097 60.264
## <none>                    58.931 72.376
## + pitch               1   53.079 73.247
## + no_pasg             1   58.318 78.486
## + height              1   58.667 78.835
## + duration_corrected  1   58.883 79.051
## 
## Step:  AIC=60.26
## risky.landing ~ speed_ground + aircraft
## 
##                      Df Deviance    AIC
## <none>                    40.097 60.264
## + no_pasg             1   37.707 64.597
## + height              1   39.402 66.292
## + duration_corrected  1   39.884 66.775
## + pitch               1   39.928 66.819
summary(model_4)
## 
## Call:
## glm(formula = risky.landing ~ speed_ground + aircraft, family = "binomial", 
##     data = faa_clean2)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.24398  -0.00011   0.00000   0.00000   1.61021  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -102.0772    24.7751  -4.120 3.79e-05 ***
## speed_ground      0.9263     0.2248   4.121 3.78e-05 ***
## aircraftboeing    4.0190     1.2494   3.217   0.0013 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 436.043  on 830  degrees of freedom
## Residual deviance:  40.097  on 828  degrees of freedom
## AIC: 46.097
## 
## Number of Fisher Scoring iterations: 12

Conclusions for both the type of landings -

  1. Speed of ground, height and type of aircraft are significant predictors of long landing.

  2. Only speed of ground and type of aircraft are significant predictors of risky landing and height of aircraft seems to be not playing a role when landing is risky.

  3. BIC for long landing is 65.047 and for risky landing is 46.097.

  4. AIC for long landing is 63.204 and for long landing is 45.707

Model Assessment

ROC Curves:

Plot for risky landing is marginally smoother than that for long landing.

data1 <- select(faa, -risky.landing, -speed_air)
data2 <- select(faa, -long.landing, -speed_air)

pred1 <- prediction(predict(model_2), data1$long.landing)
roc1 <- performance(pred1, "tpr", "fpr", main = "ROC for long landing and risky landing")
plot(roc1)

pred2 <- prediction(predict(model_4), data2$risky.landing)
roc2 <- performance(pred2, "tpr", "fpr")
plot(roc2, add = TRUE, colorize = TRUE, main = "ROC for long landing and risky landing")

The AUC in case of long landing is 99.6% and that for risky landing is 99.9%.

#long landing
auc_ROCR1 <- performance(pred1, measure = "auc")
 auc_ROCR1@y.values[[1]]
## [1] 0.998333
 #risky landing 
auc_ROCR2 <- performance(pred2, measure = "auc")
auc_ROCR2@y.values[[1]]  
## [1] 0.9986161

Prediction:

Given few parameters,

  • Aircraft = Boeing
  • Duration = 200
  • no_pasg = 80
  • speed_ground = 115
  • speed_air = 120
  • height = 40
  • pitch = 4

The long landing probability -

new.ind <- data.frame(aircraft = "boeing", 
                      duration_corrected = 200, 
                      no_pasg = 80,
                      speed_ground = 115, 
                      speed_air = 120,
                      height = 40,
                      pitch = 4)

pred1 <- predict(model_2,newdata=new.ind, type = "link", se = T)

fit <- ilogit(pred1$fit)
upper <- ilogit(pred1$fit + (1.96 * pred1$se.fit))
lower <- ilogit(pred1$fit - (1.96 * pred1$se.fit))

cat("The confidence interval for long landing-",lower,"||", fit, "||", upper)
## The confidence interval for long landing- 0.999985 || 1 || 1

The risky landing probability -

new.ind <- data.frame(aircraft = "boeing", 
                      duration_corrected = 200, 
                      no_pasg = 80,
                      speed_ground = 115, 
                      speed_air = 120,
                      height = 40,
                      pitch = 4)

pred1 <- predict(model_4,newdata=new.ind,type = "link", se = T)

fit <- ilogit(pred1$fit)
upper <- ilogit(pred1$fit + (1.96 * pred1$se.fit))
lower <- ilogit(pred1$fit - (1.96 * pred1$se.fit))

cat("The confidence interval for risky landing-",lower,"||", fit, "||", upper)
## The confidence interval for risky landing- 0.9874843 || 0.999789 || 0.9999965

Multinomial Regression

We will create a mulitnomial variabe for distance.

faa1 <- faa %>% 
  mutate(Y = (ifelse(distance < 1000, 1, 
                     ifelse( distance >= 1000 & distance < 2000, 2, 3)) ))
faa1$distance <- NULL

Now, we will use multinomial model to fit Y. We treat the new variable Y as categorical under the assumption that the levels of Y have no natural ordering.

faa1$Y <- as.factor(faa1$Y)
faa1 <-  select(faa1, -speed_air ) %>% 
  na.omit()
mmod <- multinom(Y ~ aircraft + duration +
                   no_pasg + speed_ground + pitch + height , faa1)
## # weights:  24 (14 variable)
## initial  value 858.016197 
## iter  10 value 581.199724
## iter  20 value 236.239538
## iter  30 value 226.272595
## iter  40 value 226.221087
## iter  50 value 226.174784
## final  value 226.164124 
## converged

Based on AIC, we get the model as -

mmodi <- step(mmod)
## Start:  AIC=480.33
## Y ~ aircraft + duration + no_pasg + speed_ground + pitch + height
## 
## trying - aircraft 
## # weights:  21 (12 variable)
## initial  value 858.016197 
## iter  10 value 599.571323
## iter  20 value 336.226736
## iter  30 value 334.049994
## iter  40 value 334.045290
## final  value 334.043139 
## converged
## trying - duration 
## # weights:  21 (12 variable)
## initial  value 858.016197 
## iter  10 value 518.225762
## iter  20 value 239.417331
## iter  30 value 228.258109
## iter  40 value 227.608269
## final  value 227.113768 
## converged
## trying - no_pasg 
## # weights:  21 (12 variable)
## initial  value 858.016197 
## iter  10 value 602.083949
## iter  20 value 240.843149
## iter  30 value 229.631112
## iter  40 value 228.963807
## final  value 228.210500 
## converged
## trying - speed_ground 
## # weights:  21 (12 variable)
## initial  value 858.016197 
## iter  10 value 804.987828
## final  value 794.030499 
## converged
## trying - pitch 
## # weights:  21 (12 variable)
## initial  value 858.016197 
## iter  10 value 582.954258
## iter  20 value 237.697711
## iter  30 value 228.361531
## iter  40 value 227.746783
## final  value 227.327445 
## converged
## trying - height 
## # weights:  21 (12 variable)
## initial  value 858.016197 
## iter  10 value 535.918721
## iter  20 value 302.074403
## iter  30 value 298.921547
## iter  40 value 298.913287
## final  value 298.909431 
## converged
##                Df       AIC
## - duration     12  478.2275
## - pitch        12  478.6549
## <none>         14  480.3282
## - no_pasg      12  480.4210
## - height       12  621.8189
## - aircraft     12  692.0863
## - speed_ground 12 1612.0610
## # weights:  21 (12 variable)
## initial  value 858.016197 
## iter  10 value 518.225762
## iter  20 value 239.417331
## iter  30 value 228.258109
## iter  40 value 227.608269
## final  value 227.113768 
## converged
## 
## Step:  AIC=478.23
## Y ~ aircraft + no_pasg + speed_ground + pitch + height
## 
## trying - aircraft 
## # weights:  18 (10 variable)
## initial  value 858.016197 
## iter  10 value 465.319936
## iter  20 value 335.496129
## iter  30 value 335.298663
## final  value 335.293692 
## converged
## trying - no_pasg 
## # weights:  18 (10 variable)
## initial  value 858.016197 
## iter  10 value 444.634863
## iter  20 value 238.604117
## iter  30 value 230.208561
## iter  40 value 229.083783
## iter  40 value 229.083781
## iter  40 value 229.083781
## final  value 229.083781 
## converged
## trying - speed_ground 
## # weights:  18 (10 variable)
## initial  value 858.016197 
## iter  10 value 799.384355
## final  value 796.687299 
## converged
## trying - pitch 
## # weights:  18 (10 variable)
## initial  value 858.016197 
## iter  10 value 488.119245
## iter  20 value 236.788749
## iter  30 value 228.715466
## iter  40 value 228.221740
## final  value 228.220496 
## converged
## trying - height 
## # weights:  18 (10 variable)
## initial  value 858.016197 
## iter  10 value 454.961171
## iter  20 value 300.709209
## iter  30 value 300.281087
## final  value 300.277702 
## converged
##                Df       AIC
## - pitch        10  476.4410
## - no_pasg      10  478.1676
## <none>         12  478.2275
## - height       10  620.5554
## - aircraft     10  690.5874
## - speed_ground 10 1613.3746
## # weights:  18 (10 variable)
## initial  value 858.016197 
## iter  10 value 488.119245
## iter  20 value 236.788749
## iter  30 value 228.715466
## iter  40 value 228.221740
## final  value 228.220496 
## converged
## 
## Step:  AIC=476.44
## Y ~ aircraft + no_pasg + speed_ground + height
## 
## trying - aircraft 
## # weights:  15 (8 variable)
## initial  value 858.016197 
## iter  10 value 394.729448
## iter  20 value 347.225190
## iter  30 value 347.118971
## final  value 347.118955 
## converged
## trying - no_pasg 
## # weights:  15 (8 variable)
## initial  value 858.016197 
## iter  10 value 354.469061
## iter  20 value 242.186395
## iter  30 value 231.398332
## final  value 230.124753 
## converged
## trying - speed_ground 
## # weights:  15 (8 variable)
## initial  value 858.016197 
## iter  10 value 797.491849
## final  value 797.489048 
## converged
## trying - height 
## # weights:  15 (8 variable)
## initial  value 858.016197 
## iter  10 value 363.948352
## iter  20 value 301.923434
## iter  30 value 301.017662
## final  value 301.004927 
## converged
##                Df       AIC
## - no_pasg       8  476.2495
## <none>         10  476.4410
## - height        8  618.0099
## - aircraft      8  710.2379
## - speed_ground  8 1610.9781
## # weights:  15 (8 variable)
## initial  value 858.016197 
## iter  10 value 354.469061
## iter  20 value 242.186395
## iter  30 value 231.398332
## final  value 230.124753 
## converged
## 
## Step:  AIC=476.25
## Y ~ aircraft + speed_ground + height
## 
## trying - aircraft 
## # weights:  12 (6 variable)
## initial  value 858.016197 
## iter  10 value 378.591341
## iter  20 value 351.025155
## final  value 350.780781 
## converged
## trying - speed_ground 
## # weights:  12 (6 variable)
## initial  value 858.016197 
## iter  10 value 797.741960
## final  value 797.741923 
## converged
## trying - height 
## # weights:  12 (6 variable)
## initial  value 858.016197 
## iter  10 value 345.462031
## iter  20 value 304.390327
## final  value 301.955008 
## converged
##                Df       AIC
## <none>          8  476.2495
## - height        6  615.9100
## - aircraft      6  713.5616
## - speed_ground  6 1607.4838
summary(mmodi)
## Call:
## multinom(formula = Y ~ aircraft + speed_ground + height, data = faa1)
## 
## Coefficients:
##   (Intercept) aircraftboeing speed_ground    height
## 2   -22.80493       3.863325    0.2393831 0.1516421
## 3  -103.34903       9.209321    1.0112370 0.3441978
## 
## Std. Errors:
##   (Intercept) aircraftboeing speed_ground     height
## 2  1.91262330      0.4056023  0.019916715 0.01813334
## 3  0.08472895      0.5723513  0.009967598 0.02808654
## 
## Residual Deviance: 460.2495 
## AIC: 476.2495

Now I want to see where the mean values lie for all the variables at different Y.

duration <- tapply(faa1$duration, faa1$Y, mean, na.rm=TRUE)
no_pasg <- tapply(faa1$no_pasg, faa1$Y, mean, na.rm=TRUE)
speed_ground <- tapply(faa1$speed_ground, faa1$Y, mean, na.rm=TRUE)
height <- tapply(faa1$height, faa1$Y, mean, na.rm=TRUE)
pitch <- tapply(faa1$pitch, faa1$Y, mean, na.rm=TRUE)

table <- round(data.frame(duration, no_pasg, speed_ground , height, pitch),3) %>% 
  t() %>% as.data.frame()

names(table) <- c('Y=1','Y=2','Y=3')
table$variable <- rownames(table)
rownames(table) <- NULL

table <- select(table, variable, everything())
table
##       variable     Y=1     Y=2     Y=3
## 1     duration 160.890 151.935 152.070
## 2      no_pasg  60.233  60.162  59.728
## 3 speed_ground  62.178  79.243 103.649
## 4       height  27.908  31.677  31.509
## 5        pitch   3.980   4.012   4.064
  1. A one-unit increase in speed_ground
  • increases the odds of Y=2 by 1.26 relative to Y=1

  • increases the odds of Y=3 by 2.75 relative to Y=1

  1. A one-unit increase in height
  • increases the odds of Y=2 by 1.16 relative to Y=1

  • increases the odds of Y=3 by 1.41 relative to Y=1

  1. When looking at Boeing aircraft as compared to Airbus,
  • odds of Y=2 increases by 47 relative to Y=1

  • odds of Y=3 increases by 9989 relative to Y=1

  1. Aircraft, speed of ground, and height are the final vairables that I obtained.

  2. There is a significant increase in speed of ground with distance.

ggplot(faa1 , aes(x = Y, y = speed_ground)) +
  geom_boxplot()

  1. There is a slight increase in the height of aircraft as distance increases.
ggplot(faa1 , aes(x = Y, y = height)) +
  geom_boxplot()