FAA Analysis - Part 1

Problem Statement

Background: Flight landing.

Motivation: To reduce the risk of landing overrun.

Goal: To study which factors impact the landing distance of a commercial flight, and how, using linear regression.

Data

Data set: Landing data from 950 commercial flights

Variables:

Aircraft: The make of an aircraft (Boeing or Airbus).

Duration (in minutes): Flight duration between taking off and landing. The duration of a normal flight should always be greater than 40 minutes.

No_pasg: The number of passengers in a flight.

Speed_ground (in miles per hour): The ground speed of an aircraft when passing over the threshold of the runway. If its value is less than 30 MPH or greater than 140 MPH, the landing is considered abnormal.

Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of the runway. If its value is less than 30 MPH or greater than 140 MPH, the landing is considered abnormal.

Height (in meters): The height of an aircraft when it is passing over the threshold of the runway. The landing aircraft is required to be at least 6 meters high at the threshold of the runway.

Pitch (in degrees): Pitch angle of an aircraft when it is passing over the threshold of the runway.

Distance (in feet): The landing distance of an aircraft. More specifically, it refers to the distance between the threshold of the runway and the point where the aircraft can be fully stopped. The length of the airport runway is typically less than 6000 feet.

Initial exploration of the data

library(tidyverse)  #to visualize, transform, input, tidy and join data
library(haven)      #to input data from SAS
library(dplyr)      #data wrangling
library(stringr)    #string related functions
library(kableExtra) #to create HTML Table
library(DT)         #to preview the data sets
library(ggplot2)    #to visualize data

Step1: Loading data files

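# stringsAsFactors = TRUE keeps aircraft as a factor (as in the output below) on R >= 4.0
FAA1 <- as_tibble(read.csv("FAA1.csv", header = TRUE, stringsAsFactors = TRUE))
FAA2 <- as_tibble(read.csv("FAA2.csv", header = TRUE, stringsAsFactors = TRUE))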

head(FAA1)
## # A tibble: 6 x 8
##   aircraft duration no_pasg speed_ground speed_air height pitch distance
##   <fct>       <dbl>   <int>        <dbl>     <dbl>  <dbl> <dbl>    <dbl>
## 1 boeing       98.5      53        108.       109.   27.4  4.04    3370.
## 2 boeing      126.       69        102.       103.   27.8  4.12    2988.
## 3 boeing      112.       61         71.1       NA    18.6  4.43    1145.
## 4 boeing      197.       56         85.8       NA    30.7  3.88    1664.
## 5 boeing       90.1      70         59.9       NA    32.4  4.03    1050.
## 6 boeing      138.       55         75.0       NA    41.2  4.20    1627.
head(FAA2)
## # A tibble: 6 x 7
##   aircraft no_pasg speed_ground speed_air height pitch distance
##   <fct>      <int>        <dbl>     <dbl>  <dbl> <dbl>    <dbl>
## 1 boeing        53        108.       109.   27.4  4.04    3370.
## 2 boeing        69        102.       103.   27.8  4.12    2988.
## 3 boeing        61         71.1       NA    18.6  4.43    1145.
## 4 boeing        56         85.8       NA    30.7  3.88    1664.
## 5 boeing        70         59.9       NA    32.4  4.03    1050.
## 6 boeing        55         75.0       NA    41.2  4.20    1627.


Step2: Checking structure of data sets

str(FAA1)
## Classes 'tbl_df', 'tbl' and 'data.frame':    800 obs. of  8 variables:
##  $ aircraft    : Factor w/ 2 levels "airbus","boeing": 2 2 2 2 2 2 2 2 2 2 ...
##  $ duration    : num  98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : int  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...
str(FAA2)
## Classes 'tbl_df', 'tbl' and 'data.frame':    150 obs. of  7 variables:
##  $ aircraft    : Factor w/ 2 levels "airbus","boeing": 2 2 2 2 2 2 2 2 2 2 ...
##  $ no_pasg     : int  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...
  • FAA1: 800 obs. of 8 variables
  • FAA2: 150 obs. of 7 variables

We can see that the duration variable is present in the FAA1 data set but not in the FAA2 data set.
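
The column difference can also be checked programmatically rather than by reading the str() output. A minimal sketch, assuming two small made-up data frames (`a` and `b` are stand-ins for illustration, not the FAA files):

```r
# Toy stand-ins for FAA1 and FAA2 (hypothetical values, not the real data)
a <- data.frame(aircraft = "boeing", duration = 98.5, no_pasg = 53)
b <- data.frame(aircraft = "boeing", no_pasg = 53)

# Columns present in one data frame but not in the other
only_in_a <- setdiff(names(a), names(b))
only_in_b <- setdiff(names(b), names(a))

print(only_in_a)  # columns that would have to be added to b before rbind()
print(only_in_b)
```

Any name returned by the first setdiff() has to be added to the second data frame (here as NA) before the two can be stacked with rbind().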

Step3: Merging data sets and removing duplicates

# adding a column duration with all NA's to FAA2
FAA2$duration <- NA

# merging data sets
FAA_merged <- rbind(FAA1,FAA2)
dim(FAA_merged)
## [1] 950   8
# removing duplicates (column 2, duration, is ignored because it is NA for every FAA2 row)
duplicate_index <- duplicated(FAA_merged[,-2])
FAA_merged1 <- FAA_merged[!duplicate_index,]
dim(FAA_merged1)
## [1] 850   8

100 duplicate rows were removed from the combined data set.
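
Why duration has to be left out of the duplicate check can be seen on a small example. This is a sketch on made-up values (the `toy` data frame is hypothetical); it drops the column by name instead of by position:

```r
# Toy data (hypothetical values): row 3 repeats row 1 except for duration
toy <- data.frame(
  aircraft = c("boeing", "airbus", "boeing"),
  duration = c(98.5, 120.0, NA),
  distance = c(3370, 1500, 3370)
)

# Using all columns misses the repeat because the duration values differ (98.5 vs NA)
dup_all <- duplicated(toy)

# Dropping duration by name before the check flags row 3 as a duplicate
dup_no_duration <- duplicated(toy[, setdiff(names(toy), "duration")])

deduped <- toy[!dup_no_duration, ]
print(dup_all)
print(dup_no_duration)
print(nrow(deduped))
```

Because duplicated() marks the later occurrence, the copy that carries the known duration is the one that is kept.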

Step4: Structure and summary statistics of combined data set

str(FAA_merged1)
## Classes 'tbl_df', 'tbl' and 'data.frame':    850 obs. of  8 variables:
##  $ aircraft    : Factor w/ 2 levels "airbus","boeing": 2 2 2 2 2 2 2 2 2 2 ...
##  $ duration    : num  98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : int  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...
summary(FAA_merged1)
##    aircraft      duration         no_pasg      speed_ground   
##  airbus:450   Min.   : 14.76   Min.   :29.0   Min.   : 27.74  
##  boeing:400   1st Qu.:119.49   1st Qu.:55.0   1st Qu.: 65.90  
##               Median :153.95   Median :60.0   Median : 79.64  
##               Mean   :154.01   Mean   :60.1   Mean   : 79.45  
##               3rd Qu.:188.91   3rd Qu.:65.0   3rd Qu.: 92.06  
##               Max.   :305.62   Max.   :87.0   Max.   :141.22  
##               NA's   :50                                      
##    speed_air          height           pitch          distance      
##  Min.   : 90.00   Min.   :-3.546   Min.   :2.284   Min.   :  34.08  
##  1st Qu.: 96.25   1st Qu.:23.314   1st Qu.:3.642   1st Qu.: 883.79  
##  Median :101.15   Median :30.093   Median :4.008   Median :1258.09  
##  Mean   :103.80   Mean   :30.144   Mean   :4.009   Mean   :1526.02  
##  3rd Qu.:109.40   3rd Qu.:36.993   3rd Qu.:4.377   3rd Qu.:1936.95  
##  Max.   :141.72   Max.   :59.946   Max.   :5.927   Max.   :6533.05  
##  NA's   :642

FAA_merged1: After removing duplicates, we have 850 observations and 8 variables.

Step5: Summary

  • There are 2 types of aircraft: airbus and boeing. We may have to examine some of the variables separately for each type to see whether they differ.

  • Speed_air has 642 NA’s. We need to decide whether to keep speed_air in our analysis.

  • 100 duplicate rows were present after merging the 2 files.

  • Several variables contain abnormal values. For example, the minimum height is negative, the minimum duration is under 15 minutes (below the 40-minute threshold), some flights have a landing distance > 6000 feet, and speed_ground falls outside the specified normal limits in some cases.
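
Before filtering, it can help to count how many rows violate each rule. A minimal sketch on made-up values (the `toy` data frame and the names in `abnormal` are assumptions for illustration; the thresholds are the ones from the variable descriptions):

```r
# Toy data (hypothetical values) with one violation of each rule
toy <- data.frame(
  duration     = c(120, 14.8, NA, 150),
  speed_ground = c(80, 95, 141.2, 70),
  height       = c(30, -3.5, 25, 40),
  distance     = c(1200, 900, 6533, 2000)
)

# Number of rows violating each rule; na.rm = TRUE skips the missing duration
abnormal <- c(
  duration_le_40     = sum(toy$duration <= 40, na.rm = TRUE),
  speed_ground_range = sum(toy$speed_ground < 30 | toy$speed_ground > 140),
  height_lt_6        = sum(toy$height < 6),
  distance_ge_6000   = sum(toy$distance >= 6000)
)
print(abnormal)
```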

Step6: Removing abnormalities

FAA_merged2 <- FAA_merged1 %>%
  filter((is.na(duration) | duration > 40) &
           (speed_ground >= 30 & speed_ground <= 140) &
           (height >= 6) & (distance < 6000) &
           (is.na(speed_air) | (speed_air >= 30 & speed_air <= 140)))

dim(FAA_merged2)
## [1] 831   8
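
The `is.na(...) |` guards in the filter matter because a comparison with NA yields NA, and such rows are dropped. The sketch below shows this with base R's subset(), which treats NA conditions as FALSE in the same way dplyr's filter() drops them; the `toy` data frame is made up for illustration:

```r
# Toy data (hypothetical values): one normal row, one short flight, one missing duration
toy <- data.frame(duration = c(120, 14.8, NA), height = c(30, 20, 25))

# Without the guard, the row with missing duration is dropped as well
without_guard <- subset(toy, duration > 40)

# With the guard, the row with missing duration is kept
with_guard <- subset(toy, is.na(duration) | duration > 40)

print(nrow(without_guard))
print(nrow(with_guard))
```

Without the guard, every row that came from FAA2 (where duration is NA) would be removed from the analysis.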



Step7: Summary

  • After removing abnormal values, 831 observations of 8 variables remain (19 of the 850 rows were removed).



Step8: Checking Distribution of Variables

par(mfrow=c(2,4))
hist(FAA_merged2$duration, main="Distribution",xlab= "Duration", col= "darkmagenta")
hist(FAA_merged2$no_pasg, main="Distribution",xlab= "No_pasg", col= "darkmagenta")
hist(FAA_merged2$speed_ground, main="Distribution",xlab= "Speed_ground", col= "darkmagenta")
hist(FAA_merged2$speed_air, main="Distribution",xlab= "Speed_air", col= "darkmagenta")
hist(FAA_merged2$height, main="Distribution",xlab= "Height", col= "darkmagenta")
hist(FAA_merged2$pitch, main="Distribution",xlab= "Pitch", col= "darkmagenta")
hist(FAA_merged2$distance, main="Distribution",xlab= "Distance", col= "darkmagenta")

Step9: Summary

  • Speed_air has missing observations after cleaning.

  • The distributions of all variables except speed_air and distance appear approximately normal. Speed_air and distance are skewed to the right.
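
The visual impression of skewness from the histograms can be backed by a number. This is a sketch of a sample skewness statistic (third standardized moment) applied to simulated data; the helper `skewness()` is defined here for illustration and is not part of the original analysis:

```r
# Sample skewness (third standardized moment), ignoring missing values
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3 / 2)
}

set.seed(1)
symmetric_x <- rnorm(1000)  # simulated, roughly symmetric
right_skew  <- rexp(1000)   # simulated, long right tail

print(skewness(symmetric_x))  # close to 0 for a symmetric distribution
print(skewness(right_skew))   # clearly positive for a right-skewed one
```

Applied to columns such as FAA_merged2$distance, a clearly positive value would agree with the right skew seen in the histogram.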



Step 10, 11, and 12: Pairwise Correlation and Scatter Plots

GGally::ggpairs(data = FAA_merged2)

The scatter plots agree with the pairwise correlation values obtained (for linear relationships).
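
The correlations shown in the ggpairs panels can also be extracted as numbers. A minimal sketch on simulated data (all values below are made up); `use = "pairwise.complete.obs"` is one way to handle a column with many missing values, such as speed_air:

```r
set.seed(42)
n <- 200
speed_ground <- rnorm(n, mean = 80, sd = 18)
speed_air    <- speed_ground + rnorm(n, sd = 2)
speed_air[speed_ground < 90] <- NA            # mimic a column with many NAs
distance     <- 40 * speed_ground + rnorm(n, sd = 300)
toy <- data.frame(speed_ground, speed_air, distance)

# Correlation of every column with distance, using all available pairs
cor_with_distance <- cor(toy, use = "pairwise.complete.obs")[, "distance"]
print(round(cor_with_distance, 3))
```

Each correlation is then computed on the rows where both variables are present, so the speed_air entry is based on fewer observations than the others.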



Step 13: Regression of Y on each of X

variables <- c("aircraft", "duration","no_pasg" ,"speed_ground", "speed_air" ,"height" ,"pitch")   
coeff13 <- rep(NA,length(variables))
p_val13 <- rep(NA,length(variables))


fit_all13 <- data.frame(variables, coeff13,p_val13)

for (i in seq_along(variables)) {
  # fit each simple regression once and reuse its coefficient table
  coefs <- summary(lm(FAA_merged2$distance ~ FAA_merged2[[variables[i]]]))$coefficients
  fit_all13$coeff13[i] <- coefs[2, 1]  # slope estimate
  fit_all13$p_val13[i] <- coefs[2, 4]  # p-value of the slope
}

table2 <- fit_all13 %>% mutate(sign_coef13 = ifelse(coeff13>0,"+","-")) %>% select(-coeff13) %>% arrange(p_val13)
table2
##      variables       p_val13 sign_coef13
## 1 speed_ground 4.766371e-252           +
## 2    speed_air  2.500461e-97           +
## 3     aircraft  3.526194e-12           +
## 4       height  4.123860e-03           +
## 5        pitch  1.208124e-02           +
## 6     duration  1.514002e-01           -
## 7      no_pasg  6.092520e-01           -
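
The same table can be produced without an explicit loop. This is a sketch, assuming a small helper `simple_fit()` (a name introduced here for illustration) and simulated data in place of FAA_merged2:

```r
# Fit response ~ <one predictor> and return the slope and p-value of that predictor
simple_fit <- function(var, data, response = "distance") {
  fit   <- lm(reformulate(var, response = response), data = data)
  coefs <- summary(fit)$coefficients
  c(coeff = coefs[2, "Estimate"], p_val = coefs[2, "Pr(>|t|)"])
}

# Simulated stand-in for the cleaned data (hypothetical values)
set.seed(7)
toy <- data.frame(speed_ground = rnorm(100, 80, 15), height = rnorm(100, 30, 10))
toy$distance <- 40 * toy$speed_ground + 10 * toy$height + rnorm(100, sd = 200)

# One row per predictor, columns coeff and p_val
results <- t(sapply(c("speed_ground", "height"), simple_fit, data = toy))
print(results)
```

Building the formula with reformulate() keeps the predictor name in the fitted model, which makes the summary output easier to read than `FAA_merged2[[variables[i]]]`.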



Step 14: Regression of Y on each of X (scaled variables)

scaled_FAA_merged2 <- as.data.frame(scale(FAA_merged2[,-1])) 
scaled_FAA_merged2$aircraft <- FAA_merged2$aircraft
scaled_FAA_merged2 <- scaled_FAA_merged2 %>% select(aircraft, everything()) 
summary(scaled_FAA_merged2)
##    aircraft      duration           no_pasg           speed_ground     
##  airbus:444   Min.   :-2.33354   Min.   :-4.145514   Min.   :-2.45353  
##  boeing:387   1st Qu.:-0.72687   1st Qu.:-0.674829   1st Qu.:-0.71220  
##               Median :-0.01016   Median :-0.007389   Median : 0.01341  
##               Mean   : 0.00000   Mean   : 0.000000   Mean   : 0.00000  
##               3rd Qu.: 0.72156   3rd Qu.: 0.660050   3rd Qu.: 0.65999  
##               Max.   : 3.11988   Max.   : 3.596784   Max.   : 2.84174  
##               NA's   :50                                               
##    speed_air           height             pitch             distance      
##  Min.   :-1.3847   Min.   :-2.47632   Min.   :-3.26772   Min.   :-1.6520  
##  1st Qu.:-0.7452   1st Qu.:-0.70804   1st Qu.:-0.69259   1st Qu.:-0.7020  
##  Median :-0.2430   Median :-0.02972   Median :-0.00783   Median :-0.2904  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.6029   3rd Qu.: 0.66905   3rd Qu.: 0.69323   3rd Qu.: 0.4620  
##  Max.   : 3.0223   Max.   : 3.01366   Max.   : 3.64933   Max.   : 4.3058  
##  NA's   :628
coeff14 <- rep(NA,length(variables))
p_val14 <- rep(NA,length(variables))


fit_all14 <- data.frame(variables, coeff14,p_val14)

for (i in seq_along(variables)) {
  # fit each simple regression once and reuse its coefficient table
  coefs <- summary(lm(scaled_FAA_merged2$distance ~ scaled_FAA_merged2[[variables[i]]]))$coefficients
  fit_all14$coeff14[i] <- coefs[2, 1]  # standardized slope estimate
  fit_all14$p_val14[i] <- coefs[2, 4]  # p-value of the slope
}

table3 <- fit_all14 %>% mutate(sign_coef14 = ifelse(coeff14>0,"+","-")) %>% select(-coeff14) %>% arrange(p_val14)
table3
##      variables       p_val14 sign_coef14
## 1 speed_ground 4.766371e-252           +
## 2    speed_air  2.500461e-97           +
## 3     aircraft  3.526194e-12           +
## 4       height  4.123860e-03           +
## 5        pitch  1.208124e-02           +
## 6     duration  1.514002e-01           -
## 7      no_pasg  6.092520e-01           -
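
The p-values in this table are identical to those from the unscaled regressions because standardizing is a linear transformation, which leaves the t-statistic of the slope unchanged. A minimal sketch on simulated data (values are made up) also shows that, in a simple regression, the standardized slope equals the correlation coefficient:

```r
set.seed(3)
x <- rnorm(150, 80, 15)
y <- 40 * x + rnorm(150, sd = 300)

# Standardize to mean 0 and sd 1; as.numeric() drops the matrix returned by scale()
xs <- as.numeric(scale(x))
ys <- as.numeric(scale(y))

raw_fit    <- summary(lm(y ~ x))$coefficients
scaled_fit <- summary(lm(ys ~ xs))$coefficients

# Same t-statistic and p-value before and after scaling
print(c(raw = raw_fit[2, "t value"], scaled = scaled_fit[2, "t value"]))
print(c(raw = raw_fit[2, "Pr(>|t|)"], scaled = scaled_fit[2, "Pr(>|t|)"]))

# Standardized slope vs. correlation
print(c(slope = scaled_fit[2, "Estimate"], correlation = cor(x, y)))
```

Scaling therefore changes the size of the coefficients, which makes them comparable across predictors, but not the ranking by p-value.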



Step 15: Comparing tables 1,2,3

The results are consistent across the three tables. Speed_ground, speed_air, and aircraft are the most important parameters.

Step 16: Checking Collinearity

fit_all2.1 <- lm(distance ~ speed_ground , data = FAA_merged2)
fit_all2.2 <- lm(distance ~ speed_air , data = FAA_merged2)
fit_all2.3 <- lm(distance ~ speed_ground + speed_air , data = FAA_merged2)

# coeff for fit_all2.1 and pval for fit_all2.1 
summary(fit_all2.1)$coefficients[,1]
##  (Intercept) speed_ground 
##  -1773.94071     41.44219
summary(fit_all2.1)$coefficients[,4]
##   (Intercept)  speed_ground 
## 2.172875e-110 4.766371e-252
# coeff for fit_all2.2 and pval for fit_all2.2
summary(fit_all2.2)$coefficients[,1]
## (Intercept)   speed_air 
##  -5455.7088     79.5321
summary(fit_all2.2)$coefficients[,4]
##  (Intercept)    speed_air 
## 5.817937e-67 2.500461e-97
# coeff for fit_all2.3 and pval for fit_all2.3
summary(fit_all2.3)$coefficients[,1]
##  (Intercept) speed_ground    speed_air 
##  -5462.28328    -14.37343     93.95880
summary(fit_all2.3)$coefficients[,4]
##  (Intercept) speed_ground    speed_air 
## 6.587877e-67 2.584769e-01 6.990846e-12
# Checking correlation between speed_air and speed_ground
GGally::ggpairs(data = FAA_merged2, columns =4:5 )

When we regress landing distance on speed_air and speed_ground individually, both turn out to be significant with positive coefficients. However, when we regress landing distance on them together, the sign of the speed_ground coefficient changes and its p-value (0.26) is larger than 0.05. This is most likely due to the very high correlation between the two variables (> 0.98). I would prefer to keep speed_ground in my data, because speed_air has a lot of missing values: it is missing in 628 of the 831 rows, leaving only 203 observations, whereas speed_ground is available for all 831 rows.
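
The effect of collinearity can be quantified with a variance inflation factor (VIF), computed as 1 / (1 - R²) from regressing one predictor on the other. This is a sketch on simulated data that mimics two nearly identical speed measurements; the numbers are made up and VIF is not part of the original analysis:

```r
set.seed(11)
n <- 200
speed_air    <- rnorm(n, 104, 10)
speed_ground <- speed_air + rnorm(n, sd = 1.5)   # nearly identical predictor
distance     <- 80 * speed_air + rnorm(n, sd = 300)

# VIF for speed_ground: 1 / (1 - R^2) from regressing it on the other predictor
r2  <- summary(lm(speed_ground ~ speed_air))$r.squared
vif <- 1 / (1 - r2)
print(cor(speed_ground, speed_air))
print(vif)

# Standard error of the speed_ground slope: alone vs. alongside speed_air
se_alone <- summary(lm(distance ~ speed_ground))$coefficients["speed_ground", "Std. Error"]
se_joint <- summary(lm(distance ~ speed_ground + speed_air))$coefficients["speed_ground", "Std. Error"]
print(c(alone = se_alone, joint = se_joint))
```

A large VIF means the standard error of the coefficient is inflated in the joint model, which is why a predictor that is highly significant on its own can lose significance or change sign.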

Step 17: Calculating and plotting R sq for each model

r.sq_17 <-c()

r.sq_17[1] <- summary(lm(distance ~ speed_ground , data = FAA_merged2))$r.squared     
r.sq_17[2] <- summary(lm(distance ~ speed_ground + aircraft  , data = FAA_merged2))$r.squared 
r.sq_17[3] <- summary(lm(distance ~ speed_ground + aircraft + height, data = FAA_merged2))$r.squared
r.sq_17[4] <- summary(lm(distance ~ speed_ground + aircraft + height + pitch , data = FAA_merged2))$r.squared
r.sq_17[5] <- summary(lm(distance ~ speed_ground + aircraft + height + pitch + duration , data = FAA_merged2))$r.squared
r.sq_17[6] <- summary(lm(distance ~ speed_ground + aircraft + height + pitch + duration + no_pasg, data = FAA_merged2))$r.squared

parameters <-c(1:6)

model_17 <- data.frame(r.sq_17,parameters)
model_17
##     r.sq_17 parameters
## 1 0.7503784          1
## 2 0.8251319          2
## 3 0.8488989          3
## 4 0.8493717          4
## 5 0.8504184          5
## 6 0.8506023          6
model_17 %>%  ggplot(aes(x=as.factor(parameters),y=r.sq_17)) + 
              geom_point(size=3) +
              xlab("Number of variables in model") +
              ylab("R sq Value") + 
              ggtitle("Effect of Addition of Variables in Linear Regression Model on R sq Values") + 
              theme(plot.title = element_text(hjust = 0.5))

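We can see that R sq increases with every added variable. However, after 3 variables the increase is very small (0.8489 to 0.8506). This suggests that the additional variables brought almost no new information to improve the model, while making it more complex. Note that models 5 and 6 include duration, which has 50 missing values, so they are fit on 781 rows instead of 831 and their values are not strictly comparable with those of the first four models.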
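
The six models differ only in how many predictors are included, so the table can also be built with a loop over nested formulas. A minimal sketch on simulated data (the variables `x1`, `x2`, `x3`, and `y` are placeholders, not columns of the FAA data):

```r
set.seed(5)
n <- 120
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 3 * toy$x1 + 1 * toy$x2 + rnorm(n)

predictors <- c("x1", "x2", "x3")  # order in which variables are added

# Fit y ~ x1, y ~ x1 + x2, ... and collect R^2 for each nested model
r_sq <- sapply(seq_along(predictors), function(k) {
  fit <- lm(reformulate(predictors[1:k], response = "y"), data = toy)
  summary(fit)$r.squared
})

print(data.frame(parameters = seq_along(predictors), r_sq))
```

When all models are fit to the same rows, R² can only stay the same or increase as predictors are added, which is why it cannot be used on its own to choose the model size.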

Step 18: Calculating and plotting Adj R sq for each model

adj.r.sq_18 <-c()

adj.r.sq_18[1] <- summary(lm(distance ~ speed_ground , data = FAA_merged2))$adj.r.squared     
adj.r.sq_18[2] <- summary(lm(distance ~ speed_ground + aircraft  , data = FAA_merged2))$adj.r.squared 
adj.r.sq_18[3] <- summary(lm(distance ~ speed_ground + aircraft + height, data = FAA_merged2))$adj.r.squared
adj.r.sq_18[4] <- summary(lm(distance ~ speed_ground + aircraft + height + pitch , data = FAA_merged2))$adj.r.squared
adj.r.sq_18[5] <- summary(lm(distance ~ speed_ground + aircraft + height + pitch + duration, data = FAA_merged2))$adj.r.squared
adj.r.sq_18[6] <- summary(lm(distance ~ speed_ground + aircraft + height + pitch + duration + no_pasg, data = FAA_merged2))$adj.r.squared

parameters <-c(1:6)

model_18 <- data.frame(adj.r.sq_18,parameters)

model_18
##   adj.r.sq_18 parameters
## 1   0.7500773          1
## 2   0.8247095          2
## 3   0.8483508          3
## 4   0.8486423          4
## 5   0.8494534          5
## 6   0.8494442          6
model_18 %>%  ggplot(aes(x=as.factor(parameters),y=adj.r.sq_18)) + 
  geom_point(size=3) +
  xlab("Number of variables in model") +
  ylab("Adj. R sq Value") + 
  ggtitle("Effect of Addition of Variables in Linear Regression Model on Adj R sq Values") + 
  theme(plot.title = element_text(hjust = 0.5))

We can see that the Adj R sq increases sharply up to the 3rd variable (height), changes very little after that, and decreases slightly when the 6th variable (no_pasg) is added. This suggests that the variables added after height brought almost no new information to improve the model. The Adj R sq parameter penalizes the model when this happens.
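
The penalty can be made explicit: adjusted R² is 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The sketch below checks this formula against summary() on simulated data (the data and the `noise` predictor are made up):

```r
set.seed(9)
n <- 100
toy <- data.frame(x1 = rnorm(n), noise = rnorm(n))
toy$y <- 2 * toy$x1 + rnorm(n)

fit <- lm(y ~ x1 + noise, data = toy)
s   <- summary(fit)

p <- length(coef(fit)) - 1  # number of predictors, excluding the intercept
adj_by_hand <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

print(c(by_hand = adj_by_hand, from_summary = s$adj.r.squared))
```

Because the factor (n - 1)/(n - p - 1) grows with p, a new variable raises adjusted R² only if it increases R² by more than the penalty.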

Step 19: Calculating and plotting AIC for each model

aic19 <- c()
aic19[1] <- AIC(lm(distance ~ speed_ground , data = FAA_merged2))
aic19[2] <- AIC(lm(distance ~ speed_ground + aircraft  , data = FAA_merged2))
aic19[3] <- AIC(lm(distance ~ speed_ground + aircraft + height, data = FAA_merged2))
aic19[4] <- AIC(lm(distance ~ speed_ground + aircraft + height + pitch, data = FAA_merged2))
aic19[5] <- AIC(lm(distance ~ speed_ground + aircraft + height + pitch + duration , data = FAA_merged2))
aic19[6] <- AIC(lm(distance ~ speed_ground + aircraft + height + pitch + duration + no_pasg, data = FAA_merged2))

aic19
## [1] 12508.81 12215.05 12095.65 12095.05 11378.84 11379.88
parameters <-c(1:6)
model_19 <- data.frame(aic19,parameters)
model_19 %>%  ggplot(aes(x=as.factor(parameters),y=aic19)) + 
  geom_point(size=3) +
  xlab("Number of variables in model") +
  ylab("AIC Value") + 
  ggtitle("Effect of Addition of Variables in Linear Regression Model on AIC Values") + 
  theme(plot.title = element_text(hjust = 0.5))

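AIC decreases sharply up to the model with 3 variables and is almost unchanged when pitch is added (12095.65 vs 12095.05). The apparent drop at model 5 is not a real improvement: duration has 50 missing values, so models 5 and 6 are fit on 781 rows rather than 831, and AIC values are only comparable between models fit to the same observations. Between those two models, AIC increases slightly when no_pasg is added. So, AIC suggests picking the first 3 variables.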
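
Because lm() silently drops rows with missing values, models that include a variable with NAs are fit on fewer rows, and their AIC is lower for that reason alone. A minimal sketch on simulated data (all values are made up) shows the effect and one way to avoid it, by restricting every model to the complete cases:

```r
set.seed(21)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 5 * toy$x1 + rnorm(n, sd = 4)
toy$x2[1:60] <- NA                        # x2 is pure noise with missing values

small_all <- lm(y ~ x1, data = toy)       # uses all 300 rows
big       <- lm(y ~ x1 + x2, data = toy)  # silently uses only 240 rows

cc       <- toy[complete.cases(toy), ]
small_cc <- lm(y ~ x1, data = cc)         # same 240 rows as `big`

print(c(small_all = nobs(small_all), big = nobs(big), small_cc = nobs(small_cc)))
print(c(small_all = AIC(small_all), big = AIC(big), small_cc = AIC(small_cc)))
```

Comparing `big` with `small_cc` is the like-for-like comparison; the gap between `big` and `small_all` mostly reflects the difference in sample size.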

Step 20: Which variables to include?

Based on steps 17-19, I would pick the following variables:

  • Speed Ground

  • Aircraft

  • Height



Step 21: Variable selection based on an automated algorithm

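# column 5 (speed_air) is dropped from the candidate predictors
fit_base <- lm(distance ~ 1, data=FAA_merged2[,-5])
fit_max <- lm(distance~.,data=FAA_merged2[,-5])
library(MASS)  # provides stepAIC(); note that MASS masks dplyr::select()
aic_for_fir <- stepAIC(fit_base, direction = 'forward', scope=list(upper=fit_max,lower=fit_base)) 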
## Start:  AIC=11299.8
## distance ~ 1
## 
##                Df Sum of Sq       RSS   AIC
## + speed_ground  1 480561689 157699570 10104
## + aircraft      1  33759132 604502127 11220
## + height        1   6866417 631394842 11256
## + pitch         1   3010731 635250529 11262
## + duration      1   1685114 636576145 11263
## <none>                      638261260 11263
## + no_pasg       1    181284 638079976 11265
## 
## Step:  AIC=10148.53
## distance ~ speed_ground
## 
##            Df Sum of Sq       RSS     AIC
## + aircraft  1  47102191 110597379  9810.8
## + height    1  14123617 143575953 10027.6
## + pitch     1   8246571 149453000 10061.0
## <none>                  157699570 10103.6
## + no_pasg   1    154554 157545016 10104.8
## + duration  1     50570 157649000 10105.4
## 
## Step:  AIC=9854.77
## distance ~ speed_ground + aircraft
## 
##            Df Sum of Sq       RSS    AIC
## + height    1  15048298  95549081 9691.2
## <none>                  110597379 9810.8
## + pitch     1    182007 110415372 9811.4
## + no_pasg   1     41575 110555804 9812.5
## + duration  1      9394 110587985 9812.7
## 
## Step:  AIC=9735.37
## distance ~ speed_ground + aircraft + height
## 
##            Df Sum of Sq      RSS    AIC
## <none>                  95549081 9691.2
## + no_pasg   1    120379 95428702 9692.2
## + pitch     1     71174 95477907 9692.6
## + duration  1      4446 95544635 9693.2
summary(aic_for_fir)
## 
## Call:
## lm(formula = distance ~ speed_ground + aircraft + height, data = FAA_merged2[, 
##     -5])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -711.95 -226.73  -90.17  130.04 1471.84 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2512.2433    68.1974  -36.84   <2e-16 ***
## speed_ground      42.4024     0.6483   65.41   <2e-16 ***
## aircraftboeing   496.0452    24.2975   20.41   <2e-16 ***
## height            14.1478     1.2405   11.40   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 349.1 on 827 degrees of freedom
## Multiple R-squared:  0.8489, Adjusted R-squared:  0.8484 
## F-statistic:  1549 on 3 and 827 DF,  p-value: < 2.2e-16

The variable selection algorithm gives a model with the same variables as the model selected manually in Step 20.
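
Forward selection can be tried on a small example to see how it behaves. This sketch uses step() from base R's stats package in place of MASS::stepAIC(), so that it runs without extra packages; the simulated data and the names `fit_null`, `fit_full`, and `selected` are assumptions for illustration:

```r
set.seed(13)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 4 * toy$x1 + 2 * toy$x2 + rnorm(n)  # x3 has no effect on y
toy <- toy[complete.cases(toy), ]            # keep the same rows for every candidate model

fit_null <- lm(y ~ 1, data = toy)
fit_full <- lm(y ~ ., data = toy)

# Forward selection by AIC between the intercept-only and the full model
selected <- step(fit_null, direction = "forward",
                 scope = list(lower = formula(fit_null), upper = formula(fit_full)),
                 trace = 0)
print(formula(selected))
```

Restricting the data to complete cases first keeps the number of rows constant across steps, so the AIC values of the candidate models stay comparable.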

Part II of the project uses logistic regression to study the factors that impact the landing distance of a commercial flight and to predict the conditions that would result in risky landings.