Flight Landing Distance Analysis

Introduction

Problem Statement:
The objective of this project is to study the factors that affect the landing distance of commercial flights and to build a linear regression model of landing distance that can be used to assess the risk of overrun.

Approach:

For the analysis, we have landing data (landing distance and other parameters) for 800 commercial flights from the data file “FAA1.xls”. Our aim is to find a suitable linear model that predicts landing distance by choosing appropriate predictors from the variables in the dataset. We analyze how each candidate predictor relates to and affects the response variable, and use that to select the relevant predictors for the model.

Data Exploration

Importing and exploring the data:

We used the following packages to arrive at our recommendations:

  • tidyverse : Used for data processing, data transformation and data visualization
  • readxl : Used for importing Excel data files
  • ggplot2 : Used for plots (loaded as part of tidyverse)
  • funModeling : Used for plotting the distributions of the numeric variables (plot_num)

There are 800 observations and 8 variables in our data file. Missing values already appear in the first few records, so we will check the whole dataset to see how many values are missing.

library(readxl)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(psych)
library(funModeling)


#Importing Dataset(Excel)
FAA1 <- read_excel("C:/Users/plash/Desktop/FAA1.xls") 

#Checking Structure
str(FAA1)

names(FAA1)

dim(FAA1)

Describing and Checking the dataset:

The speed_air column has 600 missing values, which is 75% of the data. We will also check for invalid values that violate the validation rules imposed on the dataset.

The summary confirms that speed_air has only 200 recorded values. More alarming, height has a negative minimum value, which is inadmissible. This confirms that the dataset contains abnormal records.

summary(FAA1)
##    aircraft            duration         no_pasg       speed_ground   
##  Length:800         Min.   : 14.76   Min.   :29.00   Min.   : 27.74  
##  Class :character   1st Qu.:119.49   1st Qu.:55.00   1st Qu.: 65.87  
##  Mode  :character   Median :153.95   Median :60.00   Median : 79.64  
##                     Mean   :154.01   Mean   :60.13   Mean   : 79.54  
##                     3rd Qu.:188.91   3rd Qu.:65.00   3rd Qu.: 92.33  
##                     Max.   :305.62   Max.   :87.00   Max.   :141.22  
##                                                                      
##    speed_air          height           pitch          distance      
##  Min.   : 90.00   Min.   :-3.546   Min.   :2.284   Min.   :  34.08  
##  1st Qu.: 96.16   1st Qu.:23.338   1st Qu.:3.658   1st Qu.: 900.95  
##  Median :100.99   Median :30.147   Median :4.020   Median :1267.44  
##  Mean   :103.83   Mean   :30.122   Mean   :4.018   Mean   :1544.52  
##  3rd Qu.:109.48   3rd Qu.:36.981   3rd Qu.:4.388   3rd Qu.:1960.44  
##  Max.   :141.72   Max.   :59.946   Max.   :5.927   Max.   :6533.05  
##  NA's   :600
colSums(is.na(FAA1))
##     aircraft     duration      no_pasg speed_ground    speed_air       height 
##            0            0            0            0          600            0 
##        pitch     distance 
##            0            0
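
The missing-value share quoted above can be computed directly rather than read off the summary. A minimal sketch on a small made-up data frame (the `toy` values are illustrative assumptions, not FAA1 records):

```r
# Toy data frame standing in for FAA1; the values are made up for illustration.
toy <- data.frame(
  speed_ground = c(80, 95, 60, 110),
  speed_air    = c(NA, 101, NA, NA)
)

# colMeans() on the logical NA matrix gives the fraction missing per column.
missing_share <- colMeans(is.na(toy)) * 100
print(round(missing_share, 1))
```

Applied to the real data, `colMeans(is.na(FAA1))` should report 0.75 for speed_air, given 600 missing values out of 800.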

Finding the abnormalities:

We add a quality column for each validated variable to flag the abnormal records.

Data/range validation is applied to the variable columns duration, speed_ground, speed_air, height and distance. Legend used:

  • Null – Missing values
  • V – Valid values
  • IV – Invalid values

#Checking invalid and outlier observations
##1. Duration Validation
FAA1 <- FAA1 %>% mutate(Dur_quality = case_when(is.na(duration) ~ "Null", duration < 40 ~ "IV", TRUE ~ "V"))

##2. Ground Speed Validation
FAA1 <- FAA1 %>% mutate(SpGr_quality = case_when(is.na(speed_ground) ~ "Null", speed_ground < 30 | speed_ground > 140 ~ "IV", TRUE ~ "V"))

##3. Air Speed Validation (same 30-140 range, applied to speed_air itself)
FAA1 <- FAA1 %>% mutate(SpAir_quality = case_when(is.na(speed_air) ~ "Null", speed_air < 30 | speed_air > 140 ~ "IV", TRUE ~ "V"))

##4. Height Validation
FAA1 <- FAA1 %>% mutate(Height_quality = case_when(is.na(height) ~ "Null", height < 6 ~ "IV", TRUE ~ "V"))

##5. Distance Validation
FAA1 <- FAA1 %>% mutate(Dis_quality = case_when(is.na(distance) ~ "Null", distance > 6000 ~ "IV", TRUE ~ "V"))
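
The five checks above share one pattern: missing values become "Null", out-of-range values become "IV", and everything else is "V". As a rough sketch, that pattern could be wrapped in a single helper. The function name `flag_quality` and the sample heights are hypothetical, and base R `ifelse()` is used here in place of `case_when()` only to keep the example dependency-free:

```r
# Hypothetical helper: label each value "Null", "IV" or "V" for a given valid range.
flag_quality <- function(x, lower = -Inf, upper = Inf) {
  ifelse(is.na(x), "Null",
         ifelse(x < lower | x > upper, "IV", "V"))
}

# Made-up heights; 6 is the lower limit used in the height check above.
height <- c(-3.5, 5.9, 6.0, 30.1, NA)
flags <- flag_quality(height, lower = 6)
print(flags)
```

A value exactly on the limit (6.0 here) is treated as valid, which matches the strict `<` comparison in the original checks.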

Inference –

There are 21 abnormal values in the given data, with height having the most invalid records. We will delete these abnormal records before proceeding. The 600 missing values in the speed_air column are a different case: removing all 600 records would discard most of the data and distort its quality. Instead, we replace the missing values with 0 so that summary statistics can be computed for the variable; these zeros are placeholders, not measured speeds.

# Count of abnormal records

table(FAA1$Dur_quality)
table(FAA1$SpGr_quality)
table(FAA1$SpAir_quality)
table(FAA1$Height_quality)
table(FAA1$Dis_quality)
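
A single record can fail more than one check, so the number of invalid values need not equal the number of rows that are dropped. That would be consistent with 21 invalid values but only 800 - 781 = 19 removed rows, although the tables above are the authoritative counts. A minimal sketch on made-up flags (the `flags` data frame is an assumption for illustration):

```r
# Toy quality flags; the real columns would be Dur_quality, Height_quality, etc.
flags <- data.frame(
  Dur_quality    = c("V", "IV", "V", "IV"),
  Height_quality = c("V", "IV", "IV", "V"),
  Dis_quality    = c("V", "V", "V", "V"),
  stringsAsFactors = FALSE
)

is_invalid <- flags == "IV"                        # logical matrix, one cell per flag
invalid_per_column <- colSums(is_invalid)          # invalid values in each column
rows_with_invalid <- sum(rowSums(is_invalid) > 0)  # rows that would be dropped

print(invalid_per_column)
print(rows_with_invalid)
```

In this toy case there are four invalid values but only three affected rows, because the second row fails two checks.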

Data Cleaning :
We are left with 781 observations after deleting the abnormal records. At this point there are 13 variables, because we added 5 data validation (quality) columns. We remove these extra columns as well, since the updated dataset now contains only valid and null records.
FAA is the final dataset, which we use for further descriptive study and modeling.

library(tidyr)
#Replacing missing speed_air values with 0
FAA1 <- FAA1 %>% mutate(speed_air = replace_na(speed_air, 0))

#Deleting the abnormalities
FAA <- FAA1[!(FAA1$Dur_quality == "IV" | FAA1$SpGr_quality == "IV" | FAA1$SpAir_quality == "IV"
            | FAA1$Height_quality == "IV" | FAA1$Dis_quality == "IV"), ]
dim(FAA)

Summary statistics of the final dataset and plotting the distributions :

After deleting the invalid records, the summary statistics of most variables change very little compared with the original dataset, so removing the abnormal records does not compromise the quality of the data. The exception is speed_air, whose statistics now reflect the zeros that replaced its missing values.

FAA <- select(FAA, -c(9:13)) 
dim(FAA)
## [1] 781   8
summary(FAA)
##    aircraft            duration         no_pasg       speed_ground   
##  Length:781         Min.   : 41.95   Min.   :29.00   Min.   : 33.57  
##  Class :character   1st Qu.:119.63   1st Qu.:55.00   1st Qu.: 66.19  
##  Mode  :character   Median :154.28   Median :60.00   Median : 79.79  
##                     Mean   :154.78   Mean   :60.08   Mean   : 79.64  
##                     3rd Qu.:189.66   3rd Qu.:65.00   3rd Qu.: 92.13  
##                     Max.   :305.62   Max.   :87.00   Max.   :132.78  
##    speed_air          height           pitch          distance      
##  Min.   :  0.00   Min.   : 6.228   Min.   :2.284   Min.   :  41.72  
##  1st Qu.:  0.00   1st Qu.:23.594   1st Qu.:3.653   1st Qu.: 919.05  
##  Median :  0.00   Median :30.217   Median :4.014   Median :1273.66  
##  Mean   : 25.84   Mean   :30.455   Mean   :4.014   Mean   :1541.20  
##  3rd Qu.:  0.00   3rd Qu.:36.988   3rd Qu.:4.382   3rd Qu.:1960.43  
##  Max.   :132.91   Max.   :59.946   Max.   :5.927   Max.   :5381.96
plot_num(FAA)

Descriptive Analysis

We have described the characteristics of the dataset and its variables. From the EDA, we observed that the variables are approximately normally distributed. Next, we analyze the relationship (correlation) between each variable and landing distance. Landing distance (distance) is our response variable and is always plotted on the y-axis.

Statistical analysis of the plots between different variables :

A positive linear relationship can be observed between landing distance and both speed_ground and speed_air. This suggests that speed_ground can be an important factor in landing distance. We explore this relationship further using correlation to decide whether to use it as a predictor in our model. Since speed_air has 600 missing values, it is not adequate as a predictor for our model.

#XY plots of distance against each of the other variables
par(mfrow = c(2, 3))
plot(FAA$distance ~ FAA$no_pasg)
plot(FAA$distance ~ FAA$speed_ground)
plot(FAA$distance ~ FAA$speed_air)
plot(FAA$distance ~ FAA$height)
plot(FAA$distance ~ FAA$pitch)
plot(FAA$distance ~ FAA$duration)

Correlation :

  • speed_ground and speed_air show a strong positive correlation with distance (0.87 and 0.84 respectively).
  • The other variables have very weak correlation with distance (at most about 0.10 in magnitude), so including all of them in our model would be of little use, as their contribution would be trivial.
  • As mentioned earlier, speed_air has recorded values for only 25% (200 of 800) of the original dataset.
  • Hence, we use speed_ground as the basis for our regression analysis; using speed_air would not be a wise choice.

#Computing correlation
#Keeping only the numeric columns
NFAA <- FAA[, sapply(FAA, is.numeric)]
cor(NFAA)
##                 duration      no_pasg speed_ground   speed_air      height
## duration      1.00000000 -0.036389581 -0.048970252 -0.04377771  0.01111923
## no_pasg      -0.03638958  1.000000000 -0.001489012 -0.01953852  0.03730883
## speed_ground -0.04897025 -0.001489012  1.000000000  0.75096986 -0.05167181
## speed_air    -0.04377771 -0.019538523  0.750969860  1.00000000 -0.01021914
## height        0.01111923  0.037308828 -0.051671805 -0.01021914  1.00000000
## pitch        -0.04675348 -0.014447586 -0.051670337  0.02898776  0.03473959
## distance     -0.05138252 -0.016853121  0.867711454  0.83576653  0.10372080
##                    pitch    distance
## duration     -0.04675348 -0.05138252
## no_pasg      -0.01444759 -0.01685312
## speed_ground -0.05167034  0.86771145
## speed_air     0.02898776  0.83576653
## height        0.03473959  0.10372080
## pitch         1.00000000  0.06868102
## distance      0.06868102  1.00000000
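
The speed_air correlations in this matrix are computed with the zeros that replaced its missing values. As a hedged illustration of how much that choice can matter, the sketch below compares a zero-filled correlation with one computed only on recorded pairs via `use = "pairwise.complete.obs"`. All data here are synthetic, and the rule that air speed is only recorded above 90 is an assumption made for the sketch:

```r
set.seed(1)  # synthetic data, not the FAA measurements
n <- 400
speed_ground <- rnorm(n, mean = 80, sd = 19)
speed_air    <- speed_ground + rnorm(n, sd = 1.5)
# Assumption for this sketch: air speed is only recorded above 90.
speed_air[speed_air < 90] <- NA

# Same treatment as in the report: missing values replaced by 0.
zero_filled <- ifelse(is.na(speed_air), 0, speed_air)

cor_zero_fill <- cor(speed_ground, zero_filled)
cor_pairwise  <- cor(speed_ground, speed_air, use = "pairwise.complete.obs")

print(c(zero_fill = cor_zero_fill, pairwise = cor_pairwise))
```

On the real data, `cor(NFAA, use = "pairwise.complete.obs")` on a copy of the dataset that keeps speed_air as NA would give the corresponding comparison.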

Statistical Modeling and Conclusion

From the previous analysis, we found that speed_ground is a pertinent predictor of our response variable, landing distance.

Case 1 – One with only speed_ground as predictor

The response variable, flight landing distance, can be predicted using the following regression equation:

distance (y) = 41.54 * speed_ground - 1766.76

fit<- lm(FAA$distance ~ FAA$speed_ground) 
fit
## 
## Call:
## lm(formula = FAA$distance ~ FAA$speed_ground)
## 
## Coefficients:
##      (Intercept)  FAA$speed_ground  
##         -1766.76             41.54
summary(fit)
## 
## Call:
## lm(formula = FAA$distance ~ FAA$speed_ground)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -911.35 -318.91  -76.71  217.15 1779.50 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1766.7573    69.7769  -25.32   <2e-16 ***
## FAA$speed_ground    41.5366     0.8525   48.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 449.9 on 779 degrees of freedom
## Multiple R-squared:  0.7529, Adjusted R-squared:  0.7526 
## F-statistic:  2374 on 1 and 779 DF,  p-value: < 2.2e-16
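
To make the Case 1 equation concrete, the fitted line can be evaluated at a few ground speeds. The coefficients below are copied from the output above; the three speeds are hypothetical values chosen for illustration:

```r
# Coefficients copied from the Case 1 output above.
intercept <- -1766.7573
slope     <- 41.5366

# Hypothetical ground speeds at which to evaluate the fitted line.
speed_ground <- c(60, 80, 100)
predicted_distance <- intercept + slope * speed_ground
print(round(predicted_distance, 1))
```

Each additional unit of ground speed adds about 41.5 units of predicted distance. The fitted line crosses zero near speed_ground = 42.5, so it should not be extrapolated to very low speeds.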

Case 2 – One with speed_ground and height as predictors

The response variable, flight landing distance, can be predicted using the following regression equation:

distance (y) = 41.90 * speed_ground + 13.83 * height - 2217.43

#2. speed_ground and height
fit1 <- lm(FAA$distance ~ FAA$speed_ground + FAA$height) 
fit1
## 
## Call:
## lm(formula = FAA$distance ~ FAA$speed_ground + FAA$height)
## 
## Coefficients:
##      (Intercept)  FAA$speed_ground        FAA$height  
##         -2217.43             41.90             13.83
summary(fit1)
## 
## Call:
## lm(formula = FAA$distance ~ FAA$speed_ground + FAA$height)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -756.57 -326.89  -55.22  177.93 1744.82 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -2217.4283    84.2159 -26.330   <2e-16 ***
## FAA$speed_ground    41.9050     0.8151  51.414   <2e-16 ***
## FAA$height          13.8345     1.5814   8.748   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 429.6 on 778 degrees of freedom
## Multiple R-squared:  0.7751, Adjusted R-squared:  0.7745 
## F-statistic:  1340 on 2 and 778 DF,  p-value: < 2.2e-16
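
To predict the landing distance of a new flight, it helps to fit the model with bare column names and a `data =` argument, because `predict()` matches the columns of `newdata` by name and cannot do so for terms written as `FAA$speed_ground`. The sketch below shows the pattern on synthetic data; the data frame `sim`, its generating coefficients and the `new_flight` values are all assumptions for illustration, not the FAA results:

```r
set.seed(42)  # synthetic data generated for illustration only
n <- 200
sim <- data.frame(
  speed_ground = runif(n, 40, 130),
  height       = runif(n, 10, 55)
)
# Assumed data-generating coefficients, loosely echoing the Case 2 fit.
sim$distance <- -2200 + 42 * sim$speed_ground + 14 * sim$height +
  rnorm(n, sd = 100)

# Bare column names plus `data =` let predict() match columns in newdata.
fit_sim <- lm(distance ~ speed_ground + height, data = sim)

new_flight <- data.frame(speed_ground = 100, height = 30)  # hypothetical flight
pred <- predict(fit_sim, newdata = new_flight, interval = "prediction")
print(pred)
```

The `interval = "prediction"` option returns a lower and upper bound alongside the point estimate, which is the more useful quantity when the concern is whether a landing might overrun.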

Conclusion

The adjusted R² increases from 0.7526 to 0.7745 when height is added as a predictor. Adjusted R² penalizes each additional predictor, so it increases only when the new predictor improves the fit by more than would be expected by chance.
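
The adjusted values can be checked from the reported R² with the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. A small sketch (the function name `adjusted_r2` is ours, introduced only for this check):

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p.
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 781  # rows in the cleaned dataset
adj_case1 <- adjusted_r2(0.7529, n, p = 1)
adj_case2 <- adjusted_r2(0.7751, n, p = 2)
print(round(c(case1 = adj_case1, case2 = adj_case2), 4))
```

This reproduces the 0.7526 and 0.7745 reported in the two model summaries to four decimals.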

Also, the Q-Q plot for the Case 2 model shows that the residuals are approximately normally distributed, which supports the adequacy of this model. Thus, Case 2, with speed_ground and height as predictors of flight landing distance, is the better fit. The linear model equation to predict the flight landing distance (y) is:

distance (y) = 41.90 * speed_ground + 13.83 * height - 2217.43

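#Standardized residuals, so that the reference lines at -2, 0, 2 and the 0-1 line are on the right scale
residuals <- rstandard(fit1)
par(mfrow = c(1, 2))
plot(FAA$speed_ground, residuals)
abline(h = c(-2, 0, 2), lty = 2)
qqnorm(residuals)
abline(0, 1)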
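
Reference lines at ±2 are meaningful for standardized residuals, since roughly 5% of them should fall outside that band when the errors are normal. The sketch below shows this check together with a Q-Q plot on synthetic data; the simulated model and its coefficients are assumptions for illustration, not the Case 2 fit:

```r
set.seed(7)  # synthetic data, not the FAA measurements
n <- 200
x <- runif(n, 40, 130)
y <- -1800 + 42 * x + rnorm(n, sd = 400)
fit_sim <- lm(y ~ x)

std_res <- rstandard(fit_sim)            # residuals scaled to unit variance
share_outside <- mean(abs(std_res) > 2)  # roughly 5% expected under normality
print(share_outside)

# Q-Q plot with a reference line through the first and third quartiles.
qqnorm(std_res)
qqline(std_res)
```

A share far above 5%, or points that bend away from the reference line in the tails, would suggest that the normality assumption deserves a closer look.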