Problem Statement:
The objective of this project is to study the factors that impact the landing distance of commercial flights and to build a linear regression model to predict the risk of overrun.
Approach:
For the analysis, we have landing data (landing distance and other parameters) for 800 commercial flights from the data file “FAA1.xls”. Our aim is to find a suitable linear model to predict the flight landing distance by choosing appropriate predictors from the variables in the dataset. We analyze how each candidate predictor relates to and affects the response variable, and use that to select the relevant predictors for the model.
Importing and exploring the data:
There are 800 observations and 8 variables in our data file. The first few records already contain missing values, so we will check the dataset to see how widespread they are.
We used the following packages to arrive at our recommendations:
library(readxl)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(psych)
library(funModeling)
#Importing Dataset(Excel)
FAA1 <- read_excel("C:/Users/plash/Desktop/FAA1.xls")
#Checking Structure
str(FAA1)
names(FAA1)
dim(FAA1)
Describing and Checking the dataset:
The column speed_air has 600 missing values, which accounts for 75% of the 800 records. We will further check for invalid values that violate the validation rules imposed on the dataset.
The summary confirms that speed_air has only 200 recorded values. More alarming, height has a negative minimum value (-3.546), which is inadmissible. This confirms that our dataset contains abnormalities.
summary(FAA1)
##    aircraft            duration         no_pasg       speed_ground
##  Length:800         Min.   : 14.76   Min.   :29.00   Min.   : 27.74
##  Class :character   1st Qu.:119.49   1st Qu.:55.00   1st Qu.: 65.87
##  Mode  :character   Median :153.95   Median :60.00   Median : 79.64
##                     Mean   :154.01   Mean   :60.13   Mean   : 79.54
##                     3rd Qu.:188.91   3rd Qu.:65.00   3rd Qu.: 92.33
##                     Max.   :305.62   Max.   :87.00   Max.   :141.22
##
##    speed_air          height           pitch          distance
##  Min.   : 90.00   Min.   :-3.546   Min.   :2.284   Min.   :  34.08
##  1st Qu.: 96.16   1st Qu.:23.338   1st Qu.:3.658   1st Qu.: 900.95
##  Median :100.99   Median :30.147   Median :4.020   Median :1267.44
##  Mean   :103.83   Mean   :30.122   Mean   :4.018   Mean   :1544.52
##  3rd Qu.:109.48   3rd Qu.:36.981   3rd Qu.:4.388   3rd Qu.:1960.44
##  Max.   :141.72   Max.   :59.946   Max.   :5.927   Max.   :6533.05
##  NA's   :600
colSums(is.na(FAA1))
##     aircraft     duration      no_pasg speed_ground    speed_air       height
##            0            0            0            0          600            0
##        pitch     distance
##            0            0
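The missing-value counts can also be expressed as a percentage per column. The following is a minimal sketch on a small made-up data frame (`toy` and its values are assumptions for illustration, not the FAA data):

```r
# Synthetic stand-in for FAA1; the values are made up for illustration only.
toy <- data.frame(
  speed_ground = c(80.1, 95.3, 60.2, 101.7),
  speed_air    = c(NA, 96.2, NA, 103.5),
  height       = c(30.1, 25.4, 41.0, 28.8)
)

# is.na() gives a logical matrix; colMeans() turns it into the
# fraction of missing values per column, scaled here to percent.
missing_share <- round(100 * colMeans(is.na(toy)), 1)
print(missing_share)
```

Applied to the real data, the same expression would be `colMeans(is.na(FAA1))`.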
Finding the abnormalities:
We add a quality column for each validated variable to flag the abnormal records.
Data/range validation was applied to the columns duration, speed_ground, speed_air, height and distance. Legend used:
• Null – Missing Values
• V – Valid Values
• IV – Invalid Values
#Checking invalid and outlier observations
##1. Duration Validation
FAA1 <- FAA1 %>% mutate(Dur_quality = case_when(is.na(duration) ~ "Null",
                                                duration < 40 ~ "IV",
                                                TRUE ~ "V"))
##2. Ground Speed Validation
FAA1 <- FAA1 %>% mutate(SpGr_quality = case_when(is.na(speed_ground) ~ "Null",
                                                 speed_ground < 30 | speed_ground > 140 ~ "IV",
                                                 TRUE ~ "V"))
##3. Air Speed Validation
FAA1 <- FAA1 %>% mutate(SpAir_quality = case_when(is.na(speed_air) ~ "Null",
                                                  speed_air < 30 | speed_air > 140 ~ "IV",
                                                  TRUE ~ "V"))
##4. Height Validation
FAA1 <- FAA1 %>% mutate(Height_quality = case_when(is.na(height) ~ "Null",
                                                   height < 6 ~ "IV",
                                                   TRUE ~ "V"))
##5. Distance Validation
FAA1 <- FAA1 %>% mutate(Dis_quality = case_when(is.na(distance) ~ "Null",
                                                distance > 6000 ~ "IV",
                                                TRUE ~ "V"))
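The five checks above share one pattern: missing values become "Null", values outside an allowed range become "IV", and everything else is "V". As a sketch only, that pattern could be captured in a base-R helper; `flag_range` is a hypothetical name that is not part of the original analysis, and the sample durations are made up:

```r
# Hypothetical helper (not in the original analysis): label each value as
# "Null" (missing), "IV" (outside [lower, upper]) or "V" (valid).
flag_range <- function(x, lower = -Inf, upper = Inf) {
  out <- ifelse(x < lower | x > upper, "IV", "V")
  out[is.na(x)] <- "Null"
  out
}

# Made-up durations; 40 is the lower bound used in the duration check.
duration <- c(NA, 14.8, 153.9, 305.6)
dur_flags <- flag_range(duration, lower = 40)
print(dur_flags)
```

The ground speed rule would then read `flag_range(speed_ground, lower = 30, upper = 140)`, which keeps the thresholds visible in one place.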
Inference –
There are 21 abnormal values in the given data, with height having the most invalid records. We will delete these abnormal records before proceeding further. The 600 missing values in the speed_air column are a different matter: removing all 600 records would discard most of the data and distort its quality. Instead, we replace the missing values with 0 so that we can obtain summary statistics for the variable.
# Count of abnormal records
table(FAA1$SpAir_quality)
table(FAA1$Dis_quality)
table(FAA1$Dur_quality)
table(FAA1$Height_quality)
table(FAA1$SpGr_quality)
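The number of abnormal values need not equal the number of rows that end up being removed, because a single flight can fail more than one check. A small sketch on made-up flags (the `flags` data frame is an assumption for illustration) shows how both counts could be obtained in one pass:

```r
# Synthetic quality flags for six flights (made up for illustration).
flags <- data.frame(
  Dur_quality    = c("V", "IV", "V", "V", "V", "V"),
  Height_quality = c("V", "IV", "V", "IV", "V", "V"),
  Dis_quality    = c("V", "V", "V", "V", "V", "IV")
)

is_iv <- flags == "IV"                    # logical matrix, one cell per flag
total_flags  <- sum(is_iv)                # number of invalid values
flagged_rows <- sum(rowSums(is_iv) > 0)   # number of flights with any "IV"
print(c(total_flags = total_flags, flagged_rows = flagged_rows))
```

In this toy example the second flight fails two checks, so four invalid values correspond to only three flagged rows.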
Data Cleaning :
We are left with 781 observations after deleting the abnormalities. The dataset now has 13 variables because we added 5 data validation (quality) columns. We remove these extra columns as well, since the updated dataset contains only valid and null records.
FAA is our final dataset which we will use for further descriptive study and modeling.
library(tidyr)
FAA1<-FAA1 %>% mutate(speed_air = replace_na(speed_air, 0))
#Deleting the abnormalities
FAA<- FAA1[!(FAA1$Dur_quality=="IV" | FAA1$SpGr_quality=="IV" | FAA1$SpAir_quality=="IV"
| FAA1$Height_quality=="IV" | FAA1$Dis_quality=="IV"),]
dim(FAA)
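Zero-filling affects the summary statistics of speed_air, because the zeros are counted as if they were observed speeds. The sketch below, on made-up numbers rather than the FAA data, compares a zero-filled mean with the mean computed through base R's `na.rm = TRUE` option:

```r
# Made-up air speeds with missing readings (illustration only).
speed_air <- c(NA, 96, NA, 104, NA, 100)

mean_na_rm <- mean(speed_air, na.rm = TRUE)        # ignores missing values
zero_filled <- ifelse(is.na(speed_air), 0, speed_air)
mean_zero_fill <- mean(zero_filled)                # zeros pull the mean down

print(c(na_rm = mean_na_rm, zero_fill = mean_zero_fill))
```

Here the observed readings average 100, while the zero-filled vector averages 50.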
Summary statistics of the final dataset and plotting the distributions :
After deleting the invalid records, the statistical summaries of the dataset differ only slightly from those before the cleanup; the exception is speed_air, whose summary now reflects the zeros that replaced its missing values. Hence, we are not compromising the quality of the dataset.
FAA <- select(FAA, -c(9:13))
dim(FAA)
## [1] 781 8
summary(FAA)
##    aircraft            duration         no_pasg       speed_ground
##  Length:781         Min.   : 41.95   Min.   :29.00   Min.   : 33.57
##  Class :character   1st Qu.:119.63   1st Qu.:55.00   1st Qu.: 66.19
##  Mode  :character   Median :154.28   Median :60.00   Median : 79.79
##                     Mean   :154.78   Mean   :60.08   Mean   : 79.64
##                     3rd Qu.:189.66   3rd Qu.:65.00   3rd Qu.: 92.13
##                     Max.   :305.62   Max.   :87.00   Max.   :132.78
##    speed_air          height           pitch          distance
##  Min.   :  0.00   Min.   : 6.228   Min.   :2.284   Min.   :  41.72
##  1st Qu.:  0.00   1st Qu.:23.594   1st Qu.:3.653   1st Qu.: 919.05
##  Median :  0.00   Median :30.217   Median :4.014   Median :1273.66
##  Mean   : 25.84   Mean   :30.455   Mean   :4.014   Mean   :1541.20
##  3rd Qu.:  0.00   3rd Qu.:36.988   3rd Qu.:4.382   3rd Qu.:1960.43
##  Max.   :132.91   Max.   :59.946   Max.   :5.927   Max.   :5381.96
plot_num(FAA)
We have described the characteristics of the dataset and its variables. From the EDA, most variables appear approximately normally distributed, whereas distance is right-skewed (mean 1541.20 versus median 1273.66) and speed_air is dominated by the zeros that replaced its missing values. Next, we analyze the relationship (correlation) between each variable and landing distance. Landing distance (distance) is our response variable and is always plotted on the y-axis.
Statistical analysis of the plots between different variables :
A positive linear relationship with landing distance can be observed for both speed_ground and speed_air, which suggests that speed_ground is an important factor for landing distance. We explore this relationship further using correlation to decide whether to include it as a predictor in our model. Since speed_air has 600 missing values, it is not adequate as a predictor for our model.
#Statistical analysis of the XY plots between different variables with distance
par(mfrow = c(2, 3))
plot(FAA$distance ~ FAA$no_pasg)
plot(FAA$distance ~ FAA$speed_ground)
plot(FAA$distance ~ FAA$speed_air)
plot(FAA$distance ~ FAA$height)
plot(FAA$distance ~ FAA$pitch)
plot(FAA$distance ~ FAA$duration)
Correlation :
• speed_ground and speed_air show a strong positive correlation with distance (0.87 and 0.84, respectively).
• The other variables have very weak correlations with distance (at most about 0.10 in absolute value), so including all of them in our model would be of little use, as their contribution would be trivial.
• As mentioned earlier, speed_air has recorded values for only 25% (200 of 800) of the original records.
• Hence, we use speed_ground as the basis for our regression analysis; using speed_air would not be a wise choice.
#Keeping only the numeric columns
NFAA <- FAA[, sapply(FAA, is.numeric)]
#Computing correlation
cor(NFAA)
##                 duration      no_pasg speed_ground   speed_air      height
## duration      1.00000000 -0.036389581 -0.048970252 -0.04377771  0.01111923
## no_pasg      -0.03638958  1.000000000 -0.001489012 -0.01953852  0.03730883
## speed_ground -0.04897025 -0.001489012  1.000000000  0.75096986 -0.05167181
## speed_air    -0.04377771 -0.019538523  0.750969860  1.00000000 -0.01021914
## height        0.01111923  0.037308828 -0.051671805 -0.01021914  1.00000000
## pitch        -0.04675348 -0.014447586 -0.051670337  0.02898776  0.03473959
## distance     -0.05138252 -0.016853121  0.867711454  0.83576653  0.10372080
##                    pitch    distance
## duration     -0.04675348 -0.05138252
## no_pasg      -0.01444759 -0.01685312
## speed_ground -0.05167034  0.86771145
## speed_air     0.02898776  0.83576653
## height        0.03473959  0.10372080
## pitch         1.00000000  0.06868102
## distance      0.06868102  1.00000000
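Reading the distance column of a full correlation matrix by eye gets tedious as the number of variables grows. A minimal sketch, assuming synthetic data generated only for this illustration (the `toy` data frame and its coefficients are not the FAA measurements), ranks the predictors by the absolute value of their correlation with the response:

```r
set.seed(1)  # synthetic data, generated only to make the sketch runnable
n <- 200
toy <- data.frame(
  speed_ground = rnorm(n, 80, 18),
  height       = rnorm(n, 30, 10),
  no_pasg      = rnorm(n, 60, 7)
)
toy$distance <- 40 * toy$speed_ground + 14 * toy$height + rnorm(n, 0, 400)

# Correlation of every predictor with the response, strongest first.
cors <- cor(toy)[, "distance"]
cors <- cors[names(cors) != "distance"]
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 3))
```

With the real data, `cor(NFAA)[, "distance"]` would take the place of `cor(toy)[, "distance"]`.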
From the previous analysis, we found that speed_ground is a pertinent predictor of our response variable, landing distance.
Case 1 – Model with only speed_ground as predictor
The response variable, flight landing distance, can be predicted using the following regression equation:
distance (y) = 41.54 * speed_ground - 1766.76
fit<- lm(FAA$distance ~ FAA$speed_ground)
fit
##
## Call:
## lm(formula = FAA$distance ~ FAA$speed_ground)
##
## Coefficients:
## (Intercept) FAA$speed_ground
## -1766.76 41.54
summary(fit)
##
## Call:
## lm(formula = FAA$distance ~ FAA$speed_ground)
##
## Residuals:
## Min 1Q Median 3Q Max
## -911.35 -318.91 -76.71 217.15 1779.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1766.7573 69.7769 -25.32 <2e-16 ***
## FAA$speed_ground 41.5366 0.8525 48.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 449.9 on 779 degrees of freedom
## Multiple R-squared: 0.7529, Adjusted R-squared: 0.7526
## F-statistic: 2374 on 1 and 779 DF, p-value: < 2.2e-16
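As a worked example, the fitted Case 1 line can be evaluated by hand for a given ground speed. The sketch below copies the coefficients from the summary output; `predict_case1` is a hypothetical helper name, and the input speeds are chosen only for illustration:

```r
# Coefficients copied from the Case 1 summary output.
b0 <- -1766.7573   # intercept
b1 <- 41.5366      # slope for speed_ground

# Hypothetical helper that evaluates the fitted line.
predict_case1 <- function(speed_ground) b0 + b1 * speed_ground

at_100 <- predict_case1(100)     # 41.5366 * 100 - 1766.7573
zero_crossing <- -b0 / b1        # speed at which the fitted distance is 0
print(c(at_100 = at_100, zero_crossing = zero_crossing))
```

Because the intercept is negative, the line returns negative distances for ground speeds below the zero crossing, so predictions at the low end of the observed speed range should be read with care.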
Case 2 – Model with speed_ground and height as predictors
The response variable, flight landing distance, can be predicted using the following regression equation:
distance (y) = 41.90 * speed_ground + 13.83 * height - 2217.43
#2. speed_ground and height
fit1 <- lm(FAA$distance ~ FAA$speed_ground + FAA$height)
fit1
##
## Call:
## lm(formula = FAA$distance ~ FAA$speed_ground + FAA$height)
##
## Coefficients:
## (Intercept) FAA$speed_ground FAA$height
## -2217.43 41.90 13.83
summary(fit1)
##
## Call:
## lm(formula = FAA$distance ~ FAA$speed_ground + FAA$height)
##
## Residuals:
## Min 1Q Median 3Q Max
## -756.57 -326.89 -55.22 177.93 1744.82
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2217.4283 84.2159 -26.330 <2e-16 ***
## FAA$speed_ground 41.9050 0.8151 51.414 <2e-16 ***
## FAA$height 13.8345 1.5814 8.748 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 429.6 on 778 degrees of freedom
## Multiple R-squared: 0.7751, Adjusted R-squared: 0.7745
## F-statistic: 1340 on 2 and 778 DF, p-value: < 2.2e-16
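To score a new flight with `predict()`, the model generally needs to be fitted with column names and a `data =` argument, since a formula written with `FAA$...` terms does not pick up values from `newdata`. The following is a sketch under stated assumptions: the data are synthetic (`toy`, with made-up coefficients and noise), and `new_flight` is a hypothetical flight.

```r
set.seed(42)  # synthetic data; not the FAA measurements
n <- 300
toy <- data.frame(
  speed_ground = runif(n, 35, 130),
  height       = runif(n, 6, 60)
)
toy$distance <- -2200 + 42 * toy$speed_ground + 14 * toy$height +
  rnorm(n, sd = 430)

# Using column names plus `data =` lets predict() accept new flights.
toy_fit <- lm(distance ~ speed_ground + height, data = toy)

new_flight <- data.frame(speed_ground = 100, height = 30)
pred <- predict(toy_fit, newdata = new_flight, interval = "prediction")
print(pred)
```

The `lwr` and `upr` columns give a prediction interval for a single landing, which is more informative for overrun risk than the point estimate alone.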
Conclusion
The adjusted R² value increased from 0.7526 to 0.7745 with the addition of height as a predictor. Adjusted R² penalizes each extra predictor, so it increases only when the added predictor improves the fit by more than would be expected by chance.
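The adjusted values reported in the two summaries can be reproduced from the multiple R² values with the standard formula, adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. A short sketch (`adj_r2` is a hypothetical helper name):

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# with n observations and p predictors.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 781                       # rows in the cleaned dataset
case1 <- adj_r2(0.7529, n, 1)  # speed_ground only
case2 <- adj_r2(0.7751, n, 2)  # speed_ground + height
print(round(c(case1 = case1, case2 = case2), 4))
```

The penalty factor (n - 1)/(n - p - 1) grows with p, which is why an uninformative extra predictor lowers the adjusted value even though the plain R² never decreases.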
Also, the Q-Q plot for the Case 2 model indicates that the residuals are approximately normally distributed, which supports the adequacy of the latter model. Thus, Case 2, with speed_ground and height as predictors of the response variable, flight landing distance, is the better fit. The resulting linear model equation for the flight landing distance (y) is:
distance (y) = 41.90 * speed_ground + 13.83 * height - 2217.43
#Standardized residuals, so the +/-2 reference lines and the 45-degree line apply
std_res <- rstandard(fit1)
par(mfrow=c(1,2))
plot(FAA$speed_ground, std_res)
abline(h=c(-2,0,2),lty=2)
qqnorm(std_res)
abline(0,1)
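The visual checks can be complemented with two numbers: the share of standardized residuals outside the +/-2 band (roughly 5% if the residuals are normal) and a Shapiro-Wilk test. This is a sketch on synthetic data only; the `toy` data frame, its coefficients and the noise level are assumptions, not results from the FAA model.

```r
set.seed(7)  # synthetic data for illustration only
n <- 300
toy <- data.frame(speed_ground = runif(n, 35, 130),
                  height       = runif(n, 6, 60))
toy$distance <- -2200 + 42 * toy$speed_ground + 14 * toy$height +
  rnorm(n, sd = 430)
toy_fit <- lm(distance ~ speed_ground + height, data = toy)

std_res <- rstandard(toy_fit)             # residuals scaled to unit variance
share_outside <- mean(abs(std_res) > 2)   # roughly 0.05 under normality
shapiro <- shapiro.test(std_res)          # formal normality check

print(share_outside)
print(shapiro$p.value)
```

For the model in this report, `toy_fit` would be replaced by `fit1`.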