Introduction

Data on the Flight landing distances of two aircrafts is available in the “FAA1.xls”. The main objective of this statistical Analyses is to understand what features of the flight landing affect the landing distance of the aircraft.

Data Dictionary:

Aircraft: Making of the aircraft (Boeing and Airbus)

Duration (in minutes): Flight duration between taking off and landing. The duration of a normal flight should always be greater than 40min.

No_pasg: The number of passengers in a flight.

Speed_ground (in miles per hour): The ground speed of an aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal.

Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal.

Height (in meters): The height of an aircraft when it is passing over the threshold of the runway. The landing aircraft is required to be at least 6 meters high at the threshold of the runway.

Pitch (in degrees): Pitch angle of an aircraft when it is passing over the threshold of the runway.

Distance (in feet): The landing distance of an aircraft. More specifically, it refers to the distance between the threshold of the runway and the point where the aircraft can be fully stopped. The length of the airport runway is typically less than 6000 feet.

library(readxl)
library(sqldf)

## Loading required package: gsubfn

## Loading required package: proto

## Loading required package: RSQLite

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

## 
## Attaching package: 'GGally'

## The following object is masked from 'package:dplyr':
## 
##     nasa

library(tidyr)

## Warning: package 'tidyr' was built under R version 3.6.2

library("PerformanceAnalytics")

## Loading required package: xts

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo

## 
## Attaching package: 'xts'

## The following objects are masked from 'package:dplyr':
## 
##     first, last

## 
## Attaching package: 'PerformanceAnalytics'

## The following object is masked from 'package:graphics':
## 
##     legend

library(corrplot)

## Warning: package 'corrplot' was built under R version 3.6.2

## corrplot 0.84 loaded

flights <- read_excel("./FAA1.xls", col_names = T)
head(flights)

#Dimensions of the data
dim(flights)

## [1] 800   8

FAA1.xls has 800 records and 8 features. The variable that we are interested in is “distance”

str(flights)

## Classes 'tbl_df', 'tbl' and 'data.frame':    800 obs. of  8 variables:
##  $ aircraft    : chr  "boeing" "boeing" "boeing" "boeing" ...
##  $ duration    : num  98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...

unique(flights$aircraft)

## [1] "boeing" "airbus"

We observe that “aircraft” is the only categorical variable with 2 unique values “airbus” and “boeing”

Identification of missing values

The following section identifies column-wise missing value counts.

sapply(flights, function(x) sum(is.na(x)))

##     aircraft     duration      no_pasg speed_ground    speed_air 
##            0            0            0            0          600 
##       height        pitch     distance 
##            0            0            0

##Observation: “speed_air” has 600 missing values which is more than 75% of the available data.

Identifying Duplicates and abnormalities in the data

table(duplicated(flights) | duplicated(flights, fromLast = TRUE))

## 
## FALSE 
##   800

Observation: There are no duplicate records

Abnormal data

Case 1: Duration < 40

sum(flights$duration < 40)

## [1] 5

Case 2: speed_ground < 30 or speed_ground > 140

sum(flights$speed_ground < 30 | flights$speed_ground > 140)

## [1] 3

Case 3: speed_air < 30 or speed_air > 140

nrow(subset(flights,flights$speed_air < 30 | flights$speed_air > 140 ))

## [1] 1

Case 4: Height < 6

sum(flights$height < 6)

## [1] 10

Case 5: Distance > 6000

sum(flights$distance > 6000)

## [1] 2

Data cleaning:

data <- sqldf('SELECT * FROM flights
              WHERE duration < 40 OR
              speed_ground < 30 OR speed_ground > 140 OR
              speed_air < 30 OR speed_air > 140 OR
              height < 6 OR
              distance > 6000')

nrow(flights) - nrow(data)

## [1] 781

There are 19 abnormal records that have to be deleted from the actual dataset

flights %>% anti_join(data)

## Joining, by = c("aircraft", "duration", "no_pasg", "speed_ground", "speed_air", "height", "pitch", "distance")

Now we have 781 records to be explored further.

Descriptive Study

summary(flights)

##    aircraft            duration         no_pasg       speed_ground   
##  Length:800         Min.   : 14.76   Min.   :29.00   Min.   : 27.74  
##  Class :character   1st Qu.:119.49   1st Qu.:55.00   1st Qu.: 65.87  
##  Mode  :character   Median :153.95   Median :60.00   Median : 79.64  
##                     Mean   :154.01   Mean   :60.13   Mean   : 79.54  
##                     3rd Qu.:188.91   3rd Qu.:65.00   3rd Qu.: 92.33  
##                     Max.   :305.62   Max.   :87.00   Max.   :141.22  
##                                                                      
##    speed_air          height           pitch          distance      
##  Min.   : 90.00   Min.   :-3.546   Min.   :2.284   Min.   :  34.08  
##  1st Qu.: 96.16   1st Qu.:23.338   1st Qu.:3.658   1st Qu.: 900.95  
##  Median :100.99   Median :30.147   Median :4.020   Median :1267.44  
##  Mean   :103.83   Mean   :30.122   Mean   :4.018   Mean   :1544.52  
##  3rd Qu.:109.48   3rd Qu.:36.981   3rd Qu.:4.388   3rd Qu.:1960.44  
##  Max.   :141.72   Max.   :59.946   Max.   :5.927   Max.   :6533.05  
##  NA's   :600

Next, we analyze the distribution of the target variable and the correlations between the response and the other features in the dataset using pairwise plots and histograms.

df <- data.frame(select_if(flights, is.numeric))

suppressWarnings(
  df %>%
  gather() %>%
  ggplot(aes(value)) + 
  facet_wrap(~key, scales = "free" ) +
  geom_density()
)

## Warning: Removed 600 rows containing non-finite values (stat_density).

“distance” variable has a right-skewed distribution while the other features have approximately a normal distribution.

Correlation Analysis

ggpairs(data = flights , title = "Flights Data Analysis",
                 mapping = ggplot2:: aes(color = aircraft),
                 lower = list(combo = wrap("facethist", binwidth = 1)))

## Warning: Removed 600 rows containing non-finite values (stat_boxplot).

## Warning in (function (data, mapping, alignPercent = 0.6, method =
## "pearson", : Removed 600 rows containing missing values

## Warning in (function (data, mapping, alignPercent = 0.6, method =
## "pearson", : Removed 600 rows containing missing values

## Warning in (function (data, mapping, alignPercent = 0.6, method =
## "pearson", : Removed 600 rows containing missing values

## Warning: Removed 600 rows containing non-finite values (stat_bin).

## Warning: Removed 600 rows containing missing values (geom_point).

## Warning: Removed 600 rows containing missing values (geom_point).

## Warning: Removed 600 rows containing missing values (geom_point).

## Warning: Removed 600 rows containing non-finite values (stat_density).

## Warning in (function (data, mapping, alignPercent = 0.6, method =
## "pearson", : Removed 600 rows containing missing values

## Warning in (function (data, mapping, alignPercent = 0.6, method =
## "pearson", : Removed 600 rows containing missing values

## Warning in (function (data, mapping, alignPercent = 0.6, method =
## "pearson", : Removed 600 rows containing missing values

## Warning: Removed 600 rows containing missing values (geom_point).

## Warning: Removed 600 rows containing missing values (geom_point).

## Warning: Removed 600 rows containing missing values (geom_point).

chart.Correlation(df, histogram=TRUE, pch=19)

Inferences:

For Aircraft Type = Boeing, Distance and Duration have outliers
Distance has strong correlation with speed_ground
Distance has moderate correlation with Height and Pitch
No. of Boeing aircrafts > No. if Airbus aircrafts in the data
IQR value of Distance for Boeing > IQR value of distance for Airbus.

Model Building

Based on the strength of the correlation between the response and the predictor variables, we choose “aircraft”, “pitch”, “height” and “speed_ground” for performing regression.

Note: “speed_air” shows strong correlation with the response variable. However, since it has 75% missing values, we do not use this as a regressor.

Model representation:

Distance= β_0+ β_1Speed_ground + β_2Height + β_3aircraft + β_4Pitch + ϵ

Model Building and Evaluation

We fit an initial linear model based on the model representation shown above

fit_initial <- lm(distance ~ aircraft + height + pitch + speed_ground, data = flights)
summary(fit_initial)

## 
## Call:
## lm(formula = distance ~ aircraft + height + pitch + speed_ground, 
##     data = flights)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -735.8 -226.5 -101.2  124.3 2139.5 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2699.3101   124.9986 -21.595   <2e-16 ***
## aircraftboeing   499.7709    28.2837  17.670   <2e-16 ***
## height            14.0535     1.2823  10.960   <2e-16 ***
## pitch             40.4663    26.9585   1.501    0.134    
## speed_ground      42.8459     0.6856  62.490   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 372.2 on 795 degrees of freedom
## Multiple R-squared:  0.8434, Adjusted R-squared:  0.8426 
## F-statistic:  1070 on 4 and 795 DF,  p-value: < 2.2e-16

Interpretation of the results:

From the fitted results, we identify that aircraft type and pitch are less significant based on the p-values. We try a reduced model and perform ANOVA on the two models.

Reduced model 1:

Distance= β_0 + β_1*Speed_ground + β_2*Height + β_3*Pitch + ϵ

Reduced Model 2:

Distance= β_0 + β_1*Speed_ground + β_2*Height + β_3*aircraft + ϵ

ANOVA test results for Reduced Model 2 is shown below

fit_reduced_1 <- lm(distance ~ aircraft + height + speed_ground, data = flights)


summary(fit_reduced_1)

## 
## Call:
## lm(formula = distance ~ aircraft + height + speed_ground, data = flights)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -731.00 -233.71  -97.02  130.88 2182.36 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2544.1153    70.3055  -36.19   <2e-16 ***
## aircraftboeing   515.1843    26.3746   19.53   <2e-16 ***
## height            14.1144     1.2826   11.00   <2e-16 ***
## speed_ground      42.8191     0.6859   62.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 372.5 on 796 degrees of freedom
## Multiple R-squared:  0.843,  Adjusted R-squared:  0.8424 
## F-statistic:  1424 on 3 and 796 DF,  p-value: < 2.2e-16

anova(fit_reduced_1, fit_initial)

We infer that the result shows a Df of 1 (indicating that the more complex model has one additional parameter), and a large p-value (> .05). This means that adding the pitch to the model did not lead to a significantly improved fit over the model 1.

Thus, there is enough evidence to go with the Reduced Model 2. We do not omit “aircraft” as type of aircraft actually showed a difference in the IQR of the distance variable.

Final Model

*Distance= β_0 + β_1*Speed_ground + β_2*Height + β_3*aircraft + ϵ*

Residual Diagnostics

The residual plots of the selected model are plotted using the code given below.

par(mfrow = c(2, 2))  # Split the plotting panel into a 2 x 2 grid
plot(fit_reduced_1)  # Plot the model information

Inference:

Residual vs Fitted Values show a curvilinear trend. It is possible that one of the predictors are non-linearly related to the response variable.

In order to understand which variable is non-linear, we plot the correlations between the selected predictors and the response variable.

# Basic Scatterplot Matrix
pairs(~  speed_ground + height + distance ,data=flights , col=ifelse(flights$aircraft == "airbus", "blue", "red"), main="Scatterplot Matrix among numerical features")

It is clear that “speed_ground” has a quadratic relationship with “distance”.

Model correction

We add a quadratic term to the linear model and continue with the diagnostics.

Revised Model representation:

Distance= β_0 + β_1Speed_ground + β_2Height + β_3aircraft + β_4〖Speed〗_ground^2+ ϵ*

fit_quad <- lm(distance ~ aircraft + height + speed_ground + I(speed_ground^2), data = flights)
summary(fit_quad)

## 
## Call:
## lm(formula = distance ~ aircraft + height + speed_ground + I(speed_ground^2), 
##     data = flights)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -488.55  -93.59   -4.13   92.56  414.97 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.797e+03  6.558e+01   27.41   <2e-16 ***
## aircraftboeing     4.013e+02  9.759e+00   41.12   <2e-16 ***
## height             1.359e+01  4.684e-01   29.02   <2e-16 ***
## speed_ground      -6.925e+01  1.578e+00  -43.89   <2e-16 ***
## I(speed_ground^2)  6.937e-01  9.643e-03   71.94   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 136 on 795 degrees of freedom
## Multiple R-squared:  0.9791, Adjusted R-squared:  0.979 
## F-statistic:  9307 on 4 and 795 DF,  p-value: < 2.2e-16

Inferences:

All the variables have p-value less than 0.05
The coefficient of speed_ground has changed direction. In the original model, it was positively correlated but in the revised model, it is negative.

Based on the above, we check if there is multicollinearity between the predictors using VIF.

car :: vif(fit_quad)

##          aircraft            height      speed_ground I(speed_ground^2) 
##          1.029739          1.000613         39.778706         39.702799

VIF values are very high for “speed_ground” and “speed_ground^2”. The following correlation plots also indicate the same.

ggplot(flights, aes(speed_ground, speed_ground^2)) + geom_point() + geom_smooth(method = "lm")

This multicollinearity is fixed by centering the newly added quadratic term around the mean.

The results of fitting the model with centered quadratic term is shown below.

avg_speed_ground <- mean(flights$speed_ground)
new_speed_ground <- flights$speed_ground - avg_speed_ground

fit_quad_new <- lm(distance ~ aircraft + height + speed_ground + I(new_speed_ground^2), data = flights)

summary(fit_quad_new)

## 
## Call:
## lm(formula = distance ~ aircraft + height + speed_ground + I(new_speed_ground^2), 
##     data = flights)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -488.55  -93.59   -4.13   92.56  414.97 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -2.592e+03  2.568e+01 -100.94   <2e-16 ***
## aircraftboeing         4.013e+02  9.759e+00   41.12   <2e-16 ***
## height                 1.359e+01  4.684e-01   29.02   <2e-16 ***
## speed_ground           4.111e+01  2.516e-01  163.41   <2e-16 ***
## I(new_speed_ground^2)  6.937e-01  9.643e-03   71.94   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 136 on 795 degrees of freedom
## Multiple R-squared:  0.9791, Adjusted R-squared:  0.979 
## F-statistic:  9307 on 4 and 795 DF,  p-value: < 2.2e-16

car :: vif(fit_quad_new)

##              aircraft                height          speed_ground 
##              1.029739              1.000613              1.011437 
## I(new_speed_ground^2) 
##              1.034718

The VIFs are well within tolerable limits and the model coefficients are significant.

Residual Diagnostics for the revised model

The residual plot of the final model is given below.

plot(fit_quad_new, which=1, col=c("blue")) # Residuals vs Fitted Plot

The residuals are fairly random around zero. Hence, we conclude that the final model is a good fit for the “flights” dataset.

par(mfrow = c(2, 2))  # Split the plotting panel into a 2 x 2 grid
plot(fit_quad_new)  # Plot the model information

Observation:

The residual plots of the new model with the quadratic term has the following desired characteristics: - Normally distributed - Mean approx. 0 - Constant variance with respect to the predicted value and the predictors.

Conclusion

Overall, we infer that the landing distance is chiefly influenced by the Speed_ground variable. Speed_air has many missing values. Without adequate data on speed_air variable, it is not possible to understand the impact on the distance. However, based on the quadratic model built, we infer that reducing the speed_ground can significantly decrease the landing distance. Other variables like Height and aircraft also play a role. If these three variables are maintained as optimally low as possible, landing over-run can be avoided greatly.

R Notebook for Exploratory Analysis of Flights Data