Data on the Flight landing distances of two aircrafts is available in the “FAA1.xls”. The main objective of this statistical Analyses is to understand what features of the flight landing affect the landing distance of the aircraft.
Aircraft: Making of the aircraft (Boeing and Airbus)
Duration (in minutes): Flight duration between taking off and landing. The duration of a normal flight should always be greater than 40min.
No_pasg: The number of passengers in a flight.
Speed_ground (in miles per hour): The ground speed of an aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal.
Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal.
Height (in meters): The height of an aircraft when it is passing over the threshold of the runway. The landing aircraft is required to be at least 6 meters high at the threshold of the runway.
Pitch (in degrees): Pitch angle of an aircraft when it is passing over the threshold of the runway.
Distance (in feet): The landing distance of an aircraft. More specifically, it refers to the distance between the threshold of the runway and the point where the aircraft can be fully stopped. The length of the airport runway is typically less than 6000 feet.
library(readxl)
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.6.2
library("PerformanceAnalytics")
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.6.2
## corrplot 0.84 loaded
flights <- read_excel("./FAA1.xls", col_names = T)
head(flights)
#Dimensions of the data
dim(flights)
## [1] 800 8
FAA1.xls has 800 records and 8 features. The variable that we are interested in is “distance”
str(flights)
## Classes 'tbl_df', 'tbl' and 'data.frame': 800 obs. of 8 variables:
## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
unique(flights$aircraft)
## [1] "boeing" "airbus"
We observe that “aircraft” is the only categorical variable with 2 unique values “airbus” and “boeing”
The following section identifies column-wise missing value counts.
sapply(flights, function(x) sum(is.na(x)))
## aircraft duration no_pasg speed_ground speed_air
## 0 0 0 0 600
## height pitch distance
## 0 0 0
##Observation: “speed_air” has 600 missing values which is more than 75% of the available data.
table(duplicated(flights) | duplicated(flights, fromLast = TRUE))
##
## FALSE
## 800
Case 1: Duration < 40
sum(flights$duration < 40)
## [1] 5
Case 2: speed_ground < 30 or speed_ground > 140
sum(flights$speed_ground < 30 | flights$speed_ground > 140)
## [1] 3
Case 3: speed_air < 30 or speed_air > 140
nrow(subset(flights,flights$speed_air < 30 | flights$speed_air > 140 ))
## [1] 1
Case 4: Height < 6
sum(flights$height < 6)
## [1] 10
Case 5: Distance > 6000
sum(flights$distance > 6000)
## [1] 2
data <- sqldf('SELECT * FROM flights
WHERE duration < 40 OR
speed_ground < 30 OR speed_ground > 140 OR
speed_air < 30 OR speed_air > 140 OR
height < 6 OR
distance > 6000')
nrow(flights) - nrow(data)
## [1] 781
There are 19 abnormal records that have to be deleted from the actual dataset
flights %>% anti_join(data)
## Joining, by = c("aircraft", "duration", "no_pasg", "speed_ground", "speed_air", "height", "pitch", "distance")
Now we have 781 records to be explored further.
summary(flights)
## aircraft duration no_pasg speed_ground
## Length:800 Min. : 14.76 Min. :29.00 Min. : 27.74
## Class :character 1st Qu.:119.49 1st Qu.:55.00 1st Qu.: 65.87
## Mode :character Median :153.95 Median :60.00 Median : 79.64
## Mean :154.01 Mean :60.13 Mean : 79.54
## 3rd Qu.:188.91 3rd Qu.:65.00 3rd Qu.: 92.33
## Max. :305.62 Max. :87.00 Max. :141.22
##
## speed_air height pitch distance
## Min. : 90.00 Min. :-3.546 Min. :2.284 Min. : 34.08
## 1st Qu.: 96.16 1st Qu.:23.338 1st Qu.:3.658 1st Qu.: 900.95
## Median :100.99 Median :30.147 Median :4.020 Median :1267.44
## Mean :103.83 Mean :30.122 Mean :4.018 Mean :1544.52
## 3rd Qu.:109.48 3rd Qu.:36.981 3rd Qu.:4.388 3rd Qu.:1960.44
## Max. :141.72 Max. :59.946 Max. :5.927 Max. :6533.05
## NA's :600
Next, we analyze the distribution of the target variable and the correlations between the response and the other features in the dataset using pairwise plots and histograms.
df <- data.frame(select_if(flights, is.numeric))
suppressWarnings(
df %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~key, scales = "free" ) +
geom_density()
)
## Warning: Removed 600 rows containing non-finite values (stat_density).
“distance” variable has a right-skewed distribution while the other features have approximately a normal distribution.
ggpairs(data = flights , title = "Flights Data Analysis",
mapping = ggplot2:: aes(color = aircraft),
lower = list(combo = wrap("facethist", binwidth = 1)))
## Warning: Removed 600 rows containing non-finite values (stat_boxplot).
## Warning in (function (data, mapping, alignPercent = 0.6, method =
## "pearson", : Removed 600 rows containing missing values
## Warning in (function (data, mapping, alignPercent = 0.6, method =
## "pearson", : Removed 600 rows containing missing values
## Warning in (function (data, mapping, alignPercent = 0.6, method =
## "pearson", : Removed 600 rows containing missing values
## Warning: Removed 600 rows containing non-finite values (stat_bin).
## Warning: Removed 600 rows containing missing values (geom_point).
## Warning: Removed 600 rows containing missing values (geom_point).
## Warning: Removed 600 rows containing missing values (geom_point).
## Warning: Removed 600 rows containing non-finite values (stat_density).
## Warning in (function (data, mapping, alignPercent = 0.6, method =
## "pearson", : Removed 600 rows containing missing values
## Warning in (function (data, mapping, alignPercent = 0.6, method =
## "pearson", : Removed 600 rows containing missing values
## Warning in (function (data, mapping, alignPercent = 0.6, method =
## "pearson", : Removed 600 rows containing missing values
## Warning: Removed 600 rows containing missing values (geom_point).
## Warning: Removed 600 rows containing missing values (geom_point).
## Warning: Removed 600 rows containing missing values (geom_point).
chart.Correlation(df, histogram=TRUE, pch=19)
Based on the strength of the correlation between the response and the predictor variables, we choose “aircraft”, “pitch”, “height” and “speed_ground” for performing regression.
Note: “speed_air” shows strong correlation with the response variable. However, since it has 75% missing values, we do not use this as a regressor.
Model representation:
Distance= β_0+ β_1Speed_ground + β_2Height + β_3aircraft + β_4Pitch + ϵ
We fit an initial linear model based on the model representation shown above
fit_initial <- lm(distance ~ aircraft + height + pitch + speed_ground, data = flights)
summary(fit_initial)
##
## Call:
## lm(formula = distance ~ aircraft + height + pitch + speed_ground,
## data = flights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -735.8 -226.5 -101.2 124.3 2139.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2699.3101 124.9986 -21.595 <2e-16 ***
## aircraftboeing 499.7709 28.2837 17.670 <2e-16 ***
## height 14.0535 1.2823 10.960 <2e-16 ***
## pitch 40.4663 26.9585 1.501 0.134
## speed_ground 42.8459 0.6856 62.490 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 372.2 on 795 degrees of freedom
## Multiple R-squared: 0.8434, Adjusted R-squared: 0.8426
## F-statistic: 1070 on 4 and 795 DF, p-value: < 2.2e-16
Interpretation of the results:
From the fitted results, we identify that aircraft type and pitch are less significant based on the p-values. We try a reduced model and perform ANOVA on the two models.
Reduced model 1:
Distance= β_0 + β_1*Speed_ground + β_2*Height + β_3*Pitch + ϵ
Reduced Model 2:
Distance= β_0 + β_1*Speed_ground + β_2*Height + β_3*aircraft + ϵ
ANOVA test results for Reduced Model 2 is shown below
fit_reduced_1 <- lm(distance ~ aircraft + height + speed_ground, data = flights)
summary(fit_reduced_1)
##
## Call:
## lm(formula = distance ~ aircraft + height + speed_ground, data = flights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -731.00 -233.71 -97.02 130.88 2182.36
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2544.1153 70.3055 -36.19 <2e-16 ***
## aircraftboeing 515.1843 26.3746 19.53 <2e-16 ***
## height 14.1144 1.2826 11.00 <2e-16 ***
## speed_ground 42.8191 0.6859 62.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 372.5 on 796 degrees of freedom
## Multiple R-squared: 0.843, Adjusted R-squared: 0.8424
## F-statistic: 1424 on 3 and 796 DF, p-value: < 2.2e-16
anova(fit_reduced_1, fit_initial)
We infer that the result shows a Df of 1 (indicating that the more complex model has one additional parameter), and a large p-value (> .05). This means that adding the pitch to the model did not lead to a significantly improved fit over the model 1.
Thus, there is enough evidence to go with the Reduced Model 2. We do not omit “aircraft” as type of aircraft actually showed a difference in the IQR of the distance variable.
Final Model
*Distance= β_0 + β_1*Speed_ground + β_2*Height + β_3*aircraft + ϵ*
The residual plots of the selected model are plotted using the code given below.
par(mfrow = c(2, 2)) # Split the plotting panel into a 2 x 2 grid
plot(fit_reduced_1) # Plot the model information
Inference:
Residual vs Fitted Values show a curvilinear trend. It is possible that one of the predictors are non-linearly related to the response variable.
In order to understand which variable is non-linear, we plot the correlations between the selected predictors and the response variable.
# Basic Scatterplot Matrix
pairs(~ speed_ground + height + distance ,data=flights , col=ifelse(flights$aircraft == "airbus", "blue", "red"), main="Scatterplot Matrix among numerical features")
It is clear that “speed_ground” has a quadratic relationship with “distance”.
Model correction
We add a quadratic term to the linear model and continue with the diagnostics.
Revised Model representation:
fit_quad <- lm(distance ~ aircraft + height + speed_ground + I(speed_ground^2), data = flights)
summary(fit_quad)
##
## Call:
## lm(formula = distance ~ aircraft + height + speed_ground + I(speed_ground^2),
## data = flights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -488.55 -93.59 -4.13 92.56 414.97
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.797e+03 6.558e+01 27.41 <2e-16 ***
## aircraftboeing 4.013e+02 9.759e+00 41.12 <2e-16 ***
## height 1.359e+01 4.684e-01 29.02 <2e-16 ***
## speed_ground -6.925e+01 1.578e+00 -43.89 <2e-16 ***
## I(speed_ground^2) 6.937e-01 9.643e-03 71.94 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 136 on 795 degrees of freedom
## Multiple R-squared: 0.9791, Adjusted R-squared: 0.979
## F-statistic: 9307 on 4 and 795 DF, p-value: < 2.2e-16
Inferences:
Based on the above, we check if there is multicollinearity between the predictors using VIF.
car :: vif(fit_quad)
## aircraft height speed_ground I(speed_ground^2)
## 1.029739 1.000613 39.778706 39.702799
VIF values are very high for “speed_ground” and “speed_ground^2”. The following correlation plots also indicate the same.
ggplot(flights, aes(speed_ground, speed_ground^2)) + geom_point() + geom_smooth(method = "lm")
This multicollinearity is fixed by centering the newly added quadratic term around the mean.
The results of fitting the model with centered quadratic term is shown below.
avg_speed_ground <- mean(flights$speed_ground)
new_speed_ground <- flights$speed_ground - avg_speed_ground
fit_quad_new <- lm(distance ~ aircraft + height + speed_ground + I(new_speed_ground^2), data = flights)
summary(fit_quad_new)
##
## Call:
## lm(formula = distance ~ aircraft + height + speed_ground + I(new_speed_ground^2),
## data = flights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -488.55 -93.59 -4.13 92.56 414.97
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.592e+03 2.568e+01 -100.94 <2e-16 ***
## aircraftboeing 4.013e+02 9.759e+00 41.12 <2e-16 ***
## height 1.359e+01 4.684e-01 29.02 <2e-16 ***
## speed_ground 4.111e+01 2.516e-01 163.41 <2e-16 ***
## I(new_speed_ground^2) 6.937e-01 9.643e-03 71.94 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 136 on 795 degrees of freedom
## Multiple R-squared: 0.9791, Adjusted R-squared: 0.979
## F-statistic: 9307 on 4 and 795 DF, p-value: < 2.2e-16
car :: vif(fit_quad_new)
## aircraft height speed_ground
## 1.029739 1.000613 1.011437
## I(new_speed_ground^2)
## 1.034718
The VIFs are well within tolerable limits and the model coefficients are significant.
Residual Diagnostics for the revised model
The residual plot of the final model is given below.
plot(fit_quad_new, which=1, col=c("blue")) # Residuals vs Fitted Plot
The residuals are fairly random around zero. Hence, we conclude that the final model is a good fit for the “flights” dataset.
par(mfrow = c(2, 2)) # Split the plotting panel into a 2 x 2 grid
plot(fit_quad_new) # Plot the model information
Observation:
The residual plots of the new model with the quadratic term has the following desired characteristics: - Normally distributed - Mean approx. 0 - Constant variance with respect to the predicted value and the predictors.
Overall, we infer that the landing distance is chiefly influenced by the Speed_ground variable. Speed_air has many missing values. Without adequate data on speed_air variable, it is not possible to understand the impact on the distance. However, based on the quadratic model built, we infer that reducing the speed_ground can significantly decrease the landing distance. Other variables like Height and aircraft also play a role. If these three variables are maintained as optimally low as possible, landing over-run can be avoided greatly.