Background: Flight landing.
Motivation: To reduce the risk of landing overrun.
Goal: To study which factors affect the landing distance of a commercial flight, and how, using linear regression.
Data set: Landing data from 950 commercial flights
Variables:
Aircraft: The make of an aircraft (Boeing or Airbus).
Duration (in minutes): Flight duration between taking off and landing. The duration of a normal flight should always be greater than 40min.
No_pasg: The number of passengers in a flight.
Speed_ground (in miles per hour): The ground speed of an aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal.
Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal.
Height (in meters): The height of an aircraft when it is passing over the threshold of the runway. The landing aircraft is required to be at least 6 meters high at the threshold of the runway.
Pitch (in degrees): Pitch angle of an aircraft when it is passing over the threshold of the runway.
Distance (in feet): The landing distance of an aircraft. More specifically, it refers to the distance between the threshold of the runway and the point where the aircraft can be fully stopped. The length of the airport runway is typically less than 6000 feet.
library(tidyverse) #to visualize, transform, input, tidy and join data
library(haven) #to input data from SAS
library(dplyr) #data wrangling
library(stringr) #string related functions
library(kableExtra) #to create HTML Table
library(DT) #to preview the data sets
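library(ggplot2) #to visualize data
# stringsAsFactors=TRUE keeps aircraft as a factor on R >= 4.0, matching the output below
FAA1<-as_tibble(read.csv("FAA1.csv",header=TRUE,stringsAsFactors=TRUE))
FAA2<-as_tibble(read.csv("FAA2.csv",header=TRUE,stringsAsFactors=TRUE))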
head(FAA1)
## # A tibble: 6 x 8
## aircraft duration no_pasg speed_ground speed_air height pitch distance
## <fct> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 boeing 98.5 53 108. 109. 27.4 4.04 3370.
## 2 boeing 126. 69 102. 103. 27.8 4.12 2988.
## 3 boeing 112. 61 71.1 NA 18.6 4.43 1145.
## 4 boeing 197. 56 85.8 NA 30.7 3.88 1664.
## 5 boeing 90.1 70 59.9 NA 32.4 4.03 1050.
## 6 boeing 138. 55 75.0 NA 41.2 4.20 1627.
head(FAA2)
## # A tibble: 6 x 7
## aircraft no_pasg speed_ground speed_air height pitch distance
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 boeing 53 108. 109. 27.4 4.04 3370.
## 2 boeing 69 102. 103. 27.8 4.12 2988.
## 3 boeing 61 71.1 NA 18.6 4.43 1145.
## 4 boeing 56 85.8 NA 30.7 3.88 1664.
## 5 boeing 70 59.9 NA 32.4 4.03 1050.
## 6 boeing 55 75.0 NA 41.2 4.20 1627.
str(FAA1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 800 obs. of 8 variables:
## $ aircraft : Factor w/ 2 levels "airbus","boeing": 2 2 2 2 2 2 2 2 2 2 ...
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : int 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
str(FAA2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 150 obs. of 7 variables:
## $ aircraft : Factor w/ 2 levels "airbus","boeing": 2 2 2 2 2 2 2 2 2 2 ...
## $ no_pasg : int 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
We can see that the duration variable is present in the FAA1 data set but not in the FAA2 data set.
# adding a column duration with all NA's to FAA2
FAA2$duration <- NA
# merging data sets
FAA_merged <- rbind(FAA1,FAA2)
dim(FAA_merged)
## [1] 950 8
# removing duplicates; duration (column 2) is ignored because it is NA for all FAA2 rows
duplicate_index <- duplicated(FAA_merged[,-2])
FAA_merged1 <- FAA_merged[!duplicate_index,]
dim(FAA_merged1)
## [1] 850 8
100 duplicate rows were removed from the combined data set.
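The de-duplication step compares rows on every column except duration. A minimal sketch of that idea on a made-up three-row data frame (the `toy` values are hypothetical and not taken from the FAA files):

```r
# Hypothetical toy data: row 3 repeats row 1 except for a missing duration
toy <- data.frame(
  aircraft = c("boeing", "airbus", "boeing"),
  duration = c(98.5, 120.0, NA),
  distance = c(3370, 1500, 3370)
)

# Compare rows on every column except duration
dup <- duplicated(toy[, names(toy) != "duration"])

# Keep only the first occurrence of each row
toy_unique <- toy[!dup, ]
nrow(toy_unique)
```

Because `duplicated()` flags the later occurrence, the copy that still carries a duration value survives as long as it comes first in the combined data.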
str(FAA_merged1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 850 obs. of 8 variables:
## $ aircraft : Factor w/ 2 levels "airbus","boeing": 2 2 2 2 2 2 2 2 2 2 ...
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : int 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
summary(FAA_merged1)
## aircraft duration no_pasg speed_ground
## airbus:450 Min. : 14.76 Min. :29.0 Min. : 27.74
## boeing:400 1st Qu.:119.49 1st Qu.:55.0 1st Qu.: 65.90
## Median :153.95 Median :60.0 Median : 79.64
## Mean :154.01 Mean :60.1 Mean : 79.45
## 3rd Qu.:188.91 3rd Qu.:65.0 3rd Qu.: 92.06
## Max. :305.62 Max. :87.0 Max. :141.22
## NA's :50
## speed_air height pitch distance
## Min. : 90.00 Min. :-3.546 Min. :2.284 Min. : 34.08
## 1st Qu.: 96.25 1st Qu.:23.314 1st Qu.:3.642 1st Qu.: 883.79
## Median :101.15 Median :30.093 Median :4.008 Median :1258.09
## Mean :103.80 Mean :30.144 Mean :4.009 Mean :1526.02
## 3rd Qu.:109.40 3rd Qu.:36.993 3rd Qu.:4.377 3rd Qu.:1936.95
## Max. :141.72 Max. :59.946 Max. :5.927 Max. :6533.05
## NA's :642
FAA_merged1: After removing duplicates, we have 850 observations and 8 variables.
There are 2 types of aircraft: airbus and boeing. We may have to examine some of the variables separately to see whether they differ between the two makes.
Speed_air has 642 NA’s. We need to evaluate whether to keep speed_air in our analysis.
100 duplicate rows were present after merging the 2 files.
There are abnormal values present for different variables. For example, the minimum value of height is negative, the minimum value of duration is only 14 minutes, some planes have a landing distance above 6000 feet, speed_ground falls outside the specified normal limits in some cases, etc.
FAA_merged2 <- FAA_merged1 %>%
filter((is.na(duration) | duration >40) &
(speed_ground>=30 & speed_ground<=140) &
(height>=6) & (distance<6000) &
(is.na(speed_air) | (speed_air>=30 & speed_air<=140)))
dim(FAA_merged2)
## [1] 831 8
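Before dropping rows, it can help to see how many observations each rule removes. The sketch below applies the same thresholds as the filter above to a small made-up data frame; `toy` and `abnormal` are hypothetical names and the values are invented for illustration.

```r
# Hypothetical toy data, not the FAA files
toy <- data.frame(
  duration     = c(150, 30, NA, 200),
  speed_ground = c(80, 90, 145, 100),
  height       = c(30, 25, 20, -3),
  distance     = c(1500, 2000, 6500, 1200)
)

# One logical column per rule; TRUE marks an abnormal value.
# A missing duration is not treated as abnormal, as in the filter above.
abnormal <- with(toy, data.frame(
  duration     = !is.na(duration) & duration <= 40,
  speed_ground = speed_ground < 30 | speed_ground > 140,
  height       = height < 6,
  distance     = distance >= 6000
))

colSums(abnormal)            # violations per rule
sum(rowSums(abnormal) > 0)   # rows that fail at least one rule
```

A row can break more than one rule, so the per-rule counts need not add up to the number of rows removed.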
par(mfrow=c(2,4))
hist(FAA_merged2$duration, main="Distribution",xlab= "Duration", col= "darkmagenta")
hist(FAA_merged2$no_pasg, main="Distribution",xlab= "No_pasg", col= "darkmagenta")
hist(FAA_merged2$speed_ground, main="Distribution",xlab= "Speed_ground", col= "darkmagenta")
hist(FAA_merged2$speed_air, main="Distribution",xlab= "Speed_air", col= "darkmagenta")
hist(FAA_merged2$height, main="Distribution",xlab= "Height", col= "darkmagenta")
hist(FAA_merged2$pitch, main="Distribution",xlab= "Pitch", col= "darkmagenta")
hist(FAA_merged2$distance, main="Distribution",xlab= "Distance", col= "darkmagenta")
Speed_air has missing observations after cleaning.
The distributions of all variables except speed_air and distance appear approximately normal. Speed_air and distance are skewed to the right.
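The histograms give a visual impression of skewness. One way to put a number on it is a moment-based skewness coefficient, sketched here on simulated data; the `skewness()` helper is defined for this illustration and is not part of the original analysis.

```r
# Moment-based sample skewness; NA values are dropped first
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(1)
symmetric    <- rnorm(1000)   # simulated, roughly symmetric
right_skewed <- rexp(1000)    # simulated, long right tail

skewness(symmetric)      # close to 0
skewness(right_skewed)   # clearly positive
```

Applied to the cleaned data, values near zero would support the "approximately normal" reading, and clearly positive values would confirm the right skew.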
GGally::ggpairs(data = FAA_merged2)
The scatter plots are consistent with the pairwise correlation values obtained, to the extent that the relationships are linear.
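To read the correlations with the response as a ranked list rather than from the plot matrix, something like the following could be used. It runs on simulated data; the coefficients, the noise levels, and the rule that blanks speed_air below 90 are assumptions made only to mimic a column with many missing values.

```r
# Simulated stand-in for the cleaned data (not the FAA files)
set.seed(42)
n <- 200
toy <- data.frame(speed_ground = rnorm(n, 80, 19), height = rnorm(n, 30, 10))
toy$speed_air <- toy$speed_ground + rnorm(n, 0, 1.5)
toy$speed_air[toy$speed_air < 90] <- NA   # assumption: mimic a partly missing column
toy$distance <- 40 * toy$speed_ground + 14 * toy$height + rnorm(n, 0, 300)

# Correlation of each predictor with distance, using the rows available for each pair
predictors <- setdiff(names(toy), "distance")
cors <- sapply(predictors, function(v) {
  cor(toy[[v]], toy$distance, use = "pairwise.complete.obs")
})
sort(cors, decreasing = TRUE)
```

With `use = "pairwise.complete.obs"`, each correlation is computed on a different subset of rows when missingness differs by column, which is worth keeping in mind when comparing them.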
variables <- c("aircraft", "duration","no_pasg" ,"speed_ground", "speed_air" ,"height" ,"pitch")
coeff13 <- rep(NA,length(variables))
p_val13 <- rep(NA,length(variables))
fit_all13 <- data.frame(variables, coeff13,p_val13)
for (i in seq_along(variables))
{
# fit once, then take the slope estimate and its p-value from the second coefficient row
coefs13 <- summary(lm(FAA_merged2$distance ~ FAA_merged2[[variables[i]]]))$coefficients
fit_all13$coeff13[i] <- coefs13[2,1]
fit_all13$p_val13[i] <- coefs13[2,4]
}
table2 <- fit_all13 %>% mutate(sign_coef13 = ifelse(coeff13>0,"+","-")) %>% select(-coeff13) %>% arrange(p_val13)
table2
## variables p_val13 sign_coef13
## 1 speed_ground 4.766371e-252 +
## 2 speed_air 2.500461e-97 +
## 3 aircraft 3.526194e-12 +
## 4 height 4.123860e-03 +
## 5 pitch 1.208124e-02 +
## 6 duration 1.514002e-01 -
## 7 no_pasg 6.092520e-01 -
scaled_FAA_merged2 <- as.data.frame(scale(FAA_merged2[,-1]))
scaled_FAA_merged2$aircraft <- FAA_merged2$aircraft
scaled_FAA_merged2 <- scaled_FAA_merged2 %>% select(aircraft, everything())
summary(scaled_FAA_merged2)
## aircraft duration no_pasg speed_ground
## airbus:444 Min. :-2.33354 Min. :-4.145514 Min. :-2.45353
## boeing:387 1st Qu.:-0.72687 1st Qu.:-0.674829 1st Qu.:-0.71220
## Median :-0.01016 Median :-0.007389 Median : 0.01341
## Mean : 0.00000 Mean : 0.000000 Mean : 0.00000
## 3rd Qu.: 0.72156 3rd Qu.: 0.660050 3rd Qu.: 0.65999
## Max. : 3.11988 Max. : 3.596784 Max. : 2.84174
## NA's :50
## speed_air height pitch distance
## Min. :-1.3847 Min. :-2.47632 Min. :-3.26772 Min. :-1.6520
## 1st Qu.:-0.7452 1st Qu.:-0.70804 1st Qu.:-0.69259 1st Qu.:-0.7020
## Median :-0.2430 Median :-0.02972 Median :-0.00783 Median :-0.2904
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.6029 3rd Qu.: 0.66905 3rd Qu.: 0.69323 3rd Qu.: 0.4620
## Max. : 3.0223 Max. : 3.01366 Max. : 3.64933 Max. : 4.3058
## NA's :628
coeff14 <- rep(NA,length(variables))
p_val14 <- rep(NA,length(variables))
fit_all14 <- data.frame(variables, coeff14,p_val14)
for (i in seq_along(variables))
{
# fit once, then take the slope estimate and its p-value from the second coefficient row
coefs14 <- summary(lm(scaled_FAA_merged2$distance ~ scaled_FAA_merged2[[variables[i]]]))$coefficients
fit_all14$coeff14[i] <- coefs14[2,1]
fit_all14$p_val14[i] <- coefs14[2,4]
}
table3 <- fit_all14 %>% mutate(sign_coef14 = ifelse(coeff14>0,"+","-")) %>% select(-coeff14) %>% arrange(p_val14)
table3
## variables p_val14 sign_coef14
## 1 speed_ground 4.766371e-252 +
## 2 speed_air 2.500461e-97 +
## 3 aircraft 3.526194e-12 +
## 4 height 4.123860e-03 +
## 5 pitch 1.208124e-02 +
## 6 duration 1.514002e-01 -
## 7 no_pasg 6.092520e-01 -
The results are consistent: standardizing the variables changes neither the p-values nor the signs of the coefficients. Speed_ground, speed_air, and aircraft are the most important predictors.
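The identical p-values in the two tables are expected, since centering and scaling are linear transformations that leave the t-statistic of a simple regression slope unchanged. A small sketch on simulated data (the variables `x` and `y` are invented for illustration):

```r
set.seed(7)
x <- rnorm(100, 80, 19)              # simulated predictor
y <- 40 * x + rnorm(100, 0, 400)     # simulated response

# scale() returns a one-column matrix; convert to plain vectors before fitting
xs <- as.numeric(scale(x))
ys <- as.numeric(scale(y))

raw    <- summary(lm(y ~ x))$coefficients
scaled <- summary(lm(ys ~ xs))$coefficients

c(raw[2, 3], scaled[2, 3])     # same t-statistic, hence same p-value
c(scaled[2, 1], cor(x, y))     # standardized slope equals the correlation
```

In a one-predictor regression on standardized variables, the slope is the Pearson correlation, so the standardized coefficients can be compared directly across predictors.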
fit_all2.1 <- lm(distance ~ speed_ground , data = FAA_merged2)
fit_all2.2 <- lm(distance ~ speed_air , data = FAA_merged2)
fit_all2.3 <- lm(distance ~ speed_ground + speed_air , data = FAA_merged2)
# coeff for fit_all2.1 and pval for fit_all2.1
summary(fit_all2.1)$coefficients[,1]
## (Intercept) speed_ground
## -1773.94071 41.44219
summary(fit_all2.1)$coefficients[,4]
## (Intercept) speed_ground
## 2.172875e-110 4.766371e-252
# coeff for fit_all2.2 and pval for fit_all2.2
summary(fit_all2.2)$coefficients[,1]
## (Intercept) speed_air
## -5455.7088 79.5321
summary(fit_all2.2)$coefficients[,4]
## (Intercept) speed_air
## 5.817937e-67 2.500461e-97
# coeff for fit_all2.3 and pval for fit_all2.3
summary(fit_all2.3)$coefficients[,1]
## (Intercept) speed_ground speed_air
## -5462.28328 -14.37343 93.95880
summary(fit_all2.3)$coefficients[,4]
## (Intercept) speed_ground speed_air
## 6.587877e-67 2.584769e-01 6.990846e-12
# Checking correlation between speed_air and speed_ground
GGally::ggpairs(data = FAA_merged2, columns =4:5 )
When we regress landing distance on speed_air and speed_ground individually, both are significant with positive coefficients. When we regress landing distance on both together, however, the coefficient of speed_ground changes sign and its p-value rises above 0.05. This is most likely due to the very high correlation between the two variables (above 0.98). I would keep speed_ground in the model, because speed_air is missing for 628 of the 831 cleaned observations, while speed_ground has no missing values.
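The sign flip described above is a typical symptom of collinearity. A common diagnostic is the variance inflation factor, VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the others. The sketch below uses simulated data with two nearly identical speed variables; all numbers are assumptions chosen for illustration.

```r
set.seed(3)
n <- 200
speed_ground <- rnorm(n, 100, 10)                  # simulated
speed_air    <- speed_ground + rnorm(n, 0, 1.5)    # nearly a copy of speed_ground
distance     <- 80 * speed_air + rnorm(n, 0, 300)

cor(speed_ground, speed_air)

# VIF for speed_ground: regress it on the other predictor
r2  <- summary(lm(speed_ground ~ speed_air))$r.squared
vif <- 1 / (1 - r2)
vif

# The standard error of the speed_ground slope inflates once speed_air is added
se_single <- summary(lm(distance ~ speed_ground))$coefficients[2, 2]
se_joint  <- summary(lm(distance ~ speed_ground + speed_air))$coefficients[2, 2]
c(se_single, se_joint)
```

A VIF well above 10 is often taken as a sign that one of the two predictors should be dropped, which matches the decision to keep only speed_ground.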
r.sq_17 <-c()
r.sq_17[1] <- summary(lm(distance ~ speed_ground , data = FAA_merged2))$r.squared
r.sq_17[2] <- summary(lm(distance ~ speed_ground + aircraft , data = FAA_merged2))$r.squared
r.sq_17[3] <- summary(lm(distance ~ speed_ground + aircraft + height, data = FAA_merged2))$r.squared
r.sq_17[4] <- summary(lm(distance ~ speed_ground + aircraft + height + pitch , data = FAA_merged2))$r.squared
r.sq_17[5] <- summary(lm(distance ~ speed_ground + aircraft + height + pitch + duration , data = FAA_merged2))$r.squared
r.sq_17[6] <- summary(lm(distance ~ speed_ground + aircraft + height + pitch + duration + no_pasg, data = FAA_merged2))$r.squared
parameters <-c(1:6)
model_17 <- data.frame(r.sq_17,parameters)
model_17
## r.sq_17 parameters
## 1 0.7503784 1
## 2 0.8251319 2
## 3 0.8488989 3
## 4 0.8493717 4
## 5 0.8504184 5
## 6 0.8506023 6
model_17 %>% ggplot(aes(x=as.factor(parameters),y=r.sq_17)) +
geom_point(size=3) +
xlab("Number of variables in model") +
ylab("R sq Value") +
ggtitle("Effect of Addition of Variables in Linear Regression Model on R sq Values") +
theme(plot.title = element_text(hjust = 0.5))
R-squared keeps increasing as variables are added, but after the third variable the increase is very small. The later variables therefore add little new information to the model while making it more complex.
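The sequence of nested models can also be built in a loop rather than written out by hand. The following is a sketch on simulated data, where `noise` is a predictor unrelated to the response by construction; the names and numbers are hypothetical.

```r
set.seed(11)
n <- 300
toy <- data.frame(
  speed_ground = rnorm(n, 80, 19),
  height       = rnorm(n, 30, 10),
  noise        = rnorm(n)          # unrelated to the response by construction
)
toy$distance <- 40 * toy$speed_ground + 14 * toy$height + rnorm(n, 0, 300)

predictors <- c("speed_ground", "height", "noise")

# Fit distance ~ first k predictors for k = 1, 2, 3 and collect R-squared
r_sq <- sapply(seq_along(predictors), function(k) {
  f <- reformulate(predictors[1:k], response = "distance")
  summary(lm(f, data = toy))$r.squared
})
r_sq
```

Even the pure-noise predictor does not lower R-squared, which is why R-squared alone cannot be used to decide when to stop adding variables.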
adj.r.sq_18 <-c()
adj.r.sq_18[1] <- summary(lm(distance ~ speed_ground , data = FAA_merged2))$adj.r.squared
adj.r.sq_18[2] <- summary(lm(distance ~ speed_ground + aircraft , data = FAA_merged2))$adj.r.squared
adj.r.sq_18[3] <- summary(lm(distance ~ speed_ground + aircraft + height, data = FAA_merged2))$adj.r.squared
adj.r.sq_18[4] <- summary(lm(distance ~ speed_ground + aircraft + height + pitch , data = FAA_merged2))$adj.r.squared
adj.r.sq_18[5] <- summary(lm(distance ~ speed_ground + aircraft + height + pitch + duration, data = FAA_merged2))$adj.r.squared
adj.r.sq_18[6] <- summary(lm(distance ~ speed_ground + aircraft + height + pitch + duration + no_pasg, data = FAA_merged2))$adj.r.squared
parameters <-c(1:6)
model_18 <- data.frame(adj.r.sq_18,parameters)
model_18
## adj.r.sq_18 parameters
## 1 0.7500773 1
## 2 0.8247095 2
## 3 0.8483508 3
## 4 0.8486423 4
## 5 0.8494534 5
## 6 0.8494442 6
model_18 %>% ggplot(aes(x=as.factor(parameters),y=adj.r.sq_18)) +
geom_point(size=3) +
xlab("Number of variables in model") +
ylab("Adj. R sq Value") +
ggtitle("Effect of Addition of Variables in Linear Regression Model on Adj R sq Values") +
theme(plot.title = element_text(hjust = 0.5))
Adjusted R-squared rises sharply up to the third variable (height), gains only marginally from pitch and duration, and decreases slightly when the sixth variable (no_pasg) is added. Variables beyond the third therefore bring little or no new information, and adjusted R-squared penalizes the model when that happens.
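The penalty can be checked by hand with the usual formula, adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of rows used in the fit and p the number of predictors. The helper `adj_r_sq()` below is written for this illustration; the R-squared values are copied from the tables above.

```r
adj_r_sq <- function(r_sq, n, p) 1 - (1 - r_sq) * (n - 1) / (n - p - 1)

# One-predictor model: R^2 = 0.7503784 on all 831 cleaned rows
adj_r_sq(0.7503784, n = 831, p = 1)

# Six-predictor model: R^2 = 0.8506023; duration has 50 NA's, so 781 rows are used
adj_r_sq(0.8506023, n = 781, p = 6)
```

Both results agree with the reported values (0.7500773 and 0.8494442) to the displayed precision, which also shows that the models containing duration are fitted on fewer rows than the others.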
aic19 <- c()
aic19[1] <- AIC(lm(distance ~ speed_ground , data = FAA_merged2))
aic19[2] <- AIC(lm(distance ~ speed_ground + aircraft , data = FAA_merged2))
aic19[3] <- AIC(lm(distance ~ speed_ground + aircraft + height, data = FAA_merged2))
aic19[4] <- AIC(lm(distance ~ speed_ground + aircraft + height + pitch, data = FAA_merged2))
aic19[5] <- AIC(lm(distance ~ speed_ground + aircraft + height + pitch + duration , data = FAA_merged2))
aic19[6] <- AIC(lm(distance ~ speed_ground + aircraft + height + pitch + duration + no_pasg, data = FAA_merged2))
aic19
## [1] 12508.81 12215.05 12095.65 12095.05 11378.84 11379.88
parameters <-c(1:6)
model_19 <- data.frame(aic19,parameters)
model_19 %>% ggplot(aes(x=as.factor(parameters),y=aic19)) +
geom_point(size=3) +
xlab("Number of variables in model") +
ylab("AIC Value") +
ggtitle("Effect of Addition of Variables in Linear Regression Model on AIC Values") +
theme(plot.title = element_text(hjust = 0.5))
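AIC decreases up to the model with 3 variables and is almost unchanged when pitch is added (12095.65 versus 12095.05). The large drop at the fifth model is not evidence of a better fit: duration has 50 missing values, so lm() fits models 5 and 6 on 781 rows instead of 831, and AIC values are only comparable across models fitted to the same observations. Between those two models, AIC increases slightly when no_pasg is added. So, AIC suggests picking the first 3 variables.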
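AIC values are only comparable when every model is fitted to the same rows. The sketch below, on simulated data with artificially inserted missing values, shows how `lm()` silently changes the sample and one way to put the models on an equal footing; all names and numbers are hypothetical.

```r
set.seed(5)
n <- 300
toy <- data.frame(speed_ground = rnorm(n, 80, 19), duration = rnorm(n, 150, 50))
toy$distance <- 40 * toy$speed_ground + rnorm(n, 0, 300)
toy$duration[1:30] <- NA           # simulated missing values

m_small <- lm(distance ~ speed_ground, data = toy)
m_big   <- lm(distance ~ speed_ground + duration, data = toy)
c(nobs(m_small), nobs(m_big))      # 300 vs 270: different samples

# Refit both models on the same complete rows before comparing AIC
cc <- toy[complete.cases(toy), ]
AIC(lm(distance ~ speed_ground, data = cc),
    lm(distance ~ speed_ground + duration, data = cc))
```

The same approach would apply to the six nested models here: restricting all of them to the rows where duration is observed makes the AIC sequence directly comparable.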
Based on steps 17-19, I would pick the following variables:
Speed Ground
Aircraft
Height
fit_base <- lm(distance ~ 1, data=FAA_merged2[,-5])
fit_max <- lm(distance~.,data=FAA_merged2[,-5])
library(MASS) # loaded last because MASS::select() masks dplyr::select()
aic_for_fir <- stepAIC(fit_base, direction = 'forward', scope=list(upper=fit_max,lower=fit_base))
## Start: AIC=11299.8
## distance ~ 1
##
## Df Sum of Sq RSS AIC
## + speed_ground 1 480561689 157699570 10104
## + aircraft 1 33759132 604502127 11220
## + height 1 6866417 631394842 11256
## + pitch 1 3010731 635250529 11262
## + duration 1 1685114 636576145 11263
## <none> 638261260 11263
## + no_pasg 1 181284 638079976 11265
##
## Step: AIC=10148.53
## distance ~ speed_ground
##
## Df Sum of Sq RSS AIC
## + aircraft 1 47102191 110597379 9810.8
## + height 1 14123617 143575953 10027.6
## + pitch 1 8246571 149453000 10061.0
## <none> 157699570 10103.6
## + no_pasg 1 154554 157545016 10104.8
## + duration 1 50570 157649000 10105.4
##
## Step: AIC=9854.77
## distance ~ speed_ground + aircraft
##
## Df Sum of Sq RSS AIC
## + height 1 15048298 95549081 9691.2
## <none> 110597379 9810.8
## + pitch 1 182007 110415372 9811.4
## + no_pasg 1 41575 110555804 9812.5
## + duration 1 9394 110587985 9812.7
##
## Step: AIC=9735.37
## distance ~ speed_ground + aircraft + height
##
## Df Sum of Sq RSS AIC
## <none> 95549081 9691.2
## + no_pasg 1 120379 95428702 9692.2
## + pitch 1 71174 95477907 9692.6
## + duration 1 4446 95544635 9693.2
summary(aic_for_fir)
##
## Call:
## lm(formula = distance ~ speed_ground + aircraft + height, data = FAA_merged2[,
## -5])
##
## Residuals:
## Min 1Q Median 3Q Max
## -711.95 -226.73 -90.17 130.04 1471.84
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2512.2433 68.1974 -36.84 <2e-16 ***
## speed_ground 42.4024 0.6483 65.41 <2e-16 ***
## aircraftboeing 496.0452 24.2975 20.41 <2e-16 ***
## height 14.1478 1.2405 11.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 349.1 on 827 degrees of freedom
## Multiple R-squared: 0.8489, Adjusted R-squared: 0.8484
## F-statistic: 1549 on 3 and 827 DF, p-value: < 2.2e-16
The variable selection algorithm gives a model with the same variables as the model selected manually in the previous step.
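To see what the selected model implies in practice, the fitted coefficients can be applied to a single landing. The coefficients below are copied from the summary output; the input values (a Boeing at 100 mph ground speed and 30 m height) are hypothetical.

```r
# Coefficients copied from the summary output above
coefs <- c(intercept = -2512.2433, speed_ground = 42.4024,
           aircraftboeing = 496.0452, height = 14.1478)

# Hypothetical landing: intercept term, 100 mph, Boeing (dummy = 1), 30 m
x <- c(1, 100, 1, 30)

pred <- sum(coefs * x)   # predicted landing distance in feet
pred
```

This gives about 2648 feet. For an Airbus under the same conditions the dummy would be 0 and the prediction about 496 feet shorter, which is the meaning of the `aircraftboeing` coefficient.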