Overview

Name: Shravya Challa

UCID: M13940819

Background: Flight landing.

Motivation: To reduce the risk of landing overrun.

Goal: To study what factors and how they would impact the landing distance of a commercial flight.

Data: Landing data (landing distance and other parameters) from 950 commercial flights (not real data set but simulated from statistical models).

Variable dictionary:

Aircraft: The make of an aircraft (Boeing or Airbus).

Duration (in minutes): Flight duration between taking off and landing. The duration of a normal flight should always be greater than 40min.

No_pasg: The number of passengers in a flight.

Speed_ground (in miles per hour): The ground speed of an aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal.

Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal.

Height (in meters): The height of an aircraft when it is passing over the threshold of the runway. The landing aircraft is required to be at least 6 meters high at the threshold of the runway.

Pitch (in degrees): Pitch angle of an aircraft when it is passing over the threshold of the runway.

Distance (in feet): The landing distance of an aircraft. More specifically, it refers to the distance between the threshold of the runway and the point where the aircraft can be fully stopped. The length of the airport runway is typically less than 6000 feet.

Part1

library(readxl)
library(tidyr)
library(ggplot2)
library(dplyr)
library(MASS)
library(broom)
library(pROC)

Initial exploration of data

Step1: Read two files FAA1.xls and FAA2.xls into R system.

Using read_excel from readxl library to read the files

faa1 <- read_excel("C:/Users/shrav/OneDrive/D/courses/Statistical model/FAA1.xls")
faa2 <- read_excel("C:/Users/shrav/OneDrive/D/courses/Statistical model/FAA2.xls")

Step2: Check the structure of 2 data sets

str(faa1)
str(faa2)
  • FAA1 has 800 observations of 8 variables
  • FAA2 has 150 observations of 7 variables. FAA2 is missing the duration variable in the dataset.

Step3: Merge the 2 datasets and check for duplicates

  • First we add a duration column to faa2 dataset with NA values and we ‘rbind’ the two dataframes to merge.
  • The merged dataset would have 950 observations with 8 variables.
  • We use the ‘duplicated’ function to check for duplicates on the merged dataset without the duration column.
  • 100 duplicate values are removed and the resulting dataset would have 850 observations.
# merge datasets
faa2['duration'] <- NA
faa <- rbind(faa1, faa2)

# remove duplicates
idx <- duplicated(faa[,-2])
faa <- faa[!idx,]

Step4: Check the structure of the dataset and summary statistics

  • str function is used to check the structure of the dataset.
  • We use the summary function to get the summary statistics of each variable
str(faa)
## tibble[,8] [850 x 8] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : chr [1:850] "boeing" "boeing" "boeing" "boeing" ...
##  $ duration    : num [1:850] 98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num [1:850] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:850] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:850] 109 103 NA NA NA ...
##  $ height      : num [1:850] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:850] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:850] 3370 2988 1145 1664 1050 ...
summary(faa)
##    aircraft            duration         no_pasg      speed_ground   
##  Length:850         Min.   : 14.76   Min.   :29.0   Min.   : 27.74  
##  Class :character   1st Qu.:119.49   1st Qu.:55.0   1st Qu.: 65.90  
##  Mode  :character   Median :153.95   Median :60.0   Median : 79.64  
##                     Mean   :154.01   Mean   :60.1   Mean   : 79.45  
##                     3rd Qu.:188.91   3rd Qu.:65.0   3rd Qu.: 92.06  
##                     Max.   :305.62   Max.   :87.0   Max.   :141.22  
##                     NA's   :50                                      
##    speed_air          height           pitch          distance      
##  Min.   : 90.00   Min.   :-3.546   Min.   :2.284   Min.   :  34.08  
##  1st Qu.: 96.25   1st Qu.:23.314   1st Qu.:3.642   1st Qu.: 883.79  
##  Median :101.15   Median :30.093   Median :4.008   Median :1258.09  
##  Mean   :103.80   Mean   :30.144   Mean   :4.009   Mean   :1526.02  
##  3rd Qu.:109.40   3rd Qu.:36.993   3rd Qu.:4.377   3rd Qu.:1936.95  
##  Max.   :141.72   Max.   :59.946   Max.   :5.927   Max.   :6533.05  
##  NA's   :642
  • There are 850 observations of 8 variables after the duplicates are removed.
  • Duration and speed_air have 50 and 642 missing values respectively.
  • Height has negative value that is inconsistent with expected observations.
  • Distance has a huge difference between Mean and Median, inferring that the data is right skewed and has outliers.

Step5: Initial observations to the FAA Agent

  • The duration column is missing in the second dataset resulting in 150 missing observations in the merged dataset and without duration there are 100 duplicate observations which are removed in the final dataset.
  • Speed air is not captured for 642 observations, which amounts to around 70% of the data.
  • Height has negative value, which is not expected for flight data and can impact mean, median and other summary statistics.
  • Third quantile of distance ranges from 1936 to 6533, which is a huge range when compared to the other quantiles. This indicates the existence of outliers and also that the data is right skewed.
  • There are observations that are considered abnormal in Duration, speed_ground, speed_air, height and distance variables.

Data cleaning and further exploration

Step6: Find abnormal values in the dataset and remove them

  • We filter the rows with abnormal values as mentioned in the data dictionary.
  • The indices of the abnormal observations in the columns with missing values are taken separately and then the rows are removed.
faa4 <- faa
faa4 <- faa4[!(faa4$speed_ground < 30 | faa4$speed_ground > 140),]

idx2 <- which(faa4$duration < 40)
faa4 <- faa4[-idx2,]

idx3 <- which(faa4$speed_air < 30 | faa4$speed_air > 140)

faa4 <- faa4[!(faa4$height <6),]
faa4 <- faa4[!(faa4$distance > 6000),]
  • After speed_ground is filtered there are no abnormal values in speed_air, hence no filtering is required for speed_air.
  • Number of rows removed: 19
    • Speed_ground - 3
    • duration - 5
    • speed_air - 0
    • height - 10
    • distance - 1

Step7: Structure of the dataset, summary statistics

str(faa4)
## tibble[,8] [831 x 8] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : chr [1:831] "boeing" "boeing" "boeing" "boeing" ...
##  $ duration    : num [1:831] 98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num [1:831] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:831] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:831] 109 103 NA NA NA ...
##  $ height      : num [1:831] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:831] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:831] 3370 2988 1145 1664 1050 ...
summary(faa4)
##    aircraft            duration         no_pasg       speed_ground   
##  Length:831         Min.   : 41.95   Min.   :29.00   Min.   : 33.57  
##  Class :character   1st Qu.:119.63   1st Qu.:55.00   1st Qu.: 66.20  
##  Mode  :character   Median :154.28   Median :60.00   Median : 79.79  
##                     Mean   :154.78   Mean   :60.06   Mean   : 79.54  
##                     3rd Qu.:189.66   3rd Qu.:65.00   3rd Qu.: 91.91  
##                     Max.   :305.62   Max.   :87.00   Max.   :132.78  
##                     NA's   :50                                       
##    speed_air          height           pitch          distance      
##  Min.   : 90.00   Min.   : 6.228   Min.   :2.284   Min.   :  41.72  
##  1st Qu.: 96.23   1st Qu.:23.530   1st Qu.:3.640   1st Qu.: 893.28  
##  Median :101.12   Median :30.167   Median :4.001   Median :1262.15  
##  Mean   :103.48   Mean   :30.458   Mean   :4.005   Mean   :1522.48  
##  3rd Qu.:109.36   3rd Qu.:37.004   3rd Qu.:4.370   3rd Qu.:1936.63  
##  Max.   :132.91   Max.   :59.946   Max.   :5.927   Max.   :5381.96  
##  NA's   :628
  • There are 831 observations of 8 variables in the resulting sample.
  • Duration still has 50 missing values and speed_air now has 628 missing values.
  • Distance is still skewed to the right, and there is a possibility of outliers
  • The mean and Median of other variables are close enough, hence we can assume that they are mostly symmetrical distributions.

Step8: Histograms for all variables

  • ggplot2 is used to obtain the histograms of all variables in a single plot.
  • The histograms here are combined with density plot to understand the distribution of the variables.
  • Aircraft column is omitted when creating the histograms as it is a character variable.
faa4[,-1] %>%
  gather() %>% 
  ggplot(aes(value)) +                                # ggplot2 histogram & density
  facet_wrap(~ key, scales = "free") +
  geom_histogram(aes(y = stat(density))) +
  geom_density(col = "blue", size = 1)

  • Distance and speed_air are skewed to the right. All other variables are mostly normal.

Step9: Informed findings to the FAA agent

  • 19 abnormal values are removed from the dataset. The final dataset has 831 observations
  • Distance and speed_air are skewed to the right and all other variables are mostly normal
  • Duration still has 50 missing values and speed_air now has 628 missing values.

Initial analysis to identify important factors influencing Landing distance

Step10: Computing correlation between landing distance and every other variable and storing the values in a table

  • cor function is used to obtain the pairwise correlation of landing distance with every other variable X
  • Table1 is created with variable_names, size of correlation and direction of correlation
  • The table is sorted based on the size of the correlation
  • Correlation is calculated only between pairwise complete observations, i.e., non NA observations.
cor_val <- cor(faa4[,c(-1, -8)], faa4$distance, use = "pairwise.complete.obs")
table1 <- data.frame(cor_val)
table1 <- table1 %>%
  mutate(sign = case_when(
    cor_val >= 0 ~ "Positive",
    cor_val < 0 ~ "Negative"
  ))

table1$cor_val <- abs(table1$cor_val)

table1 <- table1 %>% arrange(desc(cor_val))
table1
##                 cor_val     sign
## speed_air    0.94209714 Positive
## speed_ground 0.86624383 Positive
## height       0.09941121 Positive
## pitch        0.08702846 Positive
## duration     0.05138252 Negative
## no_pasg      0.01775663 Negative
  • According to the sorted table1, speed_air has the highest correlation followed by speed_ground with landing distance.
  • Number of passengers and duration have the least correlation with landing distance and are the only variables that have negative correlation.

Step11: Scatter plots with distance

  • ggplot2 with geom_point is used to get the scatter plots of all other variables with distance
faa4[,-1] %>%
  gather(-distance, key = "var", value = "value") %>%
  ggplot(aes(x = value, y = distance)) +
  geom_point() +
  facet_wrap(~ var, scales = "free") +
  theme_bw()

  • The scatter plots are consistent with the correlation values obtained earlier.
  • Since height, pitch, duration and no_pasg have very small correlation value, it is difficult to visualize the relation between these variables with the distance variable.

Step12: Convert aircraft to numeric 0/1 and repeat steps 10 and 11

  • Boeing is given a value of 1 and airbus 0
  • we use mutate from dplyr package to add a new variable called ‘aircraft1’
  • We rerun the cor function from step 10 and update table1 with the results.
faa4 <- faa4 %>%
  mutate(aircraft1 = case_when(
    aircraft == "boeing" ~ 1,
    aircraft == "airbus" ~ 0
  ))

cor_val <- cor(faa4[,c(-1, -8)], faa4$distance, use = "pairwise.complete.obs")
table1 <- data.frame(cor_val)
table1 <- table1 %>%
  mutate(sign = case_when(
    cor_val >= 0 ~ "Positive",
    cor_val < 0 ~ "Negative"
  ))

table1$cor_val <- abs(table1$cor_val)

table1 <- table1 %>% arrange(desc(cor_val))
table1
##                 cor_val     sign
## speed_air    0.94209714 Positive
## speed_ground 0.86624383 Positive
## aircraft1    0.23814452 Positive
## height       0.09941121 Positive
## pitch        0.08702846 Positive
## duration     0.05138252 Negative
## no_pasg      0.01775663 Negative
  • After sorting the values in table1, aircraft1 ranks 3rd.
faa4[,-1] %>%
  gather(-distance, key = "var", value = "value") %>%
  ggplot(aes(x = value, y = distance)) +
  geom_point() +
  facet_wrap(~ var, scales = "free") +
  theme_bw()

  • As aircraft1 is still a categorical variable, we cannot completely visualize the correlation from scatter matrix.

Regression using single factor each time

Step13: Regress Y on each of the X variables. Store p-value and direction of regression co-efficient in table2

  • As we have to regress landing distance on each of the other variable, we can use loops to achieve this.
  • The column names of the FAA dataset except aircraft and landing distance are stored in a vector
  • An lm.test matrix is created to store the models.
  • reformulate is used within lm function to create a formula of distance and column.
  • Table2 has variable names, size of p-value and direction of coefficient.
  • glance function from ‘broom’ package is used to extract the p-value of each model.
  • tidy function from ‘broom’ package is used to extract the co-efficient of the model.
col10 <- names(faa4)[-c(1,8)]

lm.test <- vector("list", length(col10))
table2 <- data.frame(matrix(ncol=3,nrow=0))
colnames(table2) <- c("variable names", "p_value", "Direction")

for(i in seq_along(col10)){
  lm.test[[i]] <- lm(reformulate(col10[i], "distance"), data = faa4)
  
  coef <- tidy(lm.test[[i]])[2,2]
  val <- case_when(
    coef >= 0  ~ "Positive",
    coef < 0  ~ "Negative"
  )
  table2[nrow(table2)+1,] = c(col10[i], as.numeric(glance(lm.test[[i]])$p.value), val)
}

table2$p_value <- as.numeric(table2$p_value)
table2 <- table2 %>% arrange(p_value)
table2
##   variable names       p_value Direction
## 1   speed_ground 4.766371e-252  Positive
## 2      speed_air  2.500461e-97  Positive
## 3      aircraft1  3.526194e-12  Positive
## 4         height  4.123860e-03  Positive
## 5          pitch  1.208124e-02  Positive
## 6       duration  1.514002e-01  Negative
## 7        no_pasg  6.092520e-01  Negative
  • The directions of the coefficients appear to be consistent with the direction of correlation observed earlier.
  • Duration and no_pasg do not form significant models.
  • Speed_ground is the most significant factor and no_pasg is the least significant

Step14: Standardize the regressors, repeat step13 and store the coefficients in table3

  • We run a similar loop as in step13
  • We standardize the regressors using the formula - X’= {X-mean(X)}/sd(X)
  • A new variable called ‘data’ is created in faa4 dataset for the standardized variable.
  • table3 is created with variable names, coefficient size and direction of the coefficient.
col10 <- names(faa4)[-c(1,8)]

lm.test2 <- vector("list", length(col10))
table3 <- data.frame(matrix(ncol=3,nrow=0))
colnames(table3) <- c("variable names", "coef", "Direction")


for(i in seq_along(col10)){
  data1 <- faa4[,col10[i]]
  colnames(data1)[1] <- "x1"
  faa4["data"] <- (data1$x1 - mean(data1$x1, na.rm = T))/sd(data1$x1, na.rm = T)

  lm.test2[[i]] <- lm(reformulate("data", "distance"), data = faa4)
  coef <- tidy(lm.test2[[i]])[2,2]
  val <- case_when(
    coef >= 0  ~ "Positive",
    coef < 0  ~ "Negative"
  )
  table3[nrow(table3)+1,] = c(col10[i], abs(tidy(lm.test2[[i]])[2,2]), val)
}
table3 <- table3 %>% arrange(desc(coef))
table3
##   variable names      coef Direction
## 1   speed_ground 776.44740  Positive
## 2      speed_air 774.34654  Positive
## 3      aircraft1 213.45802  Positive
## 4         height  89.10606  Positive
## 5          pitch  78.00693  Positive
## 6       duration  46.48013  Negative
## 7        no_pasg  15.91595  Negative
  • The direction of the coefficients is still consistent with the observations made earlier.
  • The coefficient values for speed_ground and speed_air are similar.
  • no_pasg has the smallest coefficient, which implies that most of the distance values is coming from the intercept.

Step15: Comparing tables 1,2,3 and ranking the variables based on significance

  • The results obtained in all the three tables are consistent.
  • We rank the variables based on their relative importance.
table0 <- data.frame(variable_names = table2$`variable names`, rank = rank(table2$p_value))
table0
##   variable_names rank
## 1   speed_ground    1
## 2      speed_air    2
## 3      aircraft1    3
## 4         height    4
## 5          pitch    5
## 6       duration    6
## 7        no_pasg    7
  • speed_ground is the most important factor, followed by speed_air
  • no_pasg is least significant among the variables.

Check collinearity

Step16: compare regression coefficients

Model 1: LD ~ Speed_ground

Model 2: LD ~ Speed_air

Model 3: LD ~ Speed_ground + Speed_air

model1 <- lm(distance~speed_ground, data=faa4)
model2 <- lm(distance~speed_air, data=faa4)
model3 <- lm(distance~speed_ground+speed_air, data=faa4)

summary(model1)
## 
## Call:
## lm(formula = distance ~ speed_ground, data = faa4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -897.09 -319.16  -72.09  210.83 1798.88 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1773.9407    67.8388  -26.15   <2e-16 ***
## speed_ground    41.4422     0.8302   49.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 448.1 on 829 degrees of freedom
## Multiple R-squared:  0.7504, Adjusted R-squared:  0.7501 
## F-statistic:  2492 on 1 and 829 DF,  p-value: < 2.2e-16
summary(model2)
## 
## Call:
## lm(formula = distance ~ speed_air, data = faa4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -776.21 -196.39    8.72  209.17  624.34 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5455.709    207.547  -26.29   <2e-16 ***
## speed_air      79.532      1.997   39.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.3 on 201 degrees of freedom
##   (628 observations deleted due to missingness)
## Multiple R-squared:  0.8875, Adjusted R-squared:  0.887 
## F-statistic:  1586 on 1 and 201 DF,  p-value: < 2.2e-16
summary(model3)
## 
## Call:
## lm(formula = distance ~ speed_ground + speed_air, data = faa4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -819.74 -202.02    3.52  211.25  636.25 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -5462.28     207.48 -26.327  < 2e-16 ***
## speed_ground   -14.37      12.68  -1.133    0.258    
## speed_air       93.96      12.89   7.291 6.99e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 200 degrees of freedom
##   (628 observations deleted due to missingness)
## Multiple R-squared:  0.8883, Adjusted R-squared:  0.8871 
## F-statistic:   795 on 2 and 200 DF,  p-value: < 2.2e-16
  • From summary of the three models, we can observe that when speed_air and speed_ground are modeled individually, the direction of coefficient is the same as their correlation direction with distance.

  • In model3, speed_ground direction is changed and it’s p_value indicates that the coefficient is not significant.

  • This is a contradiction from the tables earlier.

  • Check correlation between speed_ground and speed_air

cor(faa4$speed_air, faa4$speed_ground, use = "pairwise.complete.obs")
## [1] 0.9879383
  • There is a high correlation between speed_air and speed_ground indicating that if these two variables are modeled together there is a high chance of multi collinearity.

  • Since there is multicollinearity, the results of model3 are unreliable and there is a sign change for coefficient of speed_ground.

  • To overcome this issue we choose only speed_ground for the following reasons

    • Speed_ground has no missing values as compared to speed_air with 628 missing values. Hence, speed_ground give more comprehensive results than speed_air.
    • The rank of speed_ground is better in table0 based on significance value
    • The coefficient size of speed_ground is larger than speed_air

Variable selection based on ranking in table0

Step17: Incremental models based on ranking in table0

  • R-squared value is extracted from the summary of the models and stored in a vector
model1 <- lm(distance~speed_ground, data = faa4)
model2 <- lm(distance~speed_ground + speed_air, data = faa4)
model3 <- lm(distance~speed_ground + speed_air + aircraft1, data = faa4)
model4 <- lm(distance~speed_ground + speed_air + aircraft1 + height, data = faa4)
model5 <- lm(distance~speed_ground + speed_air+ aircraft1 + height + pitch, data = faa4)
model6 <- lm(distance~speed_ground + speed_air+ aircraft1 + height + pitch + duration, data = faa4)
model7 <- lm(distance~speed_ground + speed_air + aircraft1 + height + pitch + duration + no_pasg, data = faa4)

sum_1 <- c()

sum_1 <- append(sum_1, summary(model1)$r.squared)
sum_1 <- append(sum_1, summary(model2)$r.squared)
sum_1 <- append(sum_1, summary(model3)$r.squared)
sum_1 <- append(sum_1, summary(model4)$r.squared)
sum_1 <- append(sum_1, summary(model5)$r.squared)
sum_1 <- append(sum_1, summary(model6)$r.squared)
sum_1 <- append(sum_1, summary(model7)$r.squared)
plot(sum_1, pch=20, cex=1, type = 'b')

  • As the number of variables increases, the r-squared value increases.
  • After all the variables are included in the model, almost 97.5% of variation in distance is explained by the model.
  • There is a huge difference between the r-squared value after variable 2 is included in the model, and after adding variable 4, the value only increased slightly.

Step18: Repeat step17 with adjusted R squared

sum_2 <- c()

sum_2 <- append(sum_2, summary(model1)$adj.r.squared)
sum_2 <- append(sum_2, summary(model2)$adj.r.squared)
sum_2 <- append(sum_2, summary(model3)$adj.r.squared)
sum_2 <- append(sum_2, summary(model4)$adj.r.squared)
sum_2 <- append(sum_2, summary(model5)$adj.r.squared)
sum_2 <- append(sum_2, summary(model6)$adj.r.squared)
sum_2 <- append(sum_2, summary(model7)$adj.r.squared)
plot(sum_2, pch=20, cex=1, type = 'b')

  • The plot for adjusted R squared is similar to that of R squared, and we can make same inferences as in step 17.

Step19: Repeat step17 with AIC values

aic_1 <- c()

aic_1 <- append(aic_1, AIC(model1))
aic_1 <- append(aic_1, AIC(model2))
aic_1 <- append(aic_1, AIC(model3))
aic_1 <- append(aic_1, AIC(model4))
aic_1 <- append(aic_1, AIC(model5))
aic_1 <- append(aic_1, AIC(model6))
aic_1 <- append(aic_1, AIC(model7))
plot(aic_1, pch=20, cex=1, type = 'b')

  • There is a huge dip in AIC value after variable 2 is added, implying that model2 is better than model1.
  • After model2, AIC values only get better with small values as more variables added.

Step20: Based on steps 18-19, what variables can be selected to build a model for landing distance

  • Based on the Adjusted R squared and AIC values, Speed_ground, speed_air and aircraft1 can be used to model landing distance.
  • Together they explain upto 90% variation in landing distance.
  • But these values are not enough to make a decision, as correlation between speed_ground and speed_air is 0.98 and we would encounter multicollinearity issue.

Step21: StepAIC function to perform forward variable selection

  • To perform StepAIC function, we omit the missing(NA) values from the dataset and store it in a new dataframe called faa5.
  • StepAIC is used from MASS package.
  • We create a null model with no variables except intercept and make a forward seletion from there.
faa5 <- faa4[,-1]
faa5 <- na.omit(faa5)
mlr <- lm(distance~1, data = faa5)
stepAIC(mlr, direction = "forward", scope = list(upper = ~speed_ground + speed_air + aircraft1 + height + pitch + duration + no_pasg, lower = mlr))
## Start:  AIC=2622.4
## distance ~ 1
## 
##                Df Sum of Sq       RSS    AIC
## + speed_air     1 118926290  14749519 2194.6
## + speed_ground  1 115311069  18364740 2237.3
## + aircraft1     1   3978580 129697229 2618.5
## <none>                      133675809 2622.4
## + height        1    445916 133229893 2623.7
## + duration      1    367280 133308529 2623.9
## + pitch         1    154735 133521074 2624.2
## + no_pasg       1    141913 133533895 2624.2
## 
## Step:  AIC=2194.58
## distance ~ speed_air
## 
##                Df Sum of Sq      RSS    AIC
## + aircraft1     1   8088045  6661474 2041.6
## + height        1   2623377 12126142 2158.4
## + pitch         1    847903 13901616 2185.0
## <none>                      14749519 2194.6
## + no_pasg       1    142098 14607421 2194.7
## + speed_ground  1     68916 14680603 2195.7
## + duration      1     14495 14735024 2196.4
## 
## Step:  AIC=2041.58
## distance ~ speed_air + aircraft1
## 
##                Df Sum of Sq     RSS    AIC
## + height        1   3217699 3443775 1914.9
## <none>                      6661474 2041.6
## + duration      1     61424 6600051 2041.8
## + no_pasg       1     47410 6614064 2042.2
## + speed_ground  1     39437 6622037 2042.4
## + pitch         1     14544 6646931 2043.2
## 
## Step:  AIC=1914.92
## distance ~ speed_air + aircraft1 + height
## 
##                Df Sum of Sq     RSS    AIC
## + no_pasg       1     39886 3403889 1914.7
## <none>                      3443775 1914.9
## + duration      1     12694 3431082 1916.2
## + pitch         1      8137 3435638 1916.5
## + speed_ground  1      6494 3437282 1916.5
## 
## Step:  AIC=1914.65
## distance ~ speed_air + aircraft1 + height + no_pasg
## 
##                Df Sum of Sq     RSS    AIC
## <none>                      3403889 1914.7
## + duration      1    9733.8 3394155 1916.1
## + pitch         1    8832.0 3395057 1916.1
## + speed_ground  1    5823.5 3398066 1916.3
## 
## Call:
## lm(formula = distance ~ speed_air + aircraft1 + height + no_pasg, 
##     data = faa5)
## 
## Coefficients:
## (Intercept)    speed_air    aircraft1       height      no_pasg  
##   -6263.754       82.032      432.074       13.776       -2.041
  • Based on the StepAIC results, we can choose speed_air, aircraft1, height and no_pasg to model landing distance.
  • This is different from the observations made earlier, because for StepAIC the data taken is smaller with missing value rows removed.
  • In the earlier dataset we could have chosen speed_ground and maybe aircraft1 to model landing distance.

Part 2

Create Binary responses

Step1: Create long.landing and risky.landing binary variables

  • We use the ifelse function to form a condition around distance and store the relevant value in the required variable
faa6 <- faa4
faa6 <- faa6[, -10]
faa6$long.landing <- ifelse(faa6$distance > 2500, 1, 0)
faa6$risky.landing <- ifelse(faa6$distance > 3000, 1, 0)
faa6 <- faa6[,-8]
str(faa6)
## tibble[,10] [831 x 10] (S3: tbl_df/tbl/data.frame)
##  $ aircraft     : chr [1:831] "boeing" "boeing" "boeing" "boeing" ...
##  $ duration     : num [1:831] 98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg      : num [1:831] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground : num [1:831] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air    : num [1:831] 109 103 NA NA NA ...
##  $ height       : num [1:831] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch        : num [1:831] 4.04 4.12 4.43 3.88 4.03 ...
##  $ aircraft1    : num [1:831] 1 1 1 1 1 1 1 1 1 1 ...
##  $ long.landing : num [1:831] 1 1 0 0 0 0 0 0 0 0 ...
##  $ risky.landing: num [1:831] 1 0 0 0 0 0 0 0 0 0 ...
  • Distance is discarded from the dataset and missing rows are still retained.
  • We will have 831 observations of 10 variables in the newly formed dataset.

Identifying important factors using binary data long.landing

Step2: Pie chart and histogram of long landing

  • hist function is used to create a histogram and pie function is used to display the pie chart of long landing.
  • The labels for the pie chart are the value of long.landing with its associated percentage in the data.
hist(faa6$long.landing, prob = TRUE,
     main = "Histogram with density curve")
lines(density(faa6$long.landing), col = 4, lwd = 2)

long_landing <- table(faa6$long.landing)
labels <- paste0(c(0,1), " = ", round(100 * long_landing/sum(long_landing), 2), "%")

pie(long_landing, labels = labels, main = "Pie chart of long.landing")

  • From the histogram and pie chart it is clear that only 12.39% of landings were long landings.

Step3: Single factor regression analysis using logistic regression

Single factor regression with original values

table1_2_df <- data.frame(matrix(ncol=5,nrow=0))
colnames(table1_2_df) <- c("variable names", "size of coef", "odds ratio", "Direction", "pvalue")

col10 <- names(faa6)[-c(1,9,10)]
glm.test <- vector("list", length(col10))


for(i in seq_along(col10)){
  glm.test[[i]] <- glm(reformulate(col10[i], "long.landing"), family = binomial, data = faa6)
  
  coef <- tidy(glm.test[[i]])[2,2]
  val <- case_when(
    coef >= 0  ~ "Positive",
    coef < 0  ~ "Negative"
  )
  table1_2_df[nrow(table1_2_df)+1,] = c(col10[i], abs(coef), exp(coef), val, as.numeric(summary(glm.test[[i]])$coefficients[2,4]))
}

table1_2_df <- table1_2_df %>% arrange(pvalue)
table1_2_df
##   variable names size of coef odds ratio Direction       pvalue
## 1   speed_ground  0.472345752  1.6037518  Positive 3.935339e-14
## 2      speed_air  0.512321766  1.6691621  Positive 4.334124e-11
## 3      aircraft1  0.864119860  2.3729167  Positive 8.398591e-05
## 4          pitch  0.400527824  1.4926123  Positive 4.664982e-02
## 5         height  0.008623997  1.0086613  Positive 4.218576e-01
## 6        no_pasg  0.007256406  0.9927699  Negative 6.058565e-01
## 7       duration  0.001070492  0.9989301  Negative 6.305122e-01
  • From the results table, it is evident that only speed_ground, speed_air, aircraft1 and pitch are significant in the model and have positive coefficients.
  • Height (with positive coefficient), Duration and No_pasg (with negative coefficients) are not significant in the model.
  • Aircraft1 has the highest odds ratio followed by speed_air, speed_ground and pitch in the same order.

Single factor regression with standardized values

table2_2_df <- data.frame(matrix(ncol=5,nrow=0))
colnames(table2_2_df) <- c("variable names", "size of coef", "odds ratio", "Direction", "pvalue")

glm.test2 <- vector("list", length(col10))

for(i in seq_along(col10)){
  data1 <- faa6[,col10[i]]
  colnames(data1)[1] <- "x1"
  faa6["data"] <- (data1$x1 - mean(data1$x1, na.rm = T))/sd(data1$x1, na.rm = T)

  glm.test2[[i]] <- glm(reformulate("data", "long.landing"), family = binomial, data = faa6)
  
  coef <- tidy(glm.test2[[i]])[2,2]
  val <- case_when(
    coef >= 0  ~ "Positive",
    coef < 0  ~ "Negative"
  )
  table2_2_df[nrow(table2_2_df)+1,] = c(col10[i], abs(coef), exp(coef), val, as.numeric(summary(glm.test2[[i]])$coefficients[2,4]))
}

table2_2_df <- table2_2_df %>% arrange(pvalue)
table2_2_df
##   variable names size of coef   odds ratio Direction       pvalue
## 1   speed_ground   8.84971670 6972.4134377  Positive 3.935339e-14
## 2      speed_air   4.98810682  146.6585091  Positive 4.334124e-11
## 3      aircraft1   0.43130192    1.5392602  Positive 8.398591e-05
## 4          pitch   0.21090555    1.2347957  Positive 4.664982e-02
## 5         height   0.08438418    1.0880468  Positive 4.218576e-01
## 6        no_pasg   0.05436003    0.9470911  Negative 6.058565e-01
## 7       duration   0.05175818    0.9495585  Negative 6.305122e-01
  • Even after standardizing the data, the p-values and the order of the factors remain same across the tables. Hence, we rank the factors based on the p-value as seen in both table1 and table2 above.

step4: Visualizations for significant factors

  • A for loop is used to create visualizations for all the significant variables against long.landing
  • jitter plot is plotted with the long.landing taken as is, whereas plot and ggplot are performed over factored long.landing variable
vars <- table1_2_df$`variable names`[table1_2_df$pvalue<0.05]
for(i in seq_along(vars)) {
  d <- subset (faa6, select = c(vars[i],"long.landing"))
  par(mfrow = c(1,2))
  plot(jitter(d$long.landing, 0.1)~jitter(d[[vars[i]]]),xlab = colnames(d)[1],ylab = "long.landing",pch=".")
  
  d$long.landing <- factor(d$long.landing, levels = c(0, 1))
  
  plot(long.landing ~ ., data = d)
  par(mfrow = c(1,1))
  print(ggplot(d,aes_string(x= colnames(d)[1], fill = "long.landing", colour = "long.landing")) +
    geom_histogram(position = "dodge",binwidth = 1))
}

  • speed_ground and speed_air have long.landing at higher values, while aircraft1 and pitch have long.landing occuring throughout their distributions.
  • aircraft1 has higher number of long.landings occuring at a value of 1, which is boeing aircraft. This explains the higher odds ratio and coeffecient of aircraft1 in the model above.

Step5: Full model, based on the earlier results and removing multicollinearity

  • The most significant variables - speed_ground, aircraft1 and pitch are considered for the full model here.
  • Speed_air is discarded as it has a lot of missing values and is highly correlated with speed_ground
full_model_2 <- glm(long.landing~speed_ground + aircraft1 + pitch, family = "binomial", data = faa6)
summary(full_model_2)
## 
## Call:
## glm(formula = long.landing ~ speed_ground + aircraft1 + pitch, 
##     family = "binomial", data = faa6)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.11589  -0.01116  -0.00026   0.00000   2.40741  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -67.92855   10.48408  -6.479 9.22e-11 ***
## speed_ground   0.61471    0.09184   6.694 2.18e-11 ***
## aircraft1      3.04348    0.73345   4.150 3.33e-05 ***
## pitch          1.06599    0.60389   1.765   0.0775 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 622.778  on 830  degrees of freedom
## Residual deviance:  81.309  on 827  degrees of freedom
## AIC: 89.309
## 
## Number of Fisher Scoring iterations: 10
BIC(full_model_2)
## [1] 108.1996
  • Pitch is not significant in this model, as speed_ground and aircraft1 can explain most of the data.
  • The AIC of this model is 89.309

Step6: Forward selection using AIC

  • step function is used for forward selection.
  • Here we create a full model and null model to set the scope of forward selection.
  • All missing rows are removed for this step to maintain the same number of rows in the models
faa7 <- na.omit(faa6)
full <- glm(long.landing~.-risky.landing-aircraft, family = "binomial", data = faa7)
nothing <- glm(long.landing~1, family = "binomial", data = faa7)
full.forward.AIC <- step(nothing,
                    scope = list(lower = formula(nothing),
                                 upper = formula(full)),
                    direction = "forward")
summary(full.forward.AIC)
## 
## Call:
## glm(formula = long.landing ~ speed_air + aircraft1 + height + 
##     pitch, family = "binomial", data = faa7)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.36696  -0.02116   0.00000   0.00098   1.64748  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -182.2356    49.9549  -3.648 0.000264 ***
## speed_air      1.5923     0.4285   3.716 0.000203 ***
## aircraft1      8.1444     2.3153   3.518 0.000435 ***
## height         0.3635     0.1142   3.184 0.001454 ** 
## pitch          1.5957     1.0958   1.456 0.145327    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 270.199  on 194  degrees of freedom
## Residual deviance:  34.278  on 190  degrees of freedom
## AIC: 44.278
## 
## Number of Fisher Scoring iterations: 10
BIC(full.forward.AIC)
## [1] 60.64335
  • pitch is still not significant in this model. However, the AIC value is 44.28, which is better than the model in step5.
  • Since the missing values are removed from the dataset, speed_air and height have become significant factors in the model.

Step7: Forward selection using BIC

full.forward.BIC <- step(nothing,
                         scope = list(lower = formula(nothing),
                                      upper = formula(full)),
                         direction = "forward",
                         k = log(nrow(faa7)))
summary(full.forward.BIC)
## 
## Call:
## glm(formula = long.landing ~ speed_air + aircraft1 + height, 
##     family = "binomial", data = faa7)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.50362  -0.04226   0.00000   0.00348   2.22995  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -150.64570   36.73061  -4.101 4.11e-05 ***
## speed_air      1.35748    0.32882   4.128 3.65e-05 ***
## aircraft1      7.42305    1.97283   3.763 0.000168 ***
## height         0.32871    0.09986   3.292 0.000996 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 270.199  on 194  degrees of freedom
## Residual deviance:  36.798  on 191  degrees of freedom
## AIC: 44.798
## 
## Number of Fisher Scoring iterations: 10
BIC(full.forward.BIC)
## [1] 57.89043
  • This model has all significant factors and pitch is not considered in the model.
  • The AIC value of this model is higher than that of the model in step6, however BIC is better for this model.

Step8: Presentation to FAA agent

  • Speed_air, aircraft and height are the most determining factors for long landings as we select the following model

long.landing ~ speed_air + aircraft1 + height

  • Since there are a lot of missing values in the data, the model is performed only on 195 observations that are complete.

* There are higher number of long landings for boeing aircraft and higher speed in the air.

##   variable names size of coef       odds ratio Direction
## 1      speed_air      1.35748  3.8863872554824  Positive
## 2      aircraft1      7.42305 1674.13183093398  Positive
## 3         height      0.32871 1.38917493643559  Positive

Identifying important factors for risky landing

Step9: Repeat steps 1-7 for risky landing

Histogram and pie chart for risky landing

hist(faa6$risky.landing, prob = TRUE,
     main = "Histogram with density curve")
lines(density(faa6$risky.landing), col = 4, lwd = 2)

risky_landing <- table(faa6$risky.landing)
labels <- paste0(c(0,1), " = ", round(100 * risky_landing/sum(risky_landing), 2), "%")

pie(risky_landing, labels = labels, main = "Pie chart of risky.landing")

* There is only 7.34% of observations that are risky landing in the data as noted from the pie chart and histogram.

Single factor regression with original values

table1_2_df <- data.frame(matrix(ncol=5,nrow=0))
colnames(table1_2_df) <- c("variable names", "size of coef", "odds ratio", "Direction", "pvalue")

col10 <- names(faa6)[-c(1,9,10)]
glm.test <- vector("list", length(col10))


for(i in seq_along(col10)){
  glm.test[[i]] <- glm(reformulate(col10[i], "risky.landing"), family = binomial, data = faa6)
  
  coef <- tidy(glm.test[[i]])[2,2]
  val <- case_when(
    coef >= 0  ~ "Positive",
    coef < 0  ~ "Negative"
  )
  table1_2_df[nrow(table1_2_df)+1,] = c(col10[i], abs(coef), exp(coef), val, as.numeric(summary(glm.test[[i]])$coefficients[2,4]))
}

table1_2_df <- table1_2_df %>% arrange(pvalue)
table1_2_df
##   variable names size of coef odds ratio Direction       pvalue
## 1   speed_ground  0.614218747  1.8482121  Positive 6.898006e-08
## 2      speed_air  0.870401902  2.3878703  Positive 3.728032e-06
## 3      aircraft1  1.001775330  2.7231120  Positive 4.560563e-04
## 4          pitch  0.371071969  1.4492874  Positive 1.432961e-01
## 5        no_pasg  0.025379344  0.9749400  Negative 1.536237e-01
## 6       duration  0.001151836  0.9988488  Negative 6.801987e-01
## 7         height  0.002218606  0.9977839  Negative 8.705917e-01
  • From the results table, it is evident that only speed_ground, speed_air and aircraft1 are significant in the model and have positive coefficients.
  • No_pasg, duration and height are not significant in the model and have negative coefficients.
  • Aircraft1 has the highest odds ratio followed by speed_air and speed_ground in the same order.

Single factor regression with standardized values

table2_2_df <- data.frame(matrix(ncol=5,nrow=0))
colnames(table2_2_df) <- c("variable names", "size of coef", "odds ratio", "Direction", "pvalue")

glm.test2 <- vector("list", length(col10))

for(i in seq_along(col10)){
  data1 <- faa6[,col10[i]]
  colnames(data1)[1] <- "x1"
  faa6["data"] <- (data1$x1 - mean(data1$x1, na.rm = T))/sd(data1$x1, na.rm = T)

  glm.test2[[i]] <- glm(reformulate("data", "risky.landing"), family = binomial, data = faa6)
  
  coef <- tidy(glm.test2[[i]])[2,2]
  val <- case_when(
    coef >= 0  ~ "Positive",
    coef < 0  ~ "Negative"
  )
  table2_2_df[nrow(table2_2_df)+1,] = c(col10[i], abs(coef), exp(coef), val, as.numeric(summary(glm.test2[[i]])$coefficients[2,4]))
}

table2_2_df <- table2_2_df %>% arrange(pvalue)
table2_2_df
##   variable names size of coef   odds ratio Direction       pvalue
## 1   speed_ground  11.50780308 9.948907e+04  Positive 6.898006e-08
## 2      speed_air   8.47447435 4.790904e+03  Positive 3.728032e-06
## 3      aircraft1   0.50000891 1.648736e+00  Positive 4.560563e-04
## 4          pitch   0.19539501 1.215791e+00  Positive 1.432961e-01
## 5        no_pasg   0.19012470 8.268560e-01  Negative 1.536237e-01
## 6       duration   0.05569118 9.458312e-01  Negative 6.801987e-01
## 7         height   0.02170864 9.785253e-01  Negative 8.705917e-01
  • Even after standardizing the data, the p-values and the order of the factors remain same across the tables. Hence, we rank the factors based on the p-value as seen in both table1 and table2 above.

Visualizations for significant factors

vars <- table1_2_df$`variable names`[table1_2_df$pvalue<0.05]
for(i in seq_along(vars)) {
  d <- subset (faa6, select = c(vars[i],"risky.landing"))
  par(mfrow = c(1,2))
  plot(jitter(d$risky.landing, 0.1)~jitter(d[[vars[i]]]),xlab = colnames(d)[1],ylab = "risky.landing",pch=".")
  
  d$risky.landing <- factor(d$risky.landing, levels = c(0, 1))
  
  plot(risky.landing ~ ., data = d)
  par(mfrow = c(1,1))
  print(ggplot(d,aes_string(x= colnames(d)[1], fill = "risky.landing", colour = "risky.landing")) +
    geom_histogram(position = "dodge",binwidth = 1))
}

  • speed_ground and speed_air have long.landing at higher values.
  • aircraft1 has higher number of long.landings occuring at a value of 1, which is boeing aircraft. This explains the higher odds ratio and coeffecient of aircraft1 in the model above.
  • These observations are consistent with long.landing except for pitch which is not significant in risky landing case.

Full model, based on the earlier results and removing multicollinearity

  • The most significant variables - speed_ground, aircraft1 are considered for the full model here.
  • Speed_air is discarded as it has a lot of missing values and is highly correlated with speed_ground
full_model_2 <- glm(risky.landing~speed_ground + aircraft1, family = "binomial", data = faa6)
summary(full_model_2)
## 
## Call:
## glm(formula = risky.landing ~ speed_ground + aircraft1, family = "binomial", 
##     data = faa6)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.24398  -0.00011   0.00000   0.00000   1.61021  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -102.0772    24.7751  -4.120 3.79e-05 ***
## speed_ground    0.9263     0.2248   4.121 3.78e-05 ***
## aircraft1       4.0190     1.2494   3.217   0.0013 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 436.043  on 830  degrees of freedom
## Residual deviance:  40.097  on 828  degrees of freedom
## AIC: 46.097
## 
## Number of Fisher Scoring iterations: 12
BIC(full_model_2)
## [1] 60.26449
  • All the variables taken are significant in the model.
  • The AIC of this model is 46.097 and BIC is 60.26

Forward selection using AIC

faa7 <- na.omit(faa6)
full <- glm(risky.landing~.-long.landing-aircraft, family = "binomial", data = faa7)
nothing <- glm(risky.landing~1, family = "binomial", data = faa7)
full.forward.AIC1 <- step(nothing,
                    scope = list(lower = formula(nothing),
                                 upper = formula(full)),
                    direction = "forward")
summary(full.forward.AIC1)
## 
## Call:
## glm(formula = risky.landing ~ speed_air + aircraft1, family = "binomial", 
##     data = faa7)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.67140  -0.00714  -0.00063   0.00008   2.47432  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -133.7174    33.5105  -3.990 6.60e-05 ***
## speed_air      1.2206     0.3064   3.984 6.78e-05 ***
## aircraft1      4.5499     1.5124   3.008  0.00263 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 240.724  on 194  degrees of freedom
## Residual deviance:  26.281  on 192  degrees of freedom
## AIC: 32.281
## 
## Number of Fisher Scoring iterations: 9
BIC(full.forward.AIC1)
## [1] 42.09974
  • The AIC value is 32.281, which is better than the earlier model and all the variables in this model are significant
  • Since the missing values are removed from the dataset, speed_air and aircraft1 have become significant factors in the model.
  • The BIC of the model is 42.099 which is an improvement from the earlier model.

Forward selection using BIC

full.forward.BIC1 <- step(nothing,
                         scope = list(lower = formula(nothing),
                                      upper = formula(full)),
                         direction = "forward",
                         k = log(nrow(faa7)))
summary(full.forward.BIC1)
## 
## Call:
## glm(formula = risky.landing ~ speed_air + aircraft1, family = "binomial", 
##     data = faa7)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.67140  -0.00714  -0.00063   0.00008   2.47432  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -133.7174    33.5105  -3.990 6.60e-05 ***
## speed_air      1.2206     0.3064   3.984 6.78e-05 ***
## aircraft1      4.5499     1.5124   3.008  0.00263 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 240.724  on 194  degrees of freedom
## Residual deviance:  26.281  on 192  degrees of freedom
## AIC: 32.281
## 
## Number of Fisher Scoring iterations: 9
BIC(full.forward.BIC1)
## [1] 42.09974
  • One difference here from long.landing models is that the forward selection with both AIC and BIC criterion result in the same model with same AIC and BIC values.

Step10 - Presentation to FAA agent

  • Speed_air and aircraft are the most determining factors for risky landings and hence we select the following model

risky.landing ~ speed_air + aircraft1

  • Since there are a lot of missing values in the data, the model is performed only on 195 observations that are complete.

  • There are higher number of long landings for boeing aircraft and higher speed in the air.
##   variable names size of coef       odds ratio Direction
## 1      speed_air       1.2206 3.38922065607726  Positive
## 2      aircraft1       4.5499 94.6229455472389  Positive

Step11: Summary of differences between long landing and risky landing models

  • The standard error of the estimates for risky landing are lesser than that of long landing.
  • Pitch is not considered in risky landing in all the models, where are long landing has pitch in AIC forward selected model.

Step 12: ROC curve

preds=predict(full.forward.BIC)
roc1=roc(faa7$long.landing ~ preds)
plot(roc1)

preds2 <- predict(full.forward.BIC1)
roc2 <- roc(faa7$risky.landing ~ preds2)
plot(roc2, add=TRUE, col='red')

* Risky landing has a better ROC curve than long landing.

auc(roc1)
## Area under the curve: 0.9945
auc(roc2)
## Area under the curve: 0.9962
  • The area under curve for long landing is 0.9945 where as for risky landing is 0.9962, suggesting that the model for risky landing is more accurate than the model for long landing.

Step13: Predicting the probability of long landing and risky landing for new data

  • We use predict function with type = “response” to predict the probability of long landing and risky landing for new data (Boeing, duration=200, no_pasg=80, speed_ground=115, speed_air=120, height=40, pitch=4)
new.ind<-data.frame(aircraft="Boeing",duration=200,no_pasg=80,speed_ground=115,speed_air=120,height=40,pitch=4,aircraft1 = 1)

## Long landing probability prediction
predict(full.forward.BIC,newdata=new.ind,type="response") ### Probability

pred_bic <- predict(full.forward.BIC,newdata=new.ind,type="response",se=T)
c(pred_bic$fit[1]-1.96*as.numeric(pred_bic$se.fit[1]),pred_bic$fit[1]+1.96*as.numeric(pred_bic$se.fit[1]))

## risky landing probability prediction
predict(full.forward.BIC1,newdata=new.ind,type="response") ### Probability

pred_bic1 <- predict(full.forward.BIC1,newdata=new.ind,type="response",se=T)
c(pred_bic1$fit[1]-1.96*as.numeric(pred_bic1$se.fit[1]),pred_bic1$fit[1]+1.96*as.numeric(pred_bic1$se.fit[1]))
  • The probability of long landing and risky landing for the given data is 1, as speed_air is 120 which is at the higher end always results in long landing and risky landing.

  • The confidence interval for both the probabilities is from 0.9999997 to 1.0000002