Packages Used

library(readxl)   
library(knitr)     
library(dplyr)
library(tidyverse)
library(faraway)
library(ggplot2)

Variable Descriptions

Aircraft: The make of an aircraft (Boeing or Airbus).
Duration (in minutes): Flight duration between taking off and landing. The duration of a normal flight should always be greater than 40min.
No_pasg: The number of passengers in a flight.
Speed_ground (in miles per hour): The ground speed of an aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal.
Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal.
Height (in meters): The height of an aircraft when it is passing over the threshold of the runway. The landing aircraft is required to be at least 6 meters high at the threshold of the runway.
Pitch (in degrees): Pitch angle of an aircraft when it is passing over the threshold of the runway.
Distance (in feet): The landing distance of an aircraft. More specifically, it refers to the distance between the threshold of the runway and the point where the aircraft can be fully stopped. The length of the airport runway is typically less than 6000 feet.

Steps 1-9 of Part-1 of the Project:

Step 1: Reading the datasets

Read the two files ‘FAA-1.xls’ (800 flights) and ‘FAA-2.xls’ into your R system. Please search “Read Excel files from R” in Google in case you do not know how to do that.

knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
setwd("C:/Users/renju/OneDrive/Documents/MSBANA/Spring/Spring Term 1/Statistical Modelling - 7042/Week 2")
# Read in Dataset 1
FAA1 <- read_xls("FAA1.xls")
# Read in Dataset 2
FAA2 <- read_xls("FAA2.xls")

Step 2: Let us look at the structure and dimensions of each dataset and check whether the data types and column names are appropriate.

## tibble [800 × 8] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : chr [1:800] "boeing" "boeing" "boeing" "boeing" ...
##  $ duration    : num [1:800] 98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num [1:800] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:800] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:800] 109 103 NA NA NA ...
##  $ height      : num [1:800] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:800] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:800] 3370 2988 1145 1664 1050 ...

We can observe that there are 800 observations and 8 variables in the FAA1 dataset.

## tibble [150 × 7] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : chr [1:150] "boeing" "boeing" "boeing" "boeing" ...
##  $ no_pasg     : num [1:150] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:150] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:150] 109 103 NA NA NA ...
##  $ height      : num [1:150] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:150] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:150] 3370 2988 1145 1664 1050 ...

We can observe that there are 150 observations and 7 variables in the FAA2 dataset. FAA2 has no duration column, so we add an empty (NA) duration column to it before merging the two datasets.

Step 3: Merging the datasets and checking for duplicate records.

## [1] 950   8

## [1] 100

The merged dataset has 950 rows and 8 columns, of which 100 rows are duplicates. Removing them leaves 850 unique observations.
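The merge and duplicate check for this step are not echoed in the report, so the following is only a minimal sketch of one way to do it in base R. The toy data frames `faa1` and `faa2` and their values are made up for illustration; the key idea is that duration must be left out of the duplicate key, because it is NA for every row that comes from the second file.

```r
# Toy stand-ins for the two files: faa1 has duration, faa2 does not.
faa1 <- data.frame(
  aircraft     = c("boeing", "airbus", "boeing"),
  duration     = c(98.5, 125.7, 112.0),
  speed_ground = c(107.9, 101.7, 71.1),
  distance     = c(3370, 2988, 1145)
)
faa2 <- data.frame(
  aircraft     = c("boeing", "airbus"),
  speed_ground = c(107.9, 64.2),  # first row repeats a flight from faa1
  distance     = c(3370, 980)
)

# Add the missing column so both tables share the same layout, then stack.
faa2$duration <- NA_real_
merged <- rbind(faa1, faa2[, names(faa1)])

# duration is NA for all faa2 rows, so exclude it from the comparison;
# otherwise a repeated flight would never match its faa1 copy.
key_cols <- setdiff(names(merged), "duration")
is_dup <- duplicated(merged[, key_cols])

n_duplicates <- sum(is_dup)
deduped <- merged[!is_dup, ]
```

Because `duplicated()` flags the later occurrence, the copy that is kept is the one from the first file, which still carries its duration value.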

Step 4: Let us look at the structure of the merged dataset, dimension and summary statistics for each variable.

## tibble [850 × 8] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : chr [1:850] "boeing" "boeing" "boeing" "boeing" ...
##  $ duration    : num [1:850] 98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num [1:850] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:850] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:850] 109 103 NA NA NA ...
##  $ height      : num [1:850] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:850] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:850] 3370 2988 1145 1664 1050 ...

We can observe that there are 850 observations and 8 variables in the merged dataset.

Summary statistics for each variable:

##    aircraft            duration         no_pasg      speed_ground   
##  Length:850         Min.   : 14.76   Min.   :29.0   Min.   : 27.74  
##  Class :character   1st Qu.:119.49   1st Qu.:55.0   1st Qu.: 65.90  
##  Mode  :character   Median :153.95   Median :60.0   Median : 79.64  
##                     Mean   :154.01   Mean   :60.1   Mean   : 79.45  
##                     3rd Qu.:188.91   3rd Qu.:65.0   3rd Qu.: 92.06  
##                     Max.   :305.62   Max.   :87.0   Max.   :141.22  
##                     NA's   :50                                      
##    speed_air          height           pitch          distance      
##  Min.   : 90.00   Min.   :-3.546   Min.   :2.284   Min.   :  34.08  
##  1st Qu.: 96.25   1st Qu.:23.314   1st Qu.:3.642   1st Qu.: 883.79  
##  Median :101.15   Median :30.093   Median :4.008   Median :1258.09  
##  Mean   :103.80   Mean   :30.144   Mean   :4.009   Mean   :1526.02  
##  3rd Qu.:109.40   3rd Qu.:36.993   3rd Qu.:4.377   3rd Qu.:1936.95  
##  Max.   :141.72   Max.   :59.946   Max.   :5.927   Max.   :6533.05  
##  NA's   :642

## [1] 850   8

After removing the duplicates, no duplicate observations remain in the merged dataset.

Step 5:

A few variables contain abnormal values (for example, a negative minimum height and a minimum duration below 40 minutes); these observations are removed in Step 6.
There are missing values in two columns: duration (50 missing) and speed_air (642 missing).
The data are roughly balanced, as the proportion of observations is almost the same for the two aircraft makes.

Step 6:

Checking for abnormal values in the variables of the dataset and removing them.

## [1] 850   8

We have removed 19 observations by filtering out those with abnormal values, which leaves 831 of the 850 observations.
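The filtering code is not shown in the report. As a rough sketch, the rules from the Variable Descriptions section could be applied as below. The toy data frame `flights` is invented, and two choices are assumptions rather than the report's confirmed method: rows with a missing duration or speed_air are kept, and a distance of 6000 feet or more is treated as abnormal.

```r
# Toy data: row 2 has a short duration, row 4 an excessive ground speed
# and distance, row 5 a negative height.
flights <- data.frame(
  duration     = c(98.5, 35.0, NA, 150.2, 120.0),
  speed_ground = c(107.9, 80.0, 75.0, 141.2, 90.0),
  speed_air    = c(109.0, NA, NA, 141.7, NA),
  height       = c(27.4, 30.0, 32.4, 25.0, -3.5),
  distance     = c(3370, 1500, 1050, 6533, 1200)
)

# A row is kept when every recorded value is within its normal range.
# Missing duration / speed_air values are not counted as abnormal (assumption).
keep <- with(flights,
  (is.na(duration)  | duration > 40) &
  (speed_ground >= 30 & speed_ground <= 140) &
  (is.na(speed_air) | (speed_air >= 30 & speed_air <= 140)) &
  height >= 6 &
  distance < 6000
)

cleaned   <- flights[keep, ]
n_removed <- sum(!keep)
```

Comparing `nrow(flights)` with `nrow(cleaned)` gives a quick check that the number of removed rows matches what is reported in the text.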

Step 7:

Let us again look at the structure of the merged dataset, dimension and summary statistics for each variable.

## tibble [831 × 8] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : num [1:831] 0 0 0 0 0 0 0 0 0 0 ...
##  $ duration    : num [1:831] 98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num [1:831] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:831] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:831] 109 103 NA NA NA ...
##  $ height      : num [1:831] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:831] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:831] 3370 2988 1145 1664 1050 ...

##     aircraft         duration         no_pasg       speed_ground   
##  Min.   :0.0000   Min.   : 41.95   Min.   :29.00   Min.   : 33.57  
##  1st Qu.:0.0000   1st Qu.:119.63   1st Qu.:55.00   1st Qu.: 66.20  
##  Median :1.0000   Median :154.28   Median :60.00   Median : 79.79  
##  Mean   :0.5343   Mean   :154.78   Mean   :60.06   Mean   : 79.54  
##  3rd Qu.:1.0000   3rd Qu.:189.66   3rd Qu.:65.00   3rd Qu.: 91.91  
##  Max.   :1.0000   Max.   :305.62   Max.   :87.00   Max.   :132.78  
##                   NA's   :50                                       
##    speed_air          height           pitch          distance      
##  Min.   : 90.00   Min.   : 6.228   Min.   :2.284   Min.   :  41.72  
##  1st Qu.: 96.23   1st Qu.:23.530   1st Qu.:3.640   1st Qu.: 893.28  
##  Median :101.12   Median :30.167   Median :4.001   Median :1262.15  
##  Mean   :103.48   Mean   :30.458   Mean   :4.005   Mean   :1522.48  
##  3rd Qu.:109.36   3rd Qu.:37.004   3rd Qu.:4.370   3rd Qu.:1936.63  
##  Max.   :132.91   Max.   :59.946   Max.   :5.927   Max.   :5381.96  
##  NA's   :628

Step 8: Let us plot histograms for all the numerical variables.

Step 9:

We can observe from the histograms that the columns duration, no_pasg, speed_ground, height and pitch are almost symmetric.
The speed_air column is right-skewed.
The distance column is also right-skewed.

Part 2 of the project - Practice of modeling a binary response using logistic regression.

Create binary responses

Step 1

## # A tibble: 6 × 9
##   aircraft duration no_pasg speed_ground speed_air height pitch long.landing
##      <dbl>    <dbl>   <dbl>        <dbl>     <dbl>  <dbl> <dbl>        <dbl>
## 1        0     98.5      53        108.       109.   27.4  4.04            1
## 2        0    126.       69        102.       103.   27.8  4.12            1
## 3        0    112.       61         71.1       NA    18.6  4.43            0
## 4        0    197.       56         85.8       NA    30.7  3.88            0
## 5        0     90.1      70         59.9       NA    32.4  4.03            0
## 6        0    138.       55         75.0       NA    41.2  4.20            0
## # ℹ 1 more variable: risky.landing <dbl>

We have created the two binary variables, long.landing and risky.landing and discarded the distance variable from the cleaned dataset.

The binary variables long.landing and risky.landing provide a simplified representation of landing safety based on the distance criteria. These variables will be useful for predictive analysis and classification tasks. However, the dataset contains missing values in the speed_air column, which might impact the analysis if not handled appropriately. Depending on the requirements, these missing values could be addressed through imputation or by excluding the affected rows or columns.

The removal of the continuous distance variable ensures that only the binary outcomes related to landing safety are analyzed. This is in line with the project requirements and simplifies the dataset for focused analysis on landing conditions.
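A minimal sketch of how the two binary responses could be derived is shown below. The 2500-foot cutoff for long.landing is the one used later in this report; the 3000-foot cutoff for risky.landing and the coding boeing = 0 / airbus = 1 are assumptions (the coding is inferred from the printed rows, where the boeing flights show aircraft = 0). The toy data frame `cleaned` is invented.

```r
# Toy cleaned data with the continuous landing distance still present.
cleaned <- data.frame(
  aircraft = c("boeing", "boeing", "airbus", "airbus"),
  distance = c(3370, 2988, 1145, 2400)
)

long_cutoff  <- 2500  # threshold used in this report
risky_cutoff <- 3000  # assumed threshold, not stated in this section

# 1 when the landing distance exceeds the cutoff, 0 otherwise.
cleaned$long.landing  <- as.numeric(cleaned$distance > long_cutoff)
cleaned$risky.landing <- as.numeric(cleaned$distance > risky_cutoff)

# Numeric aircraft indicator (assumed coding: boeing = 0, airbus = 1).
cleaned$aircraft <- as.numeric(cleaned$aircraft == "airbus")

# Drop the continuous response so only the binary outcomes remain.
cleaned$distance <- NULL
```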

Identifying important factors using the binary data of “long.landing”.

Step 2

We can see that most flights have a landing distance below 2500 feet, so long.landing = 0 for the majority of observations.

The histogram illustrates the distribution of the binary variable long.landing, where the majority of flights fall into the category long.landing = 0, indicating that most flights do not exceed the threshold of 2500 feet. A significantly smaller number of flights are classified as long.landing = 1, representing long landings. This suggests a class imbalance in the data, which could impact subsequent analysis or predictive modeling. The imbalance highlights the need for further investigation into the features (e.g., speed_ground, height, pitch) that correlate with long landings to understand the underlying factors. Additionally, for predictive modeling, methods such as oversampling, undersampling, or adjusting class weights may be required to address the imbalance effectively. Supplementary visualizations, such as a pie chart, could also provide a clearer view of the proportions in the dataset.
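The class imbalance can also be quantified directly rather than read off a plot. The sketch below uses a made-up response vector, not the FAA data.

```r
# Made-up binary response: 2 long landings out of 10 flights.
long.landing <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0)

counts <- table(long.landing)   # number of flights in each class
shares <- prop.table(counts)    # proportion of flights in each class
```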

Step 3

## [1] "aircraft"      "duration"      "no_pasg"       "speed_ground" 
## [5] "speed_air"     "height"        "pitch"         "long.landing" 
## [9] "risky.landing"

##         variable      p-value reg_coef_direction
## lv4 speed_ground 3.935339e-14           positive
## lv5    speed_air 4.334124e-11           positive
## lv1     aircraft 8.398591e-05           negative
## lv7        pitch 4.664982e-02           positive
## lv6       height 4.218576e-01           positive
## lv3      no_pasg 6.058565e-01           negative
## lv2     duration 6.305122e-01           negative

##      aircraft      duration       no_pasg  speed_ground     speed_air 
## -4.969951e-17 -8.871411e-17  2.040555e-16 -3.188666e-16  6.767302e-16 
##        height         pitch 
##  5.488281e-17  4.857226e-16

##     aircraft     duration      no_pasg speed_ground    speed_air       height 
##            1            1            1            1            1            1 
##        pitch 
##            1

##          variable    Estimate   odds_ratio reg_coef_direction     Pr(>|z|)
## lv11 speed_ground  8.84971670 6972.4134377           positive 3.935339e-14
## lv12    speed_air  4.98810682  146.6585091           positive 4.334124e-11
## lv8      aircraft -0.43130192    0.6496627           negative 8.398591e-05
## lv14        pitch  0.21090555    1.2347957           positive 4.664982e-02
## lv13       height  0.08438418    1.0880468           positive 4.218576e-01
## lv10      no_pasg -0.05436003    0.9470911           negative 6.058565e-01
## lv9      duration -0.05175818    0.9495585           negative 6.305122e-01

From the above table, we can observe that the three most significant factors are speed_ground, speed_air and aircraft by p-value.

The logistic regression analysis identifies the relationship between long.landing (a binary response variable) and potential predictors, including aircraft, duration, no_pasg, speed_ground, speed_air, height, and pitch. The results provide regression coefficients, odds ratios, the direction of the coefficients (positive or negative), and p-values for each predictor. Variables with smaller p-values are more statistically significant and are ranked higher in the table. Because the predictors are standardized (mean 0, standard deviation 1), each odds ratio gives the multiplicative change in the odds of a long landing for a one-standard-deviation increase in that predictor, while the direction of the coefficient indicates whether the predictor increases or decreases the probability of a long landing. Based on the findings, the most significant predictors should be prioritized for further analysis, such as multivariable logistic regression or interaction effect exploration. Additionally, practical steps can be taken to mitigate the risk of long landings by focusing on modifiable factors such as speed_ground. Standardizing the variables puts all predictors on a common scale, which makes their coefficients directly comparable.
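The code that produced the table above is not echoed, so the following is a hedged sketch of how such a single-factor table could be built: standardize each predictor, fit one logistic regression per predictor, and collect the estimate, odds ratio, sign, and p-value. The data are simulated (only speed_ground drives the outcome), and the helper `fit_one` is a hypothetical name.

```r
set.seed(1)
n <- 400
sim <- data.frame(
  speed_ground = rnorm(n, 80, 19),
  height       = rnorm(n, 30, 10),
  pitch        = rnorm(n, 4, 0.5)
)
# Simulated response: only speed_ground affects the outcome here.
sim$long.landing <- rbinom(n, 1, plogis(-1 + 0.08 * (sim$speed_ground - 80)))

predictors <- c("speed_ground", "height", "pitch")

# Standardize so each coefficient is the change in log odds per one SD.
stdz <- sim
stdz[predictors] <- lapply(sim[predictors], function(x) (x - mean(x)) / sd(x))

# Fit long.landing ~ v for a single predictor v and return one table row.
fit_one <- function(v) {
  fit <- glm(reformulate(v, response = "long.landing"),
             family = binomial, data = stdz)
  est <- coef(summary(fit))[v, ]
  data.frame(
    variable           = v,
    estimate           = est[["Estimate"]],
    odds_ratio         = exp(est[["Estimate"]]),
    reg_coef_direction = ifelse(est[["Estimate"]] > 0, "positive", "negative"),
    p_value            = est[["Pr(>|z|)"]]
  )
}

single_factor <- do.call(rbind, lapply(predictors, fit_one))
single_factor <- single_factor[order(single_factor$p_value), ]  # most significant first
```

Sorting by p-value reproduces the ranking logic of the table; the odds ratio column is simply the exponentiated estimate.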

Step 4

## 
## Call:
## glm(formula = long.landing ~ speed_ground + speed_air + aircraft, 
##     family = binomial, data = stdz_FAA.l1)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    5.4480     5.2625   1.035  0.30056    
## speed_ground  -3.0381     4.1011  -0.741  0.45882    
## speed_air      8.6033     2.6413   3.257  0.00113 ** 
## aircraft      -1.8993     0.4295  -4.422 9.78e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 281.373  on 202  degrees of freedom
## Residual deviance:  69.013  on 199  degrees of freedom
##   (628 observations deleted due to missingness)
## AIC: 77.013
## 
## Number of Fisher Scoring iterations: 8

The visualizations highlight the relationship between long.landing and the significant predictors speed_ground, speed_air, aircraft, and pitch. In the regression model, the coefficient for speed_air is 8.6033 (p = 0.0011), indicating a strong positive association between higher air speed and the likelihood of a long landing (long.landing = 1). By contrast, speed_ground has a coefficient of -3.0381 (p = 0.4588) and is not statistically significant in this model; the sign change relative to Step 3 and the large standard error (4.1011) are consistent with collinearity between speed_ground and speed_air. The aircraft variable has a coefficient of -1.8993 (p = 9.78e-06), a significant negative association between the aircraft indicator and the probability of a long landing. The intercept is 5.4480 (p = 0.3006), representing the baseline log odds of a long landing when all predictors are zero (that is, at their means, since the predictors are standardized). These findings emphasize that managing speed_air and understanding the impact of aircraft type are critical for reducing long landings. Additional exploration of interaction effects between predictors, such as speed_ground and speed_air, may provide deeper insights into their combined influence. The model’s residual deviance of 69.013 and AIC of 77.013 indicate a reasonable fit, but the model uses only 203 observations because the 628 rows with missing speed_air are dropped, and further validation on unseen data is necessary to confirm generalizability.

Step 5

## 
## Call:
## glm(formula = long.landing ~ speed_ground + aircraft, family = binomial, 
##     data = FAA_l_full)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -12.7036     1.7714  -7.172 7.42e-13 ***
## speed_ground  10.9667     1.5815   6.934 4.08e-12 ***
## aircraft      -1.6156     0.3553  -4.547 5.45e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 622.778  on 830  degrees of freedom
## Residual deviance:  84.665  on 828  degrees of freedom
## AIC: 90.665
## 
## Number of Fisher Scoring iterations: 10

Based on the results from Steps 3-4 and the collinearity analysis in Step 16 (Part 1), speed_ground and aircraft are included in the full model. However, speed_air is excluded due to its high collinearity with speed_ground and its numerous missing values (628 of 831 observations), which would shrink the sample available for any AIC- or BIC-based comparison. Although speed_air is statistically significant, these factors make it unsuitable for inclusion in the final model.

The full logistic regression model, incorporating speed_ground and aircraft as predictors, highlights their significant role in predicting long.landing. Higher speed_ground is associated with an increased likelihood of long landings, as indicated by a coefficient of 10.9667 (p = 4.08e-12). This finding emphasizes that controlling ground speed is crucial to reducing risk. Conversely, the aircraft variable has a negative coefficient of -1.6156 (p = 5.45e-06), suggesting that aircraft make affects the probability of long landings, potentially due to design or operational differences. The intercept is -12.7036 (p = 7.42e-13), representing the baseline log odds of a long landing when all predictors are at zero. The model achieves a residual deviance of 84.665 on 828 degrees of freedom (down from a null deviance of 622.778) and an AIC of 90.665, indicating a good model fit. Both predictors are statistically significant with small p-values, suggesting a strong and reliable relationship with the response variable. This model addresses the collinearity concern and provides a solid foundation for predictive analysis. Next steps include validating the model on unseen data to assess its generalizability and exploring potential interaction effects between the predictors for further refinement.

Step 6

## 
## Call:
## glm(formula = long.landing ~ speed_ground + aircraft + height + 
##     pitch, family = binomial, data = stdz_FAA.l1)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -22.0249     4.3431  -5.071 3.95e-07 ***
## speed_ground  19.1602     3.8014   5.040 4.65e-07 ***
## aircraft      -2.5627     0.5894  -4.348 1.37e-05 ***
## height         2.5240     0.6713   3.760  0.00017 ***
## pitch          0.8096     0.4429   1.828  0.06755 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 622.778  on 830  degrees of freedom
## Residual deviance:  53.204  on 826  degrees of freedom
## AIC: 63.204
## 
## Number of Fisher Scoring iterations: 12

The results of forward variable selection using AIC indicate that speed_ground, aircraft, and height are significant predictors, with pitch also included in the model. Note the contrast with the single-factor table in Step 3: there, pitch (p = 0.047) ranked above height (p = 0.42), whereas in this multivariable model the order is reversed. Height is highly significant (p = 0.00017) once speed_ground and aircraft are accounted for, while pitch is only marginally significant (p = 0.068).

The logistic regression model fitted with forward selection includes the predictors speed_ground, aircraft, height, and pitch to predict long.landing. The model summary shows the following:
speed_ground: The coefficient is 19.1602 (p = 4.65e-07), indicating a strong and statistically significant positive relationship between ground speed and the likelihood of a long landing.
aircraft: The coefficient is -2.5627 (p = 1.37e-05), a significant negative association between the aircraft indicator and the probability of a long landing.
height: The coefficient is 2.5240 (p = 0.00017), showing a significant positive association between height and the likelihood of a long landing.
pitch: The coefficient is 0.8096 (p = 0.06755), indicating a weaker, marginally significant relationship between pitch and long landings.
The intercept is -22.0249 (p = 3.95e-07), representing the baseline log odds when all predictors are zero.

The residual deviance is 53.204 on 826 degrees of freedom, and the AIC is 63.204, suggesting a good model fit.

Step 7

## 
## Call:
## glm(formula = long.landing ~ speed_ground + aircraft + height, 
##     family = binomial, data = stdz_FAA.l1)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -19.8638     3.6399  -5.457 4.84e-08 ***
## speed_ground  17.3600     3.2304   5.374 7.70e-08 ***
## aircraft      -2.5196     0.5566  -4.527 5.99e-06 ***
## height         2.2609     0.5831   3.877 0.000106 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 622.778  on 830  degrees of freedom
## Residual deviance:  57.047  on 827  degrees of freedom
## AIC: 65.047
## 
## Number of Fisher Scoring iterations: 11

Based on the results of forward variable selection using BIC, the significant factors included in the model are speed_ground, aircraft, and height. In the single-factor table in Step 3, pitch had a smaller p-value than height, but here height is retained and pitch is not. Compared to the AIC model, the BIC model excludes the pitch variable. This is because BIC penalizes each additional parameter more heavily than AIC and therefore tends to favor simpler models with fewer variables.

The logistic regression model selected using BIC includes speed_ground, aircraft, and height as significant predictors of long.landing, while excluding pitch, even though pitch was marginally significant in the single-factor analysis and was retained by the AIC-based selection. All included variables are highly significant (p < 0.001), with speed_ground having the strongest effect. The model shows a substantial reduction in deviance (Null Deviance: 622.778 → Residual Deviance: 57.047) and a low AIC (65.047), indicating a strong fit. BIC’s preference for simpler models led to the exclusion of pitch, favoring a more parsimonious model. While this model is efficient, the AIC-based selection (AIC 63.204) may be considered if slightly better predictive performance is preferred over simplicity.
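The difference between the AIC and BIC selections comes down to the penalty per parameter. As a sketch (on simulated data, since the selection code is not echoed here), base R's `step()` performs forward selection with the AIC penalty when `k = 2` and with the BIC penalty when `k = log(n)`. The variable names `x1`, `x2`, `x3` are placeholders.

```r
set.seed(42)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
# Simulated response: strong effect of x1, weak effect of x2, none of x3.
sim$y <- rbinom(n, 1, plogis(-1 + 1.5 * sim$x1 + 0.3 * sim$x2))

# Start from the intercept-only model and let step() add terms.
null_model <- glm(y ~ 1, family = binomial, data = sim)
scope <- list(lower = ~ 1, upper = ~ x1 + x2 + x3)

# k = 2 is the AIC penalty; k = log(n) is the BIC penalty.
aic_model <- step(null_model, scope = scope, direction = "forward",
                  k = 2, trace = 0)
bic_model <- step(null_model, scope = scope, direction = "forward",
                  k = log(n), trace = 0)

aic_terms <- attr(terms(aic_model), "term.labels")
bic_terms <- attr(terms(bic_model), "term.labels")
```

Since `log(n)` exceeds 2 for any sample with more than 7 observations, the BIC run can never keep more single-coefficient terms than the AIC run on the same forward path.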

Step 8

• BIC-selected logistic regression model chosen for its balance between simplicity and accuracy.
• Higher ground speed increases the risk of long landings (Odds Ratio: 94989.07). Certain aircraft types reduce the risk (Odds Ratio: 0.61). Greater height at the threshold increases the risk (Odds Ratio: 2.34).
• Visuals: Graphs will illustrate how speed_ground increases long landing risk, how different aircraft types affect landing outcomes, and the combined impact of speed and aircraft type on long landing probability.
• Pitch was excluded from the model due to lack of statistical significance.

I have selected the BIC model for analyzing long landings as it provides a simpler representation than the other models while retaining predictive accuracy. The pitch variable is not included in the final model since it is not retained by the BIC selection, and no clear trend was observed in the association graph. The analysis indicates that an increase in ground speed significantly raises the risk of a long landing. While air speed also contributes to this risk, it is excluded from the model due to collinearity and missing values, which could impact the AIC-based selection.
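To illustrate how strongly ground speed drives the predicted risk, the sketch below plugs the BIC model coefficients reported in Step 7 into the logistic function. It assumes the inputs are z-scores, as suggested by the standardized data frame `stdz_FAA.l1`; the function name `p_long` is hypothetical.

```r
# Coefficients copied from the BIC model summary in Step 7.
beta <- c(intercept = -19.8638, speed_ground = 17.3600,
          aircraft = -2.5196, height = 2.2609)

# Predicted probability of a long landing for standardized inputs
# (assumption: each argument is a z-score; 0 means "at the sample mean").
p_long <- function(speed_ground, aircraft = 0, height = 0) {
  eta <- beta[["intercept"]] +
    beta[["speed_ground"]] * speed_ground +
    beta[["aircraft"]] * aircraft +
    beta[["height"]] * height
  plogis(eta)  # inverse logit: 1 / (1 + exp(-eta))
}

# Ground speed at which the predicted probability crosses 0.5,
# holding aircraft and height at their means.
z_half <- -beta[["intercept"]] / beta[["speed_ground"]]

probs <- p_long(c(0, 1, 1.5, 2))
```

With the other predictors at their means, the predicted probability stays near zero at average ground speed and crosses 0.5 a little above one standard deviation, which is why the fitted curve is so steep.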

Identifying important factors using the binary data of “risky.landing”.

Step 9

Performing single factor regression analysis for each of the potential risk factors.

##         variable      p-value reg_coef_direction
## vr4 speed_ground 6.898006e-08           positive
## vr5    speed_air 3.728032e-06           positive
## vr1     aircraft 4.560563e-04           negative
## vr7        pitch 1.432961e-01           positive
## vr3      no_pasg 1.536237e-01           negative
## vr2     duration 6.801987e-01           negative
## vr6       height 8.705917e-01           negative

##      aircraft      duration       no_pasg  speed_ground     speed_air 
## -4.969951e-17 -8.871411e-17  2.040555e-16 -3.188666e-16  6.767302e-16 
##        height         pitch 
##  5.488281e-17  4.857226e-16

##     aircraft     duration      no_pasg speed_ground    speed_air       height 
##            1            1            1            1            1            1 
##        pitch 
##            1

##          variable    Estimate   odds_ratio reg_coef_direction     Pr(>|z|)
## vr11 speed_ground 11.50780308 9.948907e+04           positive 6.898006e-08
## vr12    speed_air  8.47447435 4.790904e+03           positive 3.728032e-06
## vr8      aircraft -0.50000891 6.065253e-01           negative 4.560563e-04
## vr14        pitch  0.19539501 1.215791e+00           positive 1.432961e-01
## vr10      no_pasg -0.19012470 8.268560e-01           negative 1.536237e-01
## vr9      duration -0.05569118 9.458312e-01           negative 6.801987e-01
## vr13       height -0.02170864 9.785253e-01           negative 8.705917e-01

## 
## Call:
## glm(formula = risky.landing ~ speed_ground + speed_air + aircraft, 
##     family = binomial, data = stdz_FAA.r1)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   -6.4105     8.6503  -0.741  0.45865   
## speed_ground   0.8296     6.3432   0.131  0.89594   
## speed_air     11.5776     3.9110   2.960  0.00307 **
## aircraft      -2.3057     0.7884  -2.924  0.00345 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 248.180  on 202  degrees of freedom
## Residual deviance:  26.279  on 199  degrees of freedom
##   (628 observations deleted due to missingness)
## AIC: 34.279
## 
## Number of Fisher Scoring iterations: 10

## 
## Call:
## glm(formula = risky.landing ~ ., family = binomial, data = FAA_r_full)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -26.5272     6.4611  -4.106 4.03e-05 ***
## aircraft      -2.0060     0.6236  -3.217   0.0013 ** 
## speed_ground  17.3544     4.2116   4.121 3.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 436.043  on 830  degrees of freedom
## Residual deviance:  40.097  on 828  degrees of freedom
## AIC: 46.097
## 
## Number of Fisher Scoring iterations: 12

## 
## Call:
## glm(formula = risky.landing ~ speed_ground + aircraft + no_pasg, 
##     family = binomial, data = stdz_FAA.r1)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -27.2919     6.7889  -4.020 5.82e-05 ***
## speed_ground  17.7920     4.4139   4.031 5.56e-05 ***
## aircraft      -2.3169     0.7363  -3.147  0.00165 ** 
## no_pasg       -0.6339     0.4294  -1.476  0.13987    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 436.043  on 830  degrees of freedom
## Residual deviance:  37.707  on 827  degrees of freedom
## AIC: 45.707
## 
## Number of Fisher Scoring iterations: 12

## 
## Call:
## glm(formula = risky.landing ~ speed_ground + aircraft, family = binomial, 
##     data = stdz_FAA.r1)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -26.5272     6.4611  -4.106 4.03e-05 ***
## speed_ground  17.3544     4.2116   4.121 3.78e-05 ***
## aircraft      -2.0060     0.6236  -3.217   0.0013 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 436.043  on 830  degrees of freedom
## Residual deviance:  40.097  on 828  degrees of freedom
## AIC: 46.097
## 
## Number of Fisher Scoring iterations: 12

1. Single Factor Logistic Regression Analysis

The single factor regression analysis results indicate that speed_ground, speed_air, and aircraft have very low p-values (<0.001), suggesting strong statistical significance in predicting risky.landing. On the other hand, pitch, no_pasg, duration, and height have high p-values (>0.05), indicating that they are not statistically significant predictors.

Regarding the direction of the regression coefficients, speed_ground, speed_air, and pitch have positive coefficients, meaning they increase the likelihood of risky landing. Conversely, aircraft, no_pasg, duration, and height have negative coefficients, suggesting that they reduce the likelihood of risky landing.

Given these findings, speed_ground, speed_air, and aircraft should be retained for further analysis, while pitch, no_pasg, duration, and height should be excluded due to their lack of statistical significance.

2. Multiple Logistic Regression Models

When multiple logistic regression models are considered, the model including speed_ground, speed_air, and aircraft shows that speed_air and aircraft remain statistically significant (p < 0.01), while speed_ground has a high standard error, indicating potential multicollinearity. The residual deviance is low (26.279), but this model is fitted on only the 203 observations with a recorded speed_air.

A different model incorporating speed_ground, aircraft, and no_pasg retains speed_ground and aircraft as significant predictors, while no_pasg is not significant (p = 0.13987) and adds little predictive value. The AIC of this model is 45.707, only slightly below the 46.097 of the model without no_pasg.

Finally, the model with only speed_ground and aircraft is the preferred model. Both predictors are highly significant, its AIC (46.097) is only marginally higher than that of the model with no_pasg (45.707), and its residual deviance is low (40.097), indicating a strong model fit. Based on these findings, the model with speed_ground and aircraft is the preferred choice for predicting risky.landing.

3. Model Selection Using AIC and BIC

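The AIC-based model selection retains speed_ground, aircraft, and no_pasg, with an AIC of 45.707. The BIC-based selection keeps only speed_ground and aircraft, dropping no_pasg for a simpler model with an AIC of 46.097. The model that includes speed_air has a lower AIC (34.279), but it is fitted on only the 203 observations with a recorded speed_air, so its AIC is not directly comparable with those of the models fitted on all 831 observations.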

Since AIC tends to favor more complex models while BIC prioritizes simplicity, the BIC-selected model (speed_ground and aircraft) is the better choice to avoid overfitting while maintaining strong predictive power.

The final analysis identifies speed_ground and aircraft as the most significant predictors of risky.landing. The BIC-selected model is preferred due to its simplicity and interpretability, while still maintaining good model performance.
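One caution when reading the AIC values in this step: the model that includes speed_air is fitted on 203 rows, while the other models use all 831, and AIC values are only comparable between models fitted to the same observations. The sketch below shows the issue on simulated data, where `x2` plays the role of a collinear predictor with many missing values; all names and values are illustrative.

```r
set.seed(7)
n <- 300
sim <- data.frame(x1 = rnorm(n))
sim$x2 <- sim$x1 + rnorm(n, sd = 0.2)     # highly collinear with x1
sim$y  <- rbinom(n, 1, plogis(-0.5 + 1.2 * sim$x1))
sim$x2[sample(n, 200)] <- NA               # x2 is missing for most rows

# glm() silently drops rows with NA in any model variable.
with_x2    <- glm(y ~ x1 + x2, family = binomial, data = sim)
without_x2 <- glm(y ~ x1,      family = binomial, data = sim)

# The two fits use different numbers of rows, so their AICs are not comparable.
rows_used <- c(with_x2 = nobs(with_x2), without_x2 = nobs(without_x2))

# For a fair comparison, refit the smaller model on the complete cases only.
complete      <- sim[complete.cases(sim), ]
without_x2_cc <- glm(y ~ x1, family = binomial, data = complete)
aic_same_rows <- c(with_x2 = AIC(with_x2), without_x2 = AIC(without_x2_cc))
```

Checking `nobs()` before comparing AIC or BIC is a cheap way to catch this kind of mismatch.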

Step 10

Risk Factors for Risky Landings and Their Influence on Occurrence – FAA Agent Report

I have selected the BIC model for analyzing risky landings, as it offers a less complex representation compared to other models. The variable no_pasg does not hold significant importance in the BIC model when compared to the AIC model and the full model, leading to its exclusion. The analysis indicates that an increase in ground speed significantly raises the risk of a risky landing. While air speed also contributes to this risk, it is excluded from the final model due to collinearity and missing values, which could impact AIC-based selection. The table below presents a comparison of the models built for long landings and risky landings.

Higher landing speed greatly increases the risk of a risky landing, with even small speed increases dramatically raising the odds.
Certain aircraft types are associated with lower risk, meaning some planes handle landings better than others.
A table summarizing the regression results will highlight speed and aircraft type as the key risk factors.
Three visuals will show how risky landings increase with speed, vary by aircraft type, and the predicted probability of risk based on both factors.
Controlling landing speed and optimizing aircraft-specific landing protocols can significantly reduce risky landings and improve safety.
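To make the statement about odds concrete, the speed_ground coefficient can be converted to an odds ratio with exp(). The sketch below uses the coefficient 17.3544 from the risky.landing model summary in this report; the helper odds_ratio and the step sizes 0.05 and 0.10 are illustrative assumptions, and the units are those of the standardized speed_ground variable (whose scaling is not shown in this section), not miles per hour.

```r
# Coefficient copied from the risky.landing logistic model summary.
b_speed <- 17.3544

# Hypothetical helper: multiplicative change in the odds of a risky landing
# when the standardized speed_ground increases by `delta` units.
odds_ratio <- function(beta, delta) exp(beta * delta)

odds_ratio(b_speed, 0.05)  # odds multiplier for a 0.05-unit increase
odds_ratio(b_speed, 0.10)  # odds multiplier for a 0.10-unit increase
```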

Step 11

Compare the two models built for “long.landing” and “risky.landing”

## 
## Call:
## glm(formula = long.landing ~ speed_ground + aircraft + height, 
##     family = binomial, data = stdz_FAA.l1)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -19.8638     3.6399  -5.457 4.84e-08 ***
## speed_ground  17.3600     3.2304   5.374 7.70e-08 ***
## aircraft      -2.5196     0.5566  -4.527 5.99e-06 ***
## height         2.2609     0.5831   3.877 0.000106 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 622.778  on 830  degrees of freedom
## Residual deviance:  57.047  on 827  degrees of freedom
## AIC: 65.047
## 
## Number of Fisher Scoring iterations: 11

## 
## Call:
## glm(formula = risky.landing ~ speed_ground + aircraft, family = binomial, 
##     data = stdz_FAA.r1)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -26.5272     6.4611  -4.106 4.03e-05 ***
## speed_ground  17.3544     4.2116   4.121 3.78e-05 ***
## aircraft      -2.0060     0.6236  -3.217   0.0013 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 436.043  on 830  degrees of freedom
## Residual deviance:  40.097  on 828  degrees of freedom
## AIC: 46.097
## 
## Number of Fisher Scoring iterations: 12

Speed_ground and aircraft are the two most significant factors in both models, showing a consistent influence on landing outcomes.
Height is included as a predictor in the BIC model for long landings but not in the risky landing model, indicating a difference in contributing factors.
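The risky landing model has a lower AIC (46.097) than the long landing model (65.047). However, the two models have different response variables, so their AIC values are not directly comparable and the difference does not by itself show that one model fits better than the other.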

Step 12

The ROC curves for the long landing and risky landing models indicate that the area under the curve (AUC) is higher for the risky landing model. This suggests that the risky landing model separates risky from non-risky landings somewhat better than the long landing model separates long from normal landings. Note that AUC measures how well a model ranks positive cases above negative ones (discrimination), not goodness of fit, and that the two models predict different outcomes.
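For reference, an AUC can be computed in base R from the ranks of the predicted probabilities (the Mann-Whitney formulation), without an extra package. This is a sketch on simulated data; the function auc and the variables x, y, and fit are assumptions for illustration and are not the objects used in this report.

```r
# Simulated stand-in data (assumption): one predictor, binary outcome.
set.seed(2)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 2 * x))
fit <- glm(y ~ x, family = binomial)
p <- fitted(fit)

# AUC = probability that a randomly chosen positive case receives a higher
# predicted probability than a randomly chosen negative case.
auc <- function(prob, label) {
  r  <- rank(prob)          # average ranks handle tied predictions
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc(p, y)
```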

Step 13

## 
## --- Prediction Results ---

## Long Landing Probability (95% CI): 1 - 1

## Risky Landing Probability (95% CI): 0.999 - 1.001

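The prediction results indicate that the probability of a long landing is essentially 100% (95% CI: 1 - 1), meaning the model predicts with near certainty that the given airplane will experience a long landing. Similarly, the probability of a risky landing is approximately 100% (95% CI: 0.999 - 1.001), so under these conditions the aircraft is almost certain to have a risky landing. The upper bound of 1.001 exceeds 1, which is not a valid probability. This typically happens when the interval is formed as estimate ± 1.96 × SE on the probability scale; an interval built on the link scale and then back-transformed stays within [0, 1].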
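An interval such as 0.999 - 1.001 can arise when the standard error is added and subtracted on the probability scale. A common alternative is to build the interval on the link (log-odds) scale and back-transform it with plogis(), which keeps both bounds inside [0, 1]. The sketch below uses simulated data; d, speed, and new_obs are hypothetical names, not the objects behind the output above.

```r
# Simulated stand-in data (assumption).
set.seed(3)
n <- 300
d <- data.frame(speed = rnorm(n))
d$y <- rbinom(n, 1, plogis(-1 + 2 * d$speed))
fit <- glm(y ~ speed, family = binomial, data = d)

# Hypothetical new observation with a high standardized speed.
new_obs <- data.frame(speed = 3)

# Predict on the link scale, where a normal approximation is more reasonable.
pr  <- predict(fit, newdata = new_obs, type = "link", se.fit = TRUE)
eta <- as.numeric(pr$fit)
se  <- as.numeric(pr$se.fit)
z   <- qnorm(0.975)

# Back-transform the bounds to the probability scale.
ci <- plogis(c(lower = eta - z * se, estimate = eta, upper = eta + z * se))
ci
```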

Step 14

## 
## Call:  glm(formula = risky.landing ~ speed_ground + aircraft, family = binomial(link = probit), 
##     data = stdz_FAA.r1)
## 
## Coefficients:
##  (Intercept)  speed_ground      aircraft  
##      -15.259         9.972        -1.176  
## 
## Degrees of Freedom: 830 Total (i.e. Null);  828 Residual
## Null Deviance:       436 
## Residual Deviance: 39.44     AIC: 45.44

## 
## Call:  glm(formula = risky.landing ~ speed_ground + aircraft, family = binomial(link = cloglog), 
##     data = stdz_FAA.r1)
## 
## Coefficients:
##  (Intercept)  speed_ground      aircraft  
##      -18.436        11.655        -1.447  
## 
## Degrees of Freedom: 830 Total (i.e. Null);  828 Residual
## Null Deviance:       436 
## Residual Deviance: 41.44     AIC: 47.44

## 
## Call:  glm(formula = risky.landing ~ speed_ground + aircraft, family = binomial, 
##     data = stdz_FAA.r1)
## 
## Coefficients:
##  (Intercept)  speed_ground      aircraft  
##      -26.527        17.354        -2.006  
## 
## Degrees of Freedom: 830 Total (i.e. Null);  828 Residual
## Null Deviance:       436 
## Residual Deviance: 40.1  AIC: 46.1

The output presents three generalized linear models (GLMs) for predicting risky landings, using the probit, cloglog, and logit (logistic) links. All three models include speed_ground and aircraft as predictors. The coefficients for speed_ground and aircraft have the same sign across all models, meaning that the direction of their influence on the probability of a risky landing is consistent. The probit and cloglog models have similar coefficient magnitudes (9.972 and 11.655 for speed_ground), while the logistic model has noticeably larger values (17.354). This is expected: each link function puts the coefficients on a different scale, so magnitudes should not be compared directly across links.

Additionally, the Akaike Information Criterion (AIC) values for all three models are relatively close - 45.44 for Probit, 47.44 for Cloglog, and 46.1 for the logistic model. The residual deviances are also similar, suggesting that all three models have comparable goodness-of-fit.

Since all three models retain speed_ground and aircraft with consistent signs, and both are significant in the logistic model summary, these variables should be given priority in further analysis and risk mitigation strategies. The close AIC values indicate that no model overwhelmingly outperforms the others, but the probit model has the lowest AIC (45.44), suggesting a slightly better fit.
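A link-function comparison like the one above can be produced by fitting the same formula with each link and collecting AIC, deviance, and coefficients in one table. This sketch runs on simulated data; the data frame d, its columns, and the simulation coefficients are assumptions, so the numbers will not match the FAA output.

```r
# Simulated stand-in data (assumption).
set.seed(4)
n <- 500
d <- data.frame(speed = rnorm(n), aircraft = rbinom(n, 1, 0.5))
d$y <- rbinom(n, 1, plogis(-1 + 2 * d$speed - d$aircraft))

# Fit the same formula under three binomial link functions.
links <- c("logit", "probit", "cloglog")
fits <- lapply(links, function(l) {
  glm(y ~ speed + aircraft, family = binomial(link = l), data = d)
})
names(fits) <- links

# Collect the quantities compared in the text.
comparison <- data.frame(
  link       = links,
  AIC        = sapply(fits, AIC),
  deviance   = sapply(fits, deviance),
  speed_coef = sapply(fits, function(f) coef(f)[["speed"]])
)
comparison
```

The logit coefficients are typically larger in magnitude than the probit ones because the standard logistic distribution has a larger variance than the standard normal.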

Step 15

The ROC curves for the three models (probit, cloglog, and Model R, the logistic model) illustrate their classification performance. All three curves rise steeply, reaching high sensitivity at a low false positive rate (1 - Specificity), which indicates strong discriminative ability. Among them, the cloglog model appears to have a slightly higher area under the curve (AUC) than the other two. A higher AUC suggests that the model is better at distinguishing between the positive and negative classes.

Since the cloglog model has the highest AUC, it is chosen as the most suitable model for making predictions. This suggests that the hazard model (cloglog) provides marginally better classification performance. The advantage is small, however, and the probit model has the lowest AIC in Step 14, so the choice is not clear-cut. Additional evaluation using other metrics such as precision, recall, and the F1-score, together with further model diagnostics, would be beneficial before finalizing the decision.

Step 16

##  Top 5 Risky Landings Based on Predicted Probabilities 
##

## 
##  Model 1: Risky Landings

## 362 307  64 387 408 
##   1   1   1   1   1

## 
##  Model 2: Risky Landings

##  56  64 134 176 179 
##   1   1   1   1   1

## 
##  Model 3: Risky Landings

## 19 29 30 56 64 
##  1  1  1  1  1

The output displays the top five most risky landings based on predicted probabilities for three different models. Each model (Model 1, Model 2, and Model 3) identifies a different set of high-risk observations, but there are notable overlaps. Observation 64 appears in the top five of all three models, indicating that it is consistently classified as a high-risk landing across the modeling approaches. Observation 56 appears in both the probit and hazard (cloglog) models, suggesting agreement between these two models regarding its risk level. Note that every listed probability prints as 1, so many observations are tied after rounding and the exact membership of each top five depends on very small differences in the fitted values.

The consistency of the 64th observation across all models suggests that this particular case is highly risky and should be given priority in further analysis. Similarly, the repeated appearance of the 56th observation in two models reinforces its classification as a potentially high-risk event. Moving forward, these high-risk observations should be further investigated to understand their underlying causes. Additional validation using real-world data or expert domain knowledge may help confirm their risk level. If these observations represent actual risky landing scenarios, targeted interventions or safety measures should be considered to mitigate potential risks.
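One way to obtain and compare such top-five lists is to sort the fitted probabilities of each model and intersect the row labels. The sketch below uses simulated data with a single predictor; d, speed, and the list fits are assumptions for illustration only.

```r
# Simulated stand-in data (assumption).
set.seed(5)
n <- 200
d <- data.frame(speed = rnorm(n))
d$y <- rbinom(n, 1, plogis(-1 + 2 * d$speed))

fits <- list(
  logit   = glm(y ~ speed, family = binomial(link = "logit"),   data = d),
  probit  = glm(y ~ speed, family = binomial(link = "probit"),  data = d),
  cloglog = glm(y ~ speed, family = binomial(link = "cloglog"), data = d)
)

# Five largest fitted probabilities per model; names are the row labels.
top5 <- lapply(fits, function(f) head(sort(fitted(f), decreasing = TRUE), 5))
top5

# Row labels that appear in every model's top five.
common <- Reduce(intersect, lapply(top5, names))
common
```

Printing the unrounded values (for example with print(top5, digits = 15)) helps show whether the leading observations are genuinely distinct or tied at 1.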

Step 17

##         1         1 
## 0.9999976 1.0000020

## 1 1 
## 1 1

The output presents predictions from the probit and cloglog models for a new observation with specific characteristics. The printed values (0.9999976 and 1.0000020) are both essentially 1, indicating a very high likelihood of a risky landing. A value above 1 cannot be a probability, so 1.0000020 is most likely an interval bound computed on the probability scale or a numerical artifact, and it should be read as 1. The final classification (binary outcome) for both models is 1, reinforcing the conclusion that the given input conditions strongly indicate a risky landing.

Both models consistently predict that the given conditions will result in a risky landing with near certainty. This suggests that the model inputs, ground speed and aircraft type, strongly drive the risk classification. Given this result, it would be important to analyze whether these conditions align with historical risky landings and whether preventive measures could be implemented. Further investigation could include evaluating how changes in the input parameters (e.g., reducing speed) affect the probability of a risky landing. If applicable, this could inform decision-making for flight operations, pilot training, or automated landing assistance systems to minimize potential risks.
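The suggested follow-up, checking how the predicted probability changes as speed is reduced, can be sketched by predicting over a grid of inputs. The example uses simulated data; d, grid, and the grid values are assumptions, not the FAA data or the new observation used above.

```r
# Simulated stand-in data (assumption).
set.seed(6)
n <- 400
d <- data.frame(speed = rnorm(n), aircraft = rbinom(n, 1, 0.5))
d$y <- rbinom(n, 1, plogis(-1 + 2 * d$speed - d$aircraft))

fit_probit  <- glm(y ~ speed + aircraft,
                   family = binomial(link = "probit"), data = d)
fit_cloglog <- glm(y ~ speed + aircraft,
                   family = binomial(link = "cloglog"), data = d)

# Grid of hypothetical inputs: a range of speeds for each aircraft code.
grid <- expand.grid(speed = seq(-2, 3, by = 0.5), aircraft = c(0, 1))

# Predicted probabilities on the response scale for both links.
grid$p_probit  <- predict(fit_probit,  newdata = grid, type = "response")
grid$p_cloglog <- predict(fit_cloglog, newdata = grid, type = "response")
grid
```

Reading down the speed column for a fixed aircraft code shows how quickly the predicted risk falls as speed decreases under each link.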

FAA Project Part 2a, Part 2b and Part 2c

Silpa Prakash Rao (M16141545)