Packages Used
Variable Descriptions
- Aircraft: The make of an aircraft (Boeing or Airbus).
- Duration (in minutes): Flight duration between taking off and landing. The duration of a normal flight should always be greater than 40 minutes.
- No_pasg: The number of passengers in a flight.
- Speed_ground (in miles per hour): The ground speed of an aircraft when passing over the threshold of the runway. If its value is less than 30 MPH or greater than 140 MPH, the landing is considered abnormal.
- Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of the runway. If its value is less than 30 MPH or greater than 140 MPH, the landing is considered abnormal.
- Height (in meters): The height of an aircraft when it is passing over the threshold of the runway. The landing aircraft is required to be at least 6 meters high at the threshold of the runway.
- Pitch (in degrees): Pitch angle of an aircraft when it is passing over the threshold of the runway.
- Distance (in feet): The landing distance of an aircraft. More specifically, it refers to the distance between the threshold of the runway and the point where the aircraft can be fully stopped. The length of the airport runway is typically less than 6000 feet.
Steps 1-9 of Part-1 of the Project:
Step 1: Reading the datasets
Read the two files ‘FAA-1.xls’ (800 flights) and ‘FAA-2.xls’ into your R system. Please search “Read Excel files from R” in Google in case you do not know how to do that.
Step 2: Let us look at the structure and dimensions of the datasets and check whether the data types and column names are appropriate.
## tibble [800 × 8] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:800] "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num [1:800] 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num [1:800] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:800] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:800] 109 103 NA NA NA ...
## $ height : num [1:800] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:800] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:800] 3370 2988 1145 1664 1050 ...
We can observe that there are 800 observations and 8 variables in the faa1 dataset.
## tibble [150 × 7] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:150] "boeing" "boeing" "boeing" "boeing" ...
## $ no_pasg : num [1:150] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:150] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:150] 109 103 NA NA NA ...
## $ height : num [1:150] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:150] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:150] 3370 2988 1145 1664 1050 ...
We can observe that there are 150 observations and 7 variables in the faa2 dataset. It has no duration column, so we created an empty duration column in order to merge the datasets.
Step 3: Merging the datasets and checking for duplicate records.
## [1] 950 8
## [1] 100
We can observe that the merged dataset has 950 rows and 8 columns, of which 100 are duplicate records. Removing them leaves 850 observations.
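The merge and duplicate check can be sketched as follows. This is a minimal illustration on toy data, not the code used in the report: the data frames below are made-up stand-ins for FAA-1.xls and FAA-2.xls, and matching duplicates on every column except duration is an assumption (it is a natural choice because duration is missing for every row that comes from faa2).

```r
# Toy stand-ins for the two files (hypothetical values, reduced columns).
faa1 <- data.frame(
  aircraft     = c("boeing", "boeing", "airbus", "airbus"),
  duration     = c(98.5, 125.7, 112.0, 196.8),
  no_pasg      = c(53, 69, 61, 56),
  speed_ground = c(107.9, 101.7, 71.1, 85.8),
  distance     = c(3370, 2988, 1145, 1664)
)
# faa2 has no duration column; its first two rows repeat flights from faa1.
faa2 <- data.frame(
  aircraft     = c("boeing", "boeing", "airbus"),
  no_pasg      = c(53, 69, 58),
  speed_ground = c(107.9, 101.7, 90.2),
  distance     = c(3370, 2988, 1800)
)

# Add an empty duration column so both data frames share the same columns,
# then stack them in the column order of faa1.
faa2$duration <- NA_real_
merged <- rbind(faa1, faa2[, names(faa1)])

# Assumption: two rows are duplicates when they agree on all columns
# other than duration.
key_cols <- setdiff(names(merged), "duration")
is_dup   <- duplicated(merged[, key_cols])
n_dup    <- sum(is_dup)

# Keep the first occurrence of each flight.
faa <- merged[!is_dup, ]

dim(merged)
n_dup
dim(faa)
```

Keeping the first occurrence preserves the faa1 copy of each repeated flight, which is the copy that carries a duration value.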
Step 4: Let us look at the structure of the merged dataset, dimension and summary statistics for each variable.
## tibble [850 × 8] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:850] "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num [1:850] 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num [1:850] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:850] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:850] 109 103 NA NA NA ...
## $ height : num [1:850] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:850] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:850] 3370 2988 1145 1664 1050 ...
We can observe that there are 850 observations and 8 variables in the merged dataset.
Summary statistics for each variable:
## aircraft duration no_pasg speed_ground
## Length:850 Min. : 14.76 Min. :29.0 Min. : 27.74
## Class :character 1st Qu.:119.49 1st Qu.:55.0 1st Qu.: 65.90
## Mode :character Median :153.95 Median :60.0 Median : 79.64
## Mean :154.01 Mean :60.1 Mean : 79.45
## 3rd Qu.:188.91 3rd Qu.:65.0 3rd Qu.: 92.06
## Max. :305.62 Max. :87.0 Max. :141.22
## NA's :50
## speed_air height pitch distance
## Min. : 90.00 Min. :-3.546 Min. :2.284 Min. : 34.08
## 1st Qu.: 96.25 1st Qu.:23.314 1st Qu.:3.642 1st Qu.: 883.79
## Median :101.15 Median :30.093 Median :4.008 Median :1258.09
## Mean :103.80 Mean :30.144 Mean :4.009 Mean :1526.02
## 3rd Qu.:109.40 3rd Qu.:36.993 3rd Qu.:4.377 3rd Qu.:1936.95
## Max. :141.72 Max. :59.946 Max. :5.927 Max. :6533.05
## NA's :642
## [1] 850 8
After removing the duplicates, no duplicate observations remain in the merged dataset.
Step 5:
- A few columns contain abnormal observations, which are removed later in the analysis.
- There are missing values in two columns: duration (50) and speed_air (642).
- The data is balanced, as the proportions of the two aircraft makes are almost the same.
Step 6:
Checking for abnormal values in the variables of the dataset and removing them.
## [1] 850 8
We have removed 19 observations (850 before filtering, 831 after) by filtering out the observations with abnormal values.
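One way to express this filter is shown below, as a sketch on made-up rows rather than the report's own code. The thresholds for duration, speed and height come from the variable descriptions. Treating 6000 feet as a hard cutoff for distance and keeping rows whose value is missing are assumptions, although both are consistent with the summary that follows (NAs remain in duration and speed_air, and the maximum distance drops below 6000).

```r
# Hypothetical rows covering the different kinds of abnormal values.
faa <- data.frame(
  duration     = c(98.5, 30.0, NA, 150.2, 120.0, 200.0),
  speed_ground = c(107.9, 80.0, 71.1, 145.0, 60.0, 90.0),
  speed_air    = c(109.0, NA, NA, 146.0, NA, 95.0),
  height       = c(27.4, 30.0, 18.6, 25.0, 4.0, 35.0),
  distance     = c(3370, 1500, 1145, 6200, 900, 1700)
)

# TRUE when the value is missing or satisfies the rule, so a missing
# value never causes a row to be dropped.
ok <- function(x, rule) is.na(x) | rule(x)

keep <- ok(faa$duration,     function(x) x > 40) &
        ok(faa$speed_ground, function(x) x >= 30 & x <= 140) &
        ok(faa$speed_air,    function(x) x >= 30 & x <= 140) &
        ok(faa$height,       function(x) x >= 6) &
        ok(faa$distance,     function(x) x < 6000)  # assumed cutoff

faa_clean <- faa[keep, ]
n_removed <- nrow(faa) - nrow(faa_clean)

dim(faa_clean)
n_removed
```

A plain comparison such as `faa$duration > 40` would return NA for missing values and, used inside `subset()` or `dplyr::filter()`, would silently drop those rows; the `ok()` helper makes the decision to keep them explicit.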
Step 7:
Let us again look at the structure of the merged dataset, dimension and summary statistics for each variable.
## tibble [831 × 8] (S3: tbl_df/tbl/data.frame)
## $ aircraft : num [1:831] 0 0 0 0 0 0 0 0 0 0 ...
## $ duration : num [1:831] 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num [1:831] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:831] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:831] 109 103 NA NA NA ...
## $ height : num [1:831] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:831] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:831] 3370 2988 1145 1664 1050 ...
## aircraft duration no_pasg speed_ground
## Min. :0.0000 Min. : 41.95 Min. :29.00 Min. : 33.57
## 1st Qu.:0.0000 1st Qu.:119.63 1st Qu.:55.00 1st Qu.: 66.20
## Median :1.0000 Median :154.28 Median :60.00 Median : 79.79
## Mean :0.5343 Mean :154.78 Mean :60.06 Mean : 79.54
## 3rd Qu.:1.0000 3rd Qu.:189.66 3rd Qu.:65.00 3rd Qu.: 91.91
## Max. :1.0000 Max. :305.62 Max. :87.00 Max. :132.78
## NA's :50
## speed_air height pitch distance
## Min. : 90.00 Min. : 6.228 Min. :2.284 Min. : 41.72
## 1st Qu.: 96.23 1st Qu.:23.530 1st Qu.:3.640 1st Qu.: 893.28
## Median :101.12 Median :30.167 Median :4.001 Median :1262.15
## Mean :103.48 Mean :30.458 Mean :4.005 Mean :1522.48
## 3rd Qu.:109.36 3rd Qu.:37.004 3rd Qu.:4.370 3rd Qu.:1936.63
## Max. :132.91 Max. :59.946 Max. :5.927 Max. :5381.96
## NA's :628
Step 8: Let us plot histograms for all the numerical variables.
Step 9:
- We can observe from the histograms that the columns duration, no_pasg, speed_ground, height and pitch are almost symmetric.
- The speed_air column is right skewed.
- The distance column is also right skewed.
Part 2 of the project - Practice of modeling a binary response using logistic regression.
Create binary responses
Step 1
## # A tibble: 6 × 9
## aircraft duration no_pasg speed_ground speed_air height pitch long.landing
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 98.5 53 108. 109. 27.4 4.04 1
## 2 0 126. 69 102. 103. 27.8 4.12 1
## 3 0 112. 61 71.1 NA 18.6 4.43 0
## 4 0 197. 56 85.8 NA 30.7 3.88 0
## 5 0 90.1 70 59.9 NA 32.4 4.03 0
## 6 0 138. 55 75.0 NA 41.2 4.20 0
## # ℹ 1 more variable: risky.landing <dbl>
We have created the two binary variables, long.landing and risky.landing and discarded the distance variable from the cleaned dataset.
The binary variables long.landing and risky.landing provide a simplified representation of landing safety based on the distance criteria. These variables will be useful for predictive analysis and classification tasks. However, the dataset contains missing values in the speed_air column, which might impact the analysis if not handled appropriately. Depending on the requirements, these missing values could be addressed through imputation or by excluding the affected rows or columns.
The removal of the continuous distance variable ensures that only the binary outcomes related to landing safety are analyzed. This is in line with the project requirements and simplifies the dataset for focused analysis on landing conditions.
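The construction of the two binary responses can be sketched as follows. The 2500-foot cutoff for long.landing is the threshold cited in the Step 2 discussion; the cutoff for risky.landing is not stated in this report, so the 3000 feet used here is only an assumption for illustration. The four distances are taken from the first rows of the cleaned data shown earlier.

```r
# First four flights of the cleaned data (distance in feet).
faa_clean <- data.frame(
  speed_ground = c(107.9, 101.7, 71.1, 85.8),
  distance     = c(3370, 2988, 1145, 1664)
)

long_cutoff  <- 2500  # threshold cited in the report
risky_cutoff <- 3000  # assumption: not stated in the report

# 1 when the landing distance exceeds the cutoff, 0 otherwise.
faa_clean$long.landing  <- as.numeric(faa_clean$distance > long_cutoff)
faa_clean$risky.landing <- as.numeric(faa_clean$distance > risky_cutoff)

# Discard the continuous response once the binary ones exist.
faa_clean$distance <- NULL

faa_clean
```

With these cutoffs the first two flights are long landings and only the first is also a risky landing, which matches the long.landing column printed above.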
Identifying important factors using the binary data of “long.landing”.
Step 2
The histogram shows the distribution of the binary variable long.landing. Most flights fall into the category long.landing = 0, meaning their landing distance does not exceed the threshold of 2500 feet. A much smaller number of flights are classified as long.landing = 1, representing long landings. This suggests a class imbalance in the data, which could impact subsequent analysis or predictive modeling. The imbalance highlights the need for further investigation into the features (e.g., speed_ground, height, pitch) that correlate with long landings. For predictive modeling, methods such as oversampling, undersampling, or adjusting class weights may be required to address the imbalance. A supplementary visualization, such as a pie chart, could also provide a clearer view of the proportions in the dataset.
Step 3
## [1] "aircraft" "duration" "no_pasg" "speed_ground"
## [5] "speed_air" "height" "pitch" "long.landing"
## [9] "risky.landing"
## variable p-value reg_coef_direction
## lv4 speed_ground 3.935339e-14 positive
## lv5 speed_air 4.334124e-11 positive
## lv1 aircraft 8.398591e-05 negative
## lv7 pitch 4.664982e-02 positive
## lv6 height 4.218576e-01 positive
## lv3 no_pasg 6.058565e-01 negative
## lv2 duration 6.305122e-01 negative
## aircraft duration no_pasg speed_ground speed_air
## -4.969951e-17 -8.871411e-17 2.040555e-16 -3.188666e-16 6.767302e-16
## height pitch
## 5.488281e-17 4.857226e-16
## aircraft duration no_pasg speed_ground speed_air height
## 1 1 1 1 1 1
## pitch
## 1
## variable Estimate odds_ratio reg_coef_direction Pr(>|z|)
## lv11 speed_ground 8.84971670 6972.4134377 positive 3.935339e-14
## lv12 speed_air 4.98810682 146.6585091 positive 4.334124e-11
## lv8 aircraft -0.43130192 0.6496627 negative 8.398591e-05
## lv14 pitch 0.21090555 1.2347957 positive 4.664982e-02
## lv13 height 0.08438418 1.0880468 positive 4.218576e-01
## lv10 no_pasg -0.05436003 0.9470911 negative 6.058565e-01
## lv9 duration -0.05175818 0.9495585 negative 6.305122e-01
From the above table, we can observe that the three most significant factors are speed_ground, speed_air and aircraft by p-value.
The single-factor logistic regressions relate long.landing (a binary response) to each potential predictor in turn: aircraft, duration, no_pasg, speed_ground, speed_air, height, and pitch. The tables report the regression coefficient, the odds ratio, the direction of the coefficient (positive or negative), and the p-value for each predictor, ranked by p-value so that the most statistically significant predictors appear first. Because the predictors were standardized (the column means are numerically zero and the standard deviations are 1 in the output above), each odds ratio describes the change in the odds of a long landing for a one-standard-deviation change in that predictor, and the sign of the coefficient indicates whether the predictor increases or decreases the probability of a long landing. The most significant predictors should be prioritized for further analysis, such as multivariable logistic regression or an exploration of interaction effects. Practical steps to mitigate the risk of long landings can focus on modifiable factors such as speed_ground and height.
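The kind of table shown above can be produced with a loop over single-predictor models. The following is a minimal sketch on simulated data, not the report's code: the data frame `dat`, the helper `fit_one` and the simulated effect sizes are all hypothetical.

```r
set.seed(1)
n <- 300
# Simulated stand-in for the cleaned FAA data (hypothetical values).
dat <- data.frame(
  speed_ground = rnorm(n, 80, 19),
  height       = rnorm(n, 30, 10),
  pitch        = rnorm(n, 4, 0.5)
)
# In this toy data only speed_ground drives the response.
dat$long.landing <- rbinom(n, 1, plogis(0.1 * (dat$speed_ground - 95)))

predictors <- setdiff(names(dat), "long.landing")

# Standardize each predictor to mean 0 and standard deviation 1 so the
# coefficients are comparable across variables.
dat[predictors] <- lapply(dat[predictors], function(x) as.numeric(scale(x)))

# Fit long.landing ~ v for one predictor and return one table row.
fit_one <- function(v) {
  fit <- glm(reformulate(v, response = "long.landing"),
             family = binomial, data = dat)
  b <- unname(coef(fit)[v])
  p <- unname(coef(summary(fit))[v, "Pr(>|z|)"])
  data.frame(variable   = v,
             estimate   = b,
             odds_ratio = exp(b),
             direction  = if (b > 0) "positive" else "negative",
             p_value    = p,
             stringsAsFactors = FALSE)
}

tab <- do.call(rbind, lapply(predictors, fit_one))
tab <- tab[order(tab$p_value), ]   # most significant first
rownames(tab) <- NULL
tab
```

Ranking by p-value on standardized predictors mirrors the layout of the tables above; the odds ratio is simply the exponential of the standardized coefficient.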
Step 4
##
## Call:
## glm(formula = long.landing ~ speed_ground + speed_air + aircraft,
## family = binomial, data = stdz_FAA.l1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.4480 5.2625 1.035 0.30056
## speed_ground -3.0381 4.1011 -0.741 0.45882
## speed_air 8.6033 2.6413 3.257 0.00113 **
## aircraft -1.8993 0.4295 -4.422 9.78e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 281.373 on 202 degrees of freedom
## Residual deviance: 69.013 on 199 degrees of freedom
## (628 observations deleted due to missingness)
## AIC: 77.013
##
## Number of Fisher Scoring iterations: 8
The visualizations highlight the relationship between long.landing and significant predictors, including speed_ground, speed_air, aircraft, and pitch. Because speed_air is missing for 628 flights, this model is fitted on the remaining 203 flights only. From the regression model, the coefficient for speed_air is 8.6033 (p-value = 0.0011), indicating a strong positive association between higher air speed and the likelihood of long landings (long.landing = 1). However, speed_ground has a coefficient of -3.0381 (p-value = 0.4588), so it has no statistically significant relationship with long landings in this model; its sign also flips relative to the single-factor fit in Step 3, which is consistent with collinearity between speed_ground and speed_air. The aircraft variable has a coefficient of -1.8993 (p-value = 9.78e-06), reflecting a significant negative association: the aircraft make coded as 1 has lower odds of a long landing than the make coded as 0. The intercept is 5.4480 (p-value = 0.3006), representing the baseline log odds of a long landing when all predictors are zero. These findings emphasize that managing speed_air and understanding the impact of aircraft type are critical for reducing long landings. Additional exploration of interaction effects between predictors, such as speed_ground and speed_air, may provide deeper insights into their combined influence. The model's residual deviance of 69.013 and AIC of 77.013 indicate a reasonable fit, but further validation on unseen data is necessary to confirm generalizability.
Step 5
##
## Call:
## glm(formula = long.landing ~ speed_ground + aircraft, family = binomial,
## data = FAA_l_full)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.7036 1.7714 -7.172 7.42e-13 ***
## speed_ground 10.9667 1.5815 6.934 4.08e-12 ***
## aircraft -1.6156 0.3553 -4.547 5.45e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 622.778 on 830 degrees of freedom
## Residual deviance: 84.665 on 828 degrees of freedom
## AIC: 90.665
##
## Number of Fisher Scoring iterations: 10
Based on the results from Steps 3-4 and the collinearity analysis in Step 16 (Part 1), speed_ground and aircraft are included in the full model. However, speed_air is excluded due to its high collinearity with speed_ground and the presence of numerous missing values, which could negatively impact both AIC and BIC. Although speed_air is statistically significant, these factors make it unsuitable for inclusion in the final model.
The full logistic regression model, with speed_ground and aircraft as predictors, highlights their significant role in predicting long.landing. Higher speed_ground is positively associated with an increased likelihood of long landings, as indicated by a coefficient of 10.9667 (p-value = 4.08e-12). This finding emphasizes that controlling ground speed is crucial to reducing risk. Conversely, the aircraft variable has a negative coefficient of -1.6156 (p-value = 5.45e-06), suggesting that certain aircraft characteristics might reduce the probability of long landings, potentially due to design or operational differences. The intercept is -12.7036 (p-value = 7.42e-13), representing the baseline log odds of a long landing when all predictors are at zero. The model achieves a residual deviance of 84.665 on 828 degrees of freedom and an AIC of 90.665, indicating a good model fit. Both predictors are statistically significant with small p-values, suggesting a strong and reliable relationship with the response variable. This model addresses the collinearity concerns and provides a solid foundation for predictive analysis. Next steps include validating the model on unseen data to assess its generalizability and exploring potential interaction effects between the predictors for further refinement.
Step 6
##
## Call:
## glm(formula = long.landing ~ speed_ground + aircraft + height +
## pitch, family = binomial, data = stdz_FAA.l1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -22.0249 4.3431 -5.071 3.95e-07 ***
## speed_ground 19.1602 3.8014 5.040 4.65e-07 ***
## aircraft -2.5627 0.5894 -4.348 1.37e-05 ***
## height 2.5240 0.6713 3.760 0.00017 ***
## pitch 0.8096 0.4429 1.828 0.06755 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 622.778 on 830 degrees of freedom
## Residual deviance: 53.204 on 826 degrees of freedom
## AIC: 63.204
##
## Number of Fisher Scoring iterations: 12
The results of forward variable selection using AIC indicate that speed_ground, aircraft, and height are significant predictors, with pitch also included in the model. In the single-factor table in Step 3, pitch (p = 0.047) appeared more significant than height (p = 0.42). In this multivariable model the order is reversed: height is highly significant (p = 0.00017), while pitch is only marginally significant (p = 0.068).
The logistic regression model fitted with forward selection includes the predictors speed_ground, aircraft, height, and pitch to predict long.landing. The model summary shows the following:
- speed_ground: The coefficient is 19.1602 (p-value = 4.65e-07), indicating a strong and statistically significant positive relationship between ground speed and the likelihood of a long landing.
- aircraft: The coefficient is -2.5627 (p-value = 1.37e-05), suggesting that certain aircraft types significantly reduce the probability of a long landing.
- height: The coefficient is 2.5240 (p-value = 0.00017), showing a significant positive association between height and the likelihood of a long landing.
- pitch: The coefficient is 0.8096 (p-value = 0.06755), indicating a weaker, marginally significant relationship between pitch and long landings.
- The intercept is -22.0249 (p-value = 3.95e-07), representing the baseline log odds when all predictors are zero.
The residual deviance is 53.204 on 826 degrees of freedom, and the AIC is 63.204, suggesting a good model fit.
Step 7
##
## Call:
## glm(formula = long.landing ~ speed_ground + aircraft + height,
## family = binomial, data = stdz_FAA.l1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -19.8638 3.6399 -5.457 4.84e-08 ***
## speed_ground 17.3600 3.2304 5.374 7.70e-08 ***
## aircraft -2.5196 0.5566 -4.527 5.99e-06 ***
## height 2.2609 0.5831 3.877 0.000106 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 622.778 on 830 degrees of freedom
## Residual deviance: 57.047 on 827 degrees of freedom
## AIC: 65.047
##
## Number of Fisher Scoring iterations: 11
Based on the results of forward variable selection using BIC, the significant factors included in the model are speed_ground, aircraft, and height. However, when comparing this to the table in Step 3, we observe that pitch appears more significant than height based on its p-value. Additionally, compared to the AIC model results, the BIC model excludes the pitch variable. This is because BIC tends to favor simpler models with fewer variables compared to AIC, which allows for more complexity.
The logistic regression model selected using BIC includes speed_ground, aircraft, and height as significant predictors of long.landing, while excluding pitch, despite its significance in earlier steps. All included variables are highly significant (p < 0.001), with speed_ground having the strongest effect. The model shows a substantial reduction in deviance (Null Deviance: 622.778 → Residual Deviance: 57.047) and a low AIC (65.047), indicating strong predictive performance. BIC’s preference for simpler models led to the exclusion of pitch, favoring a more parsimonious model. While this model is efficient, an AIC-based selection may be considered if slightly better predictive performance is preferred over simplicity.
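The difference between the AIC and BIC selections comes down to the penalty per parameter passed to `step()`: 2 for AIC (the default) and log(n) for BIC. The sketch below runs both forward selections on simulated data; the data frame `dat` and its effect sizes are hypothetical, and this is not the report's own code.

```r
set.seed(2)
n <- 400
# Simulated standardized predictors (hypothetical values).
dat <- data.frame(
  speed_ground = rnorm(n),
  height       = rnorm(n),
  pitch        = rnorm(n),
  no_pasg      = rnorm(n)
)
dat$long.landing <- rbinom(
  n, 1, plogis(-1 + 2 * dat$speed_ground + 0.8 * dat$height)
)

# Forward selection starts from the intercept-only model.
null_model <- glm(long.landing ~ 1, family = binomial, data = dat)
scope <- list(lower = ~ 1,
              upper = ~ speed_ground + height + pitch + no_pasg)

# AIC: penalty k = 2 per parameter (the default of step()).
fwd_aic <- step(null_model, scope = scope, direction = "forward", trace = 0)

# BIC: same search, but with penalty k = log(n) per parameter.
fwd_bic <- step(null_model, scope = scope, direction = "forward",
                k = log(n), trace = 0)

aic_terms <- attr(terms(fwd_aic), "term.labels")
bic_terms <- attr(terms(fwd_bic), "term.labels")
aic_terms
bic_terms
```

In a forward search where every candidate adds one parameter, both criteria rank the candidates identically at each step, so they follow the same path and BIC stops no later than AIC whenever log(n) exceeds 2. That is why the BIC model here is the AIC model without pitch.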
Step 8
- The BIC-selected logistic regression model is chosen for its balance between simplicity and accuracy.
- Higher ground speed increases the risk of long landings (Odds Ratio: 94989.07).
- Certain aircraft types reduce the risk (Odds Ratio: 0.61).
- Higher altitude at the threshold increases the risk (Odds Ratio: 2.34).
- Visuals: Graphs will illustrate how speed_ground increases long landing risk, how different aircraft types affect landing outcomes, and the combined impact of speed and aircraft type on long landing probability.
- Pitch was excluded from the model due to lack of statistical significance.
I have selected the BIC model for analyzing long landings as it provides a simpler representation than the other models. The pitch variable is not included in the final model since it was not retained by the BIC selection, and no clear trend was observed in the association graph. The analysis indicates that an increase in ground speed significantly raises the risk of a long landing. While air speed also contributes to this risk, it is excluded from the model due to collinearity and missing values, which could impact the AIC-based selection.
Identifying important factors using the binary data of “risky.landing”.
Step 9
Performing single factor regression analysis for each of the potential risk factors.
## variable p-value reg_coef_direction
## vr4 speed_ground 6.898006e-08 positive
## vr5 speed_air 3.728032e-06 positive
## vr1 aircraft 4.560563e-04 negative
## vr7 pitch 1.432961e-01 positive
## vr3 no_pasg 1.536237e-01 negative
## vr2 duration 6.801987e-01 negative
## vr6 height 8.705917e-01 negative
## aircraft duration no_pasg speed_ground speed_air
## -4.969951e-17 -8.871411e-17 2.040555e-16 -3.188666e-16 6.767302e-16
## height pitch
## 5.488281e-17 4.857226e-16
## aircraft duration no_pasg speed_ground speed_air height
## 1 1 1 1 1 1
## pitch
## 1
## variable Estimate odds_ratio reg_coef_direction Pr(>|z|)
## vr11 speed_ground 11.50780308 9.948907e+04 positive 6.898006e-08
## vr12 speed_air 8.47447435 4.790904e+03 positive 3.728032e-06
## vr8 aircraft -0.50000891 6.065253e-01 negative 4.560563e-04
## vr14 pitch 0.19539501 1.215791e+00 positive 1.432961e-01
## vr10 no_pasg -0.19012470 8.268560e-01 negative 1.536237e-01
## vr9 duration -0.05569118 9.458312e-01 negative 6.801987e-01
## vr13 height -0.02170864 9.785253e-01 negative 8.705917e-01
##
## Call:
## glm(formula = risky.landing ~ speed_ground + speed_air + aircraft,
## family = binomial, data = stdz_FAA.r1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.4105 8.6503 -0.741 0.45865
## speed_ground 0.8296 6.3432 0.131 0.89594
## speed_air 11.5776 3.9110 2.960 0.00307 **
## aircraft -2.3057 0.7884 -2.924 0.00345 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 248.180 on 202 degrees of freedom
## Residual deviance: 26.279 on 199 degrees of freedom
## (628 observations deleted due to missingness)
## AIC: 34.279
##
## Number of Fisher Scoring iterations: 10
##
## Call:
## glm(formula = risky.landing ~ ., family = binomial, data = FAA_r_full)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -26.5272 6.4611 -4.106 4.03e-05 ***
## aircraft -2.0060 0.6236 -3.217 0.0013 **
## speed_ground 17.3544 4.2116 4.121 3.78e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 436.043 on 830 degrees of freedom
## Residual deviance: 40.097 on 828 degrees of freedom
## AIC: 46.097
##
## Number of Fisher Scoring iterations: 12
##
## Call:
## glm(formula = risky.landing ~ speed_ground + aircraft + no_pasg,
## family = binomial, data = stdz_FAA.r1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -27.2919 6.7889 -4.020 5.82e-05 ***
## speed_ground 17.7920 4.4139 4.031 5.56e-05 ***
## aircraft -2.3169 0.7363 -3.147 0.00165 **
## no_pasg -0.6339 0.4294 -1.476 0.13987
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 436.043 on 830 degrees of freedom
## Residual deviance: 37.707 on 827 degrees of freedom
## AIC: 45.707
##
## Number of Fisher Scoring iterations: 12
##
## Call:
## glm(formula = risky.landing ~ speed_ground + aircraft, family = binomial,
## data = stdz_FAA.r1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -26.5272 6.4611 -4.106 4.03e-05 ***
## speed_ground 17.3544 4.2116 4.121 3.78e-05 ***
## aircraft -2.0060 0.6236 -3.217 0.0013 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 436.043 on 830 degrees of freedom
## Residual deviance: 40.097 on 828 degrees of freedom
## AIC: 46.097
##
## Number of Fisher Scoring iterations: 12
1. Single Factor Logistic Regression Analysis
The single factor regression analysis results indicate that speed_ground, speed_air, and aircraft have very low p-values (<0.001), suggesting strong statistical significance in predicting risky.landing. On the other hand, pitch, no_pasg, duration, and height have high p-values (>0.05), indicating that they are not statistically significant predictors.
Regarding the direction of the regression coefficients, speed_ground, speed_air, and pitch have positive coefficients, meaning they increase the likelihood of risky landing. Conversely, aircraft, no_pasg, duration, and height have negative coefficients, suggesting that they reduce the likelihood of risky landing.
Given these findings, speed_ground, speed_air, and aircraft should be retained for further analysis, while pitch, no_pasg, duration, and height should be excluded due to their lack of statistical significance.
2. Multiple Logistic Regression Models
When multiple logistic regression models are considered, the model including speed_ground, speed_air, and aircraft shows that speed_air and aircraft remain statistically significant (p < 0.01), while speed_ground has a high standard error, indicating potential multicollinearity. Its residual deviance is low (26.279), but the model is fitted on only the 203 flights with a recorded speed_air, so its deviance and AIC are not comparable with those of the models fitted on all 831 flights.
A different model incorporating speed_ground, aircraft, and no_pasg retains speed_ground and aircraft as significant predictors, while no_pasg remains insignificant (p = 0.13987), meaning it adds little predictive value. The AIC of this model is 45.707, only slightly lower than the 46.097 of the model without no_pasg.
Finally, the model with only speed_ground and aircraft is the preferred choice for predicting risky.landing. Both predictors are significant, the residual deviance is low (40.097), and the AIC (46.097) is within 0.4 of the model with no_pasg while using one fewer parameter.
3. Model Selection Using AIC and BIC
The AIC-based forward selection keeps speed_ground, aircraft, and no_pasg (AIC 45.707), while the BIC-based selection keeps only speed_ground and aircraft (AIC 46.097), dropping no_pasg for a simpler model. The model with speed_ground, speed_air, and aircraft reports a lower AIC (34.279), but it is fitted on only 203 flights because of missing speed_air values, so that figure cannot be compared with the other two.
Since AIC tends to favor more complex models while BIC prioritizes simplicity, the BIC-selected model (speed_ground and aircraft) is the better choice to avoid overfitting while maintaining strong predictive power.
The final analysis identifies speed_ground and aircraft as the most significant predictors of risky.landing. The BIC-selected model is preferred due to its simplicity and interpretability, while still maintaining good model performance.
Step 10
Risk Factors for Risky Landings and Their Influence on Occurrence – FAA Agent Report
I have selected the BIC model for analyzing risky landings, as it offers a less complex representation compared to other models. The variable no_pasg does not hold significant importance in the BIC model when compared to the AIC model and the full model, leading to its exclusion. The analysis indicates that an increase in ground speed significantly raises the risk of a risky landing. While air speed also contributes to this risk, it is excluded from the final model due to collinearity and missing values, which could impact AIC-based selection. The table below presents a comparison of the models built for long landings and risky landings.
- Higher landing speed greatly increases the risk of a risky landing, with even small speed increases dramatically raising the odds.
- Certain aircraft types are associated with lower risk, meaning some planes handle landings better than others.
- A table summarizing the regression results will highlight speed and aircraft type as the key risk factors.
- Three visuals will show how risky landings increase with speed, vary by aircraft type, and the predicted probability of risk based on both factors.
- Controlling landing speed and optimizing aircraft-specific landing protocols can significantly reduce risky landings and improve safety.
Compare the two models built for “long.landing” and “risky.landing”
Step 11
##
## Call:
## glm(formula = long.landing ~ speed_ground + aircraft + height,
## family = binomial, data = stdz_FAA.l1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -19.8638 3.6399 -5.457 4.84e-08 ***
## speed_ground 17.3600 3.2304 5.374 7.70e-08 ***
## aircraft -2.5196 0.5566 -4.527 5.99e-06 ***
## height 2.2609 0.5831 3.877 0.000106 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 622.778 on 830 degrees of freedom
## Residual deviance: 57.047 on 827 degrees of freedom
## AIC: 65.047
##
## Number of Fisher Scoring iterations: 11
##
## Call:
## glm(formula = risky.landing ~ speed_ground + aircraft, family = binomial,
## data = stdz_FAA.r1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -26.5272 6.4611 -4.106 4.03e-05 ***
## speed_ground 17.3544 4.2116 4.121 3.78e-05 ***
## aircraft -2.0060 0.6236 -3.217 0.0013 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 436.043 on 830 degrees of freedom
## Residual deviance: 40.097 on 828 degrees of freedom
## AIC: 46.097
##
## Number of Fisher Scoring iterations: 12
- speed_ground and aircraft are the two most significant factors in both models, showing a consistent influence on landing outcomes.
- Height is included as a predictor in the BIC model for long landings but not in the risky landing model, indicating a difference in contributing factors.
- The risky landing model has a lower AIC (46.097) than the long landing model (65.047). Because the two models have different response variables, their AIC values are not directly comparable, so this difference alone does not show that one model fits better.
Step 12
The ROC curves for the long landing and risky landing models indicate that the area under the curve (AUC) is higher for the risky landing model. This suggests that the risky landing model is more effective at distinguishing between its two outcomes than the long landing model is.
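The AUC comparison can be reproduced without an ROC package by using the rank-based (Mann-Whitney) form of the AUC. This is a swapped-in illustration on simulated data, not the report's method; the helper `auc`, the data frame `dat` and the effect sizes are hypothetical.

```r
# Rank-based AUC: the probability that a randomly chosen positive case
# gets a higher predicted score than a randomly chosen negative case.
auc <- function(y, p) {
  r  <- rank(p)          # average ranks, so ties count as one half
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(5)
n <- 400
# Simulated stand-in for the cleaned data (hypothetical values).
dat <- data.frame(speed_ground = rnorm(n), aircraft = rbinom(n, 1, 0.5))
dat$long.landing  <- rbinom(n, 1, plogis(-1 + 1.5 * dat$speed_ground - dat$aircraft))
dat$risky.landing <- rbinom(n, 1, plogis(-2 + 2.5 * dat$speed_ground - dat$aircraft))

fit_long  <- glm(long.landing ~ speed_ground + aircraft,
                 family = binomial, data = dat)
fit_risky <- glm(risky.landing ~ speed_ground + aircraft,
                 family = binomial, data = dat)

# In-sample AUC of each model, from its fitted probabilities.
auc_long  <- auc(dat$long.landing,  fitted(fit_long))
auc_risky <- auc(dat$risky.landing, fitted(fit_risky))
c(long = auc_long, risky = auc_risky)
```

These are in-sample values; an AUC computed on held-out flights would be a fairer basis for comparing the two models.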
Step 13
##
## --- Prediction Results ---
## Long Landing Probability (95% CI): 1 - 1
## Risky Landing Probability (95% CI): 0.999 - 1.001
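The predicted probability of a long landing is 1 (95% CI: 1 - 1), and the predicted probability of a risky landing is approximately 1 (95% CI: 0.999 - 1.001). Both models therefore predict that the given airplane will have a long and risky landing. These intervals should not be read as complete certainty, however. An upper bound of 1.001 is not a valid probability, which suggests the interval was formed on the response scale (estimate ± 1.96 × SE) rather than on the link scale. The fitted models also have very large speed_ground coefficients and very small residual deviances, so predicted probabilities saturate near 0 or 1 and their standard errors become unreliable.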
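One common way to keep interval bounds inside [0, 1] is to build the interval on the link scale and then apply the inverse link. The sketch below uses simulated data and a hypothetical new observation; the values in new_obs are assumptions and not the airplane described in the report.

```r
# Simulated stand-in data (assumption).
set.seed(1)
n <- 800
d <- data.frame(speed_ground = rnorm(n), aircraft = rbinom(n, 1, 0.5))
d$risky.landing <- rbinom(n, 1, plogis(-2 + 2 * d$speed_ground - d$aircraft))
fit <- glm(risky.landing ~ speed_ground + aircraft, family = binomial, data = d)

new_obs <- data.frame(speed_ground = 4, aircraft = 0)  # hypothetical extreme case
z <- qnorm(0.975)

# Interval on the link scale, mapped back through the inverse link.
pr_link <- predict(fit, newdata = new_obs, type = "link", se.fit = TRUE)
linkinv <- family(fit)$linkinv
eta_hat <- unname(pr_link$fit)
se_eta  <- unname(pr_link$se.fit)
estimate <- linkinv(eta_hat)
lower    <- linkinv(eta_hat - z * se_eta)
upper    <- linkinv(eta_hat + z * se_eta)
c(lower = lower, estimate = estimate, upper = upper)

# For comparison: a response-scale interval, which is not constrained
# to [0, 1] and can produce bounds such as 1.001.
pr_resp <- predict(fit, newdata = new_obs, type = "response", se.fit = TRUE)
naive_upper <- unname(pr_resp$fit + z * pr_resp$se.fit)
naive_upper
```

Because family(fit)$linkinv is taken from the fitted object, the same code works unchanged for probit or cloglog fits.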
Step 14
##
## Call: glm(formula = risky.landing ~ speed_ground + aircraft, family = binomial(link = probit),
## data = stdz_FAA.r1)
##
## Coefficients:
## (Intercept) speed_ground aircraft
## -15.259 9.972 -1.176
##
## Degrees of Freedom: 830 Total (i.e. Null); 828 Residual
## Null Deviance: 436
## Residual Deviance: 39.44 AIC: 45.44
##
## Call: glm(formula = risky.landing ~ speed_ground + aircraft, family = binomial(link = cloglog),
## data = stdz_FAA.r1)
##
## Coefficients:
## (Intercept) speed_ground aircraft
## -18.436 11.655 -1.447
##
## Degrees of Freedom: 830 Total (i.e. Null); 828 Residual
## Null Deviance: 436
## Residual Deviance: 41.44 AIC: 47.44
##
## Call: glm(formula = risky.landing ~ speed_ground + aircraft, family = binomial,
## data = stdz_FAA.r1)
##
## Coefficients:
## (Intercept) speed_ground aircraft
## -26.527 17.354 -2.006
##
## Degrees of Freedom: 830 Total (i.e. Null); 828 Residual
## Null Deviance: 436
## Residual Deviance: 40.1 AIC: 46.1
The output shows three generalized linear models (GLMs) for risky landings, fitted with probit, cloglog, and logit links. All three include speed_ground and aircraft as predictors. The coefficients have the same sign in every model, positive for speed_ground and negative for aircraft, so the direction of each effect on the probability of a risky landing is consistent. The magnitudes differ because each link places the linear predictor on a different scale: the logit coefficients (17.354 and -2.006) are about 1.7 times the probit coefficients (9.972 and -1.176), which is the usual scaling between these two links, and the cloglog coefficients (11.655 and -1.447) fall in between.
Additionally, the Akaike Information Criterion (AIC) values for all three models are relatively close - 45.44 for Probit, 47.44 for Cloglog, and 46.1 for the logistic model. The residual deviances are also similar, suggesting that all three models have comparable goodness-of-fit.
Since all three models retain speed_ground and aircraft with consistent signs, these variables should be given priority in further analysis and risk mitigation strategies. The close AIC values indicate that no model clearly outperforms the others; the probit model has the lowest AIC (45.44), suggesting a marginally better fit.
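The comparison across links can be sketched by fitting the same formula three times and collecting the coefficients and AIC values. The data below are simulated, so the numbers will not match the report; only the pattern (matching signs, logit coefficients larger than probit) is the point.

```r
# Simulated stand-in data (assumption).
set.seed(1)
n <- 800
d <- data.frame(speed_ground = rnorm(n), aircraft = rbinom(n, 1, 0.5))
d$risky.landing <- rbinom(n, 1, plogis(-2 + 2 * d$speed_ground - d$aircraft))

# Fit the same model under each link function.
links <- c("logit", "probit", "cloglog")
fits <- lapply(links, function(lnk) {
  glm(risky.landing ~ speed_ground + aircraft,
      family = binomial(link = lnk), data = d)
})
names(fits) <- links

coefs <- sapply(fits, coef)  # one column of coefficients per link
aics  <- sapply(fits, AIC)   # comparable: same response, same data
coefs
aics

# Ratio of logit to probit coefficients.
coefs[, "logit"] / coefs[, "probit"]
```

AIC is a valid yardstick here because all three models share the same response and data; only the link differs.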
Step 15
The ROC curves for the three models (Probit, Cloglog, and Model R, the logistic model) illustrate their classification performance. All three curves rise steeply, reaching high sensitivity at low false positive rates (1 - Specificity), which indicates strong discrimination. Among them, the Cloglog model appears to have a slightly higher area under the curve (AUC); a higher AUC means the model is better at ranking positive cases above negative ones.
On the basis of AUC, the Cloglog (hazard) model is the preferred choice for prediction and should be prioritized for further analysis. The advantage appears small, though, so the choice should be confirmed with additional metrics such as precision, recall, and the F1-score at a chosen threshold, and with model diagnostics, before a final decision is made.
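AUC can also be computed without drawing the curve, using the rank-sum (Mann-Whitney) identity: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below applies this to the three links on simulated data; auc_rank is a hypothetical helper name.

```r
# Simulated stand-in data (assumption).
set.seed(1)
n <- 800
d <- data.frame(speed_ground = rnorm(n), aircraft = rbinom(n, 1, 0.5))
d$risky.landing <- rbinom(n, 1, plogis(-2 + 2 * d$speed_ground - d$aircraft))

# Rank-based AUC; rank() assigns average ranks to tied scores.
auc_rank <- function(y, score) {
  r  <- rank(score)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

links <- c("logit", "probit", "cloglog")
aucs <- sapply(links, function(lnk) {
  fit <- glm(risky.landing ~ speed_ground + aircraft,
             family = binomial(link = lnk), data = d)
  # Score on the link scale: avoids ties from probabilities that round to 0 or 1.
  auc_rank(d$risky.landing, predict(fit, type = "link"))
})
aucs
```

Since AUC depends only on how observations are ranked, models with the same predictors and similar coefficient ratios tend to give very similar values.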
Step 16
## Top 5 Risky Landings Based on Predicted Probabilities
##
##
## Model 1: Risky Landings
## 362 307 64 387 408
## 1 1 1 1 1
##
## Model 2: Risky Landings
## 56 64 134 176 179
## 1 1 1 1 1
##
## Model 3: Risky Landings
## 19 29 30 56 64
## 1 1 1 1 1
The output lists the five landings with the highest predicted probability of being risky under each of the three models. The 64th observation appears in all three lists, so it is consistently classified as high risk across modeling approaches, whereas the 56th observation appears in two lists (Models 2 and 3, identified here as the probit and hazard models). Every listed probability prints as 1, and the indices for Models 2 and 3 appear in ascending order, which is consistent with many observations being tied at a probability of 1. In that case the top five are not uniquely determined, and the overlap between models should be interpreted with caution.
The consistency of the 64th observation across all models suggests that this particular case is highly risky and should be given priority in further analysis. Similarly, the repeated appearance of the 56th observation in two models reinforces its classification as a potentially high-risk event. Moving forward, these high-risk observations should be further investigated to understand their underlying causes. Additional validation using real-world data or expert domain knowledge may help confirm their risk level. If these observations represent actual risky landing scenarios, targeted interventions or safety measures should be considered to mitigate potential risks.
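When predicted probabilities print as 1, as in the output above, ranking on the linear predictor (link scale) instead of the probability avoids ties caused by floating-point saturation. The sketch below illustrates this with a probit fit on simulated data; the object names are assumptions.

```r
# Simulated stand-in data (assumption).
set.seed(1)
n <- 800
d <- data.frame(speed_ground = rnorm(n), aircraft = rbinom(n, 1, 0.5))
d$risky.landing <- rbinom(n, 1, plogis(-2 + 2 * d$speed_ground - d$aircraft))
fit_probit <- glm(risky.landing ~ speed_ground + aircraft,
                  family = binomial(link = "probit"), data = d)

# Saturation: different linear predictors, identical stored probabilities.
pnorm(9) == pnorm(12)  # TRUE: both are exactly 1 in double precision

# Rank on the link scale, where large values remain distinguishable.
eta  <- predict(fit_probit, type = "link")       # named by row label
top5 <- names(sort(eta, decreasing = TRUE))[1:5]
top5
eta[top5]
```

Sorting saturated probabilities keeps tied observations in their original row order, which is why a tied top-five list can come out in ascending index order.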
Step 17
## 1 1
## 0.9999976 1.0000020
## 1 1
## 1 1
The output shows predictions from the Probit and Cloglog models for a new observation with specific characteristics. Both models return values that are essentially 1, indicating a very high predicted likelihood of a risky landing, and the binary classification from both models is 1. The pair 0.9999976 and 1.0000020 brackets 1, and its upper value exceeds 1, which is not a valid probability; these are most plausibly the bounds of a confidence interval computed on the response scale (estimate ± 1.96 × SE) around an estimate that has saturated at 1. The intervals are very narrow, but they should be recomputed on the link scale before being interpreted as bounds on a probability.
Both models consistently predict that the given conditions will result in a risky landing with near certainty. Because the fitted models contain only speed_ground and aircraft, the prediction is driven by those two inputs; height and the other characteristics of the new observation do not enter these models. Given this result, it would be important to check whether such conditions align with historical risky landings and whether preventive measures could be implemented. Further investigation could include evaluating how changes in the inputs (e.g., reducing ground speed) affect the predicted probability of a risky landing. If applicable, this could inform decision-making for flight operations, pilot training, or automated landing assistance systems to minimize potential risks.
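The sensitivity check suggested above can be sketched by predicting over a grid of ground speeds for each aircraft type. The data are simulated and the grid range is an arbitrary choice on the standardized scale, so the numbers are illustrative only.

```r
# Simulated stand-in data (assumption).
set.seed(1)
n <- 800
d <- data.frame(speed_ground = rnorm(n), aircraft = rbinom(n, 1, 0.5))
d$risky.landing <- rbinom(n, 1, plogis(-2 + 2 * d$speed_ground - d$aircraft))

fit_probit  <- glm(risky.landing ~ speed_ground + aircraft,
                   family = binomial(link = "probit"), data = d)
fit_cloglog <- glm(risky.landing ~ speed_ground + aircraft,
                   family = binomial(link = "cloglog"), data = d)

# Grid of standardized ground speeds for both aircraft codes (0/1 coding assumed).
grid <- expand.grid(speed_ground = seq(-2, 3, by = 0.5), aircraft = c(0, 1))
grid$p_probit  <- predict(fit_probit,  newdata = grid, type = "response")
grid$p_cloglog <- predict(fit_cloglog, newdata = grid, type = "response")

# Predicted risk as speed increases, for aircraft code 0.
subset(grid, aircraft == 0)
```

Reading down the table shows how quickly the predicted probability moves from near 0 to near 1 as ground speed increases, and where the two links disagree most.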