Part 1: Data cleaning

The dataset was loaded and cleaned as part of previous assignment. The cleaned data is stored in FAA_c data frame. A minimal version of the code is as follows:

# Step 1
FAA1 <- read_excel("FAA1.xls", sheet = "FAA1")
FAA2 <- read_excel("FAA2.xls", sheet = "FAA2")

# Step 3
FAA_COMBO <- union_all(FAA1, FAA2)
FAA <- FAA_COMBO[!duplicated(FAA_COMBO[, c(1,3,4,5,6,7,8)]),]

# Step 6
### filtering abnormal values of duration
FAA_c <- FAA[(FAA$duration > 40 | is.na(FAA$duration)),]

### filtering abnormal values of speed_ground
FAA_c <- FAA_c[(FAA_c$speed_ground > 30 & FAA_c$speed_ground < 140),]

### filtering abnormal values of speed_air
FAA_c <- FAA_c[((FAA_c$speed_air > 30 & FAA_c$speed_air < 140) | is.na(FAA_c$speed_air)),]

### filtering abnormal values of height
FAA_c <- FAA_c[FAA_c$height >= 6,]

### filtering abnormal values of distance
FAA_c <- FAA_c[FAA_c$distance < 6000,]

The cleaned dataset is stored as FAA_c. It has 831 observations and 8 variables.

Part 2: Model binary response using logistic regression

Create binary response

Step 1: Attach binary variables

This section utilizes the cleaned dataset generated in Part 1. Two binary variables long.landing and risky.landing are created and added to FAA_c as per following rule.

\[long.landing = 1\;if\;distance > 2500; =0\;otherwise\]

\[risky.landing = 1\;if\;distance > 3000; =0\;otherwise\]

long.landing <- ifelse(FAA_c$distance > 2500, 1, 0)
risky.landing <- ifelse(FAA_c$distance > 3000, 1, 0)

FAA_b <- cbind(FAA_c[-8], long.landing, risky.landing)

The continuous distance field is not used and instead the above mentioned binary fields are used.

There are 103 instances of long landing and 61 instances of risky landing.

Identify important parameters for predicting long.landing binary variable

(Long Landing) Step 2

Since long.landing is binary, it will only have value 0 or 1. Their distribution is as follows:

ggplot(FAA_b, aes(x = long.landing)) + geom_bar()
Distribution of long.landing variable

Distribution of long.landing variable

pct<-round(table(FAA_b$long.landing)/length(FAA_b$long.landing)*100,1)
labs<-c("No","Yes")
labs<-paste(labs,pct)
labs<-paste(labs,"%",sep="" )
pie(table(FAA_b$long.landing),labels=labs,col=rainbow(length(labs)),
    main="Pie chart of whether it was a long landing")

It is observed that 12.4% of the observations are long landings. Thus, the data for the response variable long.landing is unbalanced.

(Long Landing) Step 3

A single factor regression analysis against response long.landing is performed for each potential risk factor. The code is:

glm_aircraft <- glm(long.landing ~ aircraft, 
                    data = FAA_b, family = "binomial")
glm_duration <- glm(long.landing ~ duration, 
                    data = FAA_b, family = "binomial")
glm_no_pasg <- glm(long.landing ~ no_pasg, 
                   data = FAA_b, family = "binomial")
glm_speed_ground <- glm(long.landing ~ speed_ground, 
                        data = FAA_b, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
glm_speed_air <- glm(long.landing ~ speed_air, 
                     data = FAA_b, family = "binomial")
glm_height <- glm(long.landing ~ height, 
                  data = FAA_b, family = "binomial")
glm_pitch <- glm(long.landing ~ pitch, 
                 data = FAA_b, family = "binomial")

The factors are scaled in order to assess their impact on the response variable on the basis of the coefficient estimates.

FAA_b$duration_n <- 
  (FAA_b$duration - mean(FAA_b$duration, na.rm = T))/sd(FAA_b$duration, na.rm = T)
FAA_b$no_pasg_n <- 
  (FAA_b$no_pasg - mean(FAA_b$no_pasg, na.rm = T))/sd(FAA_b$no_pasg, na.rm = T)
FAA_b$speed_ground_n <- 
  (FAA_b$speed_ground - mean(FAA_b$speed_ground, na.rm = T))/sd(FAA_b$speed_ground, na.rm = T)
FAA_b$speed_air_n <- 
  (FAA_b$speed_air - mean(FAA_b$speed_air, na.rm = T))/sd(FAA_b$speed_air, na.rm = T)
FAA_b$height_n <- 
  (FAA_b$height - mean(FAA_b$height, na.rm = T))/sd(FAA_b$height, na.rm = T)
FAA_b$pitch_n <- 
  (FAA_b$pitch - mean(FAA_b$pitch, na.rm = T))/sd(FAA_b$pitch, na.rm = T)

A single factor regression analysis was again performed against response long.landing for each of the scaled predictors. The code is:

glm_aircraft_n <- glm(long.landing ~ aircraft, 
                      data = FAA_b, family = "binomial")
glm_duration_n <- glm(long.landing ~ duration_n, 
                      data = FAA_b, family = "binomial")
glm_no_pasg_n <- glm(long.landing ~ no_pasg_n, 
                     data = FAA_b, family = "binomial")
glm_speed_ground_n <- glm(long.landing ~ speed_ground_n, 
                          data = FAA_b, family = "binomial")
glm_speed_air_n <- glm(long.landing ~ speed_air_n, 
                       data = FAA_b, family = "binomial")
glm_height_n <- glm(long.landing ~ height_n, 
                    data = FAA_b, family = "binomial")
glm_pitch_n <- glm(long.landing ~ pitch_n, 
                   data = FAA_b, family = "binomial")
Variable ranked by P-value
Variable Abs. Coefficient Odds Ratio Direction P-Value
4 speed_ground 0.4723458 1.603752 positive 0.0000000
5 speed_air 0.5123218 1.669162 positive 0.0000000
1 aircraft 0.8641199 2.372917 positive 0.0000840
7 pitch 0.4005278 1.492612 positive 0.0466498
6 height 0.0086240 1.008661 positive 0.4218576
3 no_pasg 0.0072564 1.007283 negative 0.6058565
2 duration 0.0010705 1.001071 negative 0.6305122
## [1] "--"
Variable ranked by P-value (Scaled)
Variable Abs. Coefficient Odds Ratio Direction P-Value
4 speed_ground 8.8497167 6972.413438 positive 0.0000000
5 speed_air 4.9881068 146.658509 positive 0.0000000
1 aircraft 0.8641199 2.372917 positive 0.0000840
7 pitch 0.2109056 1.234796 positive 0.0466498
6 height 0.0843842 1.088047 positive 0.4218576
3 no_pasg 0.0543600 1.055865 negative 0.6058565
2 duration 0.0517582 1.053121 negative 0.6305122

The factors are arranged in order of their importance based on the p-values. The p-value should be less than 0.05 for a 95% level of confidence used while testing the null hypothesis that the factor does not impact the response variable. The smaller the p-value, the more significant the factor.

Based on these criteria speed_ground, speed_air, aircraft and pitch were identified as statistically important factors for analysis.

Note: The odds ratio for long.landing is higher when aircraft make is Boeing.

(Long Landing) Step 4: Visualization

The above factors and their relation with long.landing is shown below. The charts represent various graphical representations of the significant factors - Aircraft, Speed Ground, Speed Air and Pitch.

plot(as.factor(long.landing) ~ speed_ground, data = FAA_b)

plot(as.factor(long.landing) ~ speed_air, data = FAA_b)

plot(as.factor(long.landing) ~ pitch, data = FAA_b)

plot(jitter(long.landing,0.1)~jitter(speed_ground),FAA_b,xlab="Speed Ground",
     ylab="Long Landing",pch=".")

plot(jitter(long.landing,0.1)~jitter(speed_air),FAA_b,xlab="Speed Air",
     ylab="Long Landing",pch=".")

plot(jitter(long.landing,0.1)~jitter(pitch),FAA_b,xlab="Pitch",
     ylab="Long Landing",pch=".")

ggplot(FAA_b,aes(x=aircraft,fill=as.factor(long.landing)))+
  geom_bar(position="dodge")

ggplot(FAA_b,aes(x=speed_ground,fill=as.factor(long.landing)))+
  geom_histogram(position="dodge",binwidth=1)

ggplot(FAA_b,aes(x=speed_air,fill=as.factor(long.landing)))+
  geom_histogram(position="dodge",binwidth=1)

ggplot(FAA_b,aes(x=pitch,fill=as.factor(long.landing)))+
  geom_histogram(position="dodge",binwidth=1)

The significant variables and their relationship with long landing were visualized using a bar chart, scatter plot and histogram.

Based on the above charts and graphs, the following conclusions can be made for each significant variable:

  • Speed_ground : All the instances of long landing are corresponding to values of speed_ground greater than 90mph
  • Speed_air : All the instances of long landing are corresponding to values of speed_air greater than 95mph
  • Aircraft : The ratio long landing / not a long landing is significantly higher for aircraft make Boeing. The absolute number of instances of long landing are also higher for Boeing.
  • Pitch : There is a slight increase in the number of long landing instances for a higher pitch as compared to a lower pitch value.

(Long Landing) Step 5: Create full model

A full model is created based on output of last two steps after performing a collinearity check for the factors speed_ground and speed_air.

# checking for collinearity
model1 <- glm(long.landing ~ speed_ground, 
              data = FAA_b, family = "binomial")
model2 <- glm(long.landing ~ speed_air, 
              data = FAA_b, family = "binomial")
model3 <- glm(long.landing ~ speed_ground + speed_air, 
              data = FAA_b, family = "binomial")

correlation <- cor(FAA_b$speed_ground, FAA_b$speed_air, use = "complete.obs")

The correlation value between speed_ground and speed_air is 0.9879383.

  • AIC value for speed_ground: 119.4703208
  • AIC value for speed_air: 106.3274885
  • AIC value for speed_ground + speed_air: 108.0278797

The factors speed_groud and speed_air show a very high correlation. Based on the AIC values of the three models, the factor speed_air has the lowest AIC value. However, speed_air has 628 missing values while speed_ground has 0 missing values. Hence, speed_air is removed from further analysis and only speed_ground is included.

The full model is created based on the cleaned dataset using following snippet. Its various metrics are shown below:

full.model.l <- glm(long.landing ~ aircraft + speed_ground + pitch, 
  data = FAA_b, 
  family = "binomial")

s <- summary(full.model.l)
kable(s$coefficients, caption = "Model Coefficients", digits = 3)
Model Coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -67.929 10.484 -6.479 0.000
aircraftboeing 3.043 0.733 4.150 0.000
speed_ground 0.615 0.092 6.694 0.000
pitch 1.066 0.604 1.765 0.078

The variable pitch is not significant for this final model. Aircraft make and speed_ground are the only two variables impacting the response long.landing significantly.

AIC for the model is 89.30904

(Long Landing) Step 6: Forward variable selection using AIC (using step() method)

Another model was built using forward stepwise selection with AIC criterion on all the factors.

model0 <- glm(
  long.landing ~ aircraft + duration + no_pasg + speed_ground + speed_air + height + pitch,
  data = FAA_b, 
  family = "binomial")

aic.model.l <- step(model0, trace = 0, direction = "forward")
Coefficients for Forward Variable Selection using AIC
Estimate Std. Error z value Pr(>|z|)
(Intercept) -196.359 56.071 -3.502 0.000
aircraftboeing 8.784 2.623 3.349 0.001
height 0.423 0.143 2.956 0.003
speed_air 1.985 0.708 2.804 0.005
pitch 1.469 1.055 1.392 0.164
no_pasg -0.074 0.070 -1.050 0.294
speed_ground -0.226 0.384 -0.587 0.557
duration 0.000 0.010 0.029 0.977

Other parameters of AIC model are:

  • Null deviance: 270.1991812 on 194 on degrees of freedom
  • Residual deviance: 32.9094605 on 187 degrees of freedom
  • AIC: 48.9094605
  • Number of Fischer scoring iterations: 10
  • Observations deleted due to missigness: 636

The forward selection model with AIC criterion provided aircraft, speed_air and height as significant factors. height is significant in this model and was not significant in the model built in step 5.
AIC value for this model is 48.9094605 which is much lower than the full model in step 5, suggesting this is a better model.

(Long Landing) Step 7: Forward variable selection using BIC (using step() method)

Next, a model was built using forward stepwise selection with BIC criterion on all the factors.

bic.model.l <- step(model0, trace = 0, direction = "forward", criterion = "BIC")
Coefficients for Forward Variable Selection using BIC
Estimate Std. Error z value Pr(>|z|)
(Intercept) -196.359 56.071 -3.502 0.000
aircraftboeing 8.784 2.623 3.349 0.001
height 0.423 0.143 2.956 0.003
speed_air 1.985 0.708 2.804 0.005
pitch 1.469 1.055 1.392 0.164
no_pasg -0.074 0.070 -1.050 0.294
speed_ground -0.226 0.384 -0.587 0.557
duration 0.000 0.010 0.029 0.977

Other parameters of BIC model are:

  • Null deviance: 270.199 on 194 on degrees of freedom
  • Residual deviance: 32.909 on 187 degrees of freedom
  • BIC: 48.9094605
  • Number of Fischer scoring iterations: 10
  • Observations deleted due to missigness: 636

The forward selection model with BIC criterion provided aircraft, speed_air and height as significant factors. The first two factors are same as the ones obtained in the model in Step 5. However, height is significant in this model and was not significant in the model built in step 5. This is the same model as obtained in step 6
AIC value for this model is 48.9094605 which is much lower than the model in step 5, suggesting this is a better model.

Step 8: Risk factors for long landing and how they impact its occurrence.

Given a limited opportunity to make my case, I would prefer following information and approach:

Model: The model has aircraft, speed_air and height as the significant predictors for the response variable long landing.

Table:

ggplot(FAA_b,aes(x=aircraft,fill=as.factor(long.landing)))+
  geom_bar(position="dodge")

ggplot(FAA_b,aes(x=speed_air,fill=as.factor(long.landing)))+
  geom_histogram(position="dodge",binwidth=1)
## Warning: Removed 628 rows containing non-finite values (stat_bin).

ggplot(FAA_b,aes(x=height,fill=as.factor(long.landing)))+
  geom_histogram(position="dodge",binwidth=1)

Executive summary:

  1. The data is unbalanced and has 103 entries with an instance of long landing out of 831
  2. If the aircraft make is Boeing the chances of long landing are drastically higher than that for the aircraft Airbus
  3. For every 1 mph increase in the value of speed air, the chances of long landing increase by 628%.
  4. For every 1 m increase in the value of height, the chances of long landing increase by 53%

Identify important parameters using risky.landing

(Risky Landing) Step 2

Since risky.landing is binary, it will only have value 0 or 1. Their distribution is as follows:

ggplot(FAA_b, aes(x = risky.landing)) + geom_bar()
Distribution of long.landing variable

Distribution of long.landing variable

pct<-round(table(FAA_b$risky.landing)/length(FAA_b$risky.landing)*100,1)
labs<-c("No","Yes")
labs<-paste(labs,pct)
labs<-paste(labs,"%",sep="" )
pie(table(FAA_b$risky.landing),labels=labs,col=rainbow(length(labs)),
    main="Pie chart of whether it was a risky landing")

It is observed that 7.5% of the observations are risky landings. Thus, the data for the response variable risky.landing is unbalanced.

(Risky Landing) Step 3

A single factor regression analysis against response risky.landing is performed for each potential risk factor. The code is:

r_glm_aircraft <- glm(risky.landing ~ aircraft, 
                    data = FAA_b, family = "binomial")
r_glm_duration <- glm(risky.landing ~ duration, 
                    data = FAA_b, family = "binomial")
r_glm_no_pasg <- glm(risky.landing ~ no_pasg, 
                   data = FAA_b, family = "binomial")
r_glm_speed_ground <- glm(risky.landing ~ speed_ground, 
                        data = FAA_b, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
r_glm_speed_air <- glm(risky.landing ~ speed_air, 
                     data = FAA_b, family = "binomial")
r_glm_height <- glm(risky.landing ~ height, 
                  data = FAA_b, family = "binomial")
r_glm_pitch <- glm(risky.landing ~ pitch, 
                 data = FAA_b, family = "binomial")

The factors were already scaled in order to assess their impact on the response variable on the basis of the coefficient estimates.

A single factor regression analysis was again performed against response risky.landing for each of the scaled predictors.

r_glm_aircraft_n <- glm(risky.landing ~ aircraft, 
                      data = FAA_b, family = "binomial")
r_glm_duration_n <- glm(risky.landing ~ duration_n, 
                      data = FAA_b, family = "binomial")
r_glm_no_pasg_n <- glm(risky.landing ~ no_pasg_n, 
                     data = FAA_b, family = "binomial")
r_glm_speed_ground_n <- glm(risky.landing ~ speed_ground_n, 
                          data = FAA_b, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
r_glm_speed_air_n <- glm(risky.landing ~ speed_air_n, 
                       data = FAA_b, family = "binomial")
r_glm_height_n <- glm(risky.landing ~ height_n, 
                    data = FAA_b, family = "binomial")
r_glm_pitch_n <- glm(risky.landing ~ pitch_n, 
                   data = FAA_b, family = "binomial")
Variable ranked by P-value
Variable Abs. Coefficient Odds Ratio Direction P-Value
4 speed_ground 0.6142187 1.848212 positive 0.0000001
5 speed_air 0.8704019 2.387870 positive 0.0000037
1 aircraft 1.0017753 2.723112 positive 0.0004561
7 pitch 0.3710720 1.449287 positive 0.1432961
3 no_pasg 0.0253793 1.025704 negative 0.1536237
2 duration 0.0011518 1.001152 negative 0.6801987
6 height 0.0022186 1.002221 negative 0.8705917
## [1] "--"
Variable ranked by P-value (Scaled)
Variable Abs. Coefficient Odds Ratio Direction P-Value
4 speed_ground 11.5078031 99489.071286 positive 0.0000001
5 speed_air 8.4744743 4790.903747 positive 0.0000037
1 aircraft 1.0017753 2.723112 positive 0.0004561
7 pitch 0.1953950 1.215791 positive 0.1432961
3 no_pasg 0.1901247 1.209400 negative 0.1536237
2 duration 0.0556912 1.057271 negative 0.6801987
6 height 0.0217086 1.021946 negative 0.8705917

The factors are arranged in order of their importance based on the p-values. The p-value should be less than 0.05 for a 95% level of confidence used while testing the null hypothesis that the factor does not impact the response variable. The smaller the p-value, the more significant the factor. Based on these criteria speed_ground, speed_air and aircraft were identified as statistically important factors for analysis.

Note: The odds ratio for risky.landing is higher when aircraft make is Boeing.

(Risky Landing) Step 4: Visualization

The above factors and their relation with risky.landing is shown below. The charts represent various graphical representations of the significant factors - Aircraft, Ground Speed, Air Speed

plot(as.factor(risky.landing) ~ speed_ground, data = FAA_b)

plot(as.factor(risky.landing) ~ speed_air, data = FAA_b)

plot(jitter(risky.landing,0.1)~jitter(speed_ground),FAA_b,xlab="Speed Ground",
     ylab="Risky Landing",pch=".")

plot(jitter(risky.landing,0.1)~jitter(speed_air),FAA_b,xlab="Speed Air",
     ylab="Risky Landing",pch=".")

ggplot(FAA_b,aes(x=aircraft,fill=as.factor(risky.landing)))+
  geom_bar(position="dodge")

ggplot(FAA_b,aes(x=speed_ground,fill=as.factor(risky.landing)))+
  geom_histogram(position="dodge",binwidth=1)

ggplot(FAA_b,aes(x=speed_air,fill=as.factor(risky.landing)))+
  geom_histogram(position="dodge",binwidth=1)
## Warning: Removed 628 rows containing non-finite values (stat_bin).

The significant variables and their relationship with risky landing were visualized using a bar chart, scatter plot and histogram.

Based on the above charts and graphs, the following conclusions can be made for each significant variable:

  • Speed_ground : All the instances of risky landing are corresponding to values of speed_ground greater that approximately 100mph
  • Speed_air : All the instances of risky landing are corresponding to values of speed_air greater than 100mph
  • Aircraft : The ratio long landing / not a risky landing is significantly higher for aircraft make Boeing. The absolute number of instances of long landing are also higher for Boeing.

(Risky Landing) Step 5: Create full model

A full model is created based on output of last two steps after performing a collinearity check for the factors speed_ground and speed_air.

# checking for collinearity
r_model1 <- glm(risky.landing ~ speed_ground, 
              data = FAA_b, family = "binomial")
r_model2 <- glm(risky.landing ~ speed_air, 
              data = FAA_b, family = "binomial")
r_model3 <- glm(risky.landing ~ speed_ground + speed_air,
              data = FAA_b, family = "binomial")

correlation <- cor(FAA_b$speed_ground, FAA_b$speed_air, use = "complete.obs")

The correlation value between speed_ground and speed_air is 0.9879383.

  • AIC value for speed_ground: 62.9306044
  • AIC value for speed_air: 48.5802507
  • AIC value for speed_ground + speed_air: 49.9528945

The factors speed_ground and speed_air show a very high correlation. Based on the AIC values of the three models, the factor speed_air has the lowest AIC value. However, speed_air has 628 missing values while speed_ground has 0 missing values. Hence, speed_air is removed from further analysis and only speed_ground is included.

The full model is created based on the cleaned dataset using following snippet:

full.model.r <- glm(risky.landing ~ aircraft + speed_ground, 
  data = FAA_b, 
  family = "binomial")
Model Coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -102.077 24.775 -4.120 0.000
aircraftboeing 4.019 1.249 3.217 0.001
speed_ground 0.926 0.225 4.121 0.000

Aircraft make and ground speed are impacting the response risky.landing significantly.

AIC of model is 46.0965964

(Risky Landing) Step 6: Forward variable selection using AIC (using step() method)

The model is creating using following snippet:

model0_r <- glm(
  risky.landing ~ aircraft + duration + no_pasg + speed_ground + speed_air + height + pitch,
  data = FAA_b, 
  family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
aic.model.r <- step(model0_r, trace = 0, direction = "forward")
Coefficients for Forward Variable Selection using AIC
Estimate Std. Error z value Pr(>|z|)
(Intercept) -149.419 48.425 -3.086 0.002
speed_air 1.617 0.654 2.472 0.013
aircraftboeing 7.330 3.002 2.442 0.015
no_pasg -0.120 0.096 -1.253 0.210
pitch -1.316 1.430 -0.920 0.357
height 0.045 0.058 0.786 0.432
speed_ground -0.164 0.498 -0.328 0.743
duration 0.002 0.016 0.125 0.901

Other parameters of AIC model are:

  • Null deviance: 240.7242902 on 194 on degrees of freedom
  • Residual deviance: 22.1439647 on 187 degrees of freedom
  • AIC: 38.1439647
  • Number of Fischer scoring iterations: 10
  • Observations deleted due to missigness: 636

The forward selection model with AIC criterion also provided aircraft and speed_air as significant factors. These are same as the variables selected in step 5
AIC value for this model is 38.1439647 which is much lower than the model in step 5 (46.0965964), suggesting this is a better model.

(Risky Landing) Step 7: Forward variable selection using BIC (using step() method)

bic.model.r <- step(model0_r, trace = 0, direction = "forward", criterion = "BIC")
Coefficients for Forward Variable Selection using BIC
Estimate Std. Error z value Pr(>|z|)
(Intercept) -149.419 48.425 -3.086 0.002
speed_air 1.617 0.654 2.472 0.013
aircraftboeing 7.330 3.002 2.442 0.015
no_pasg -0.120 0.096 -1.253 0.210
pitch -1.316 1.430 -0.920 0.357
height 0.045 0.058 0.786 0.432
speed_ground -0.164 0.498 -0.328 0.743
duration 0.002 0.016 0.125 0.901

Other parameters of BIC model are:

  • Null deviance: 240.724 on 194 on degrees of freedom
  • Residual deviance: 22.144 on 187 degrees of freedom
  • AIC: 38.1439647
  • Number of Fischer scoring iterations: 10
  • Observations deleted due to missigness: 636

The forward selection model with BIC criterion also provides the same model as in step 6. Aircraft and speed_air are significant factors. These are same as the variables selected in step 5. AIC value for this model is 38.1439647 which is much lower than the model in step 5 (46.0965964), suggesting this is a better model.

In the full model, aircraft and speed_ground was used. As per AIC as well as BIC, aircraft and speed_air are significant factors.

(Risky Landing) Step 10: Risk factors for risky landing and how they impact its occurrence.**

Given a limited opportunity to make my case, I would prefer following information and approach:

Model: The model has aircraft and speed_air as the significant predictors for the response variable.

Table:

ggplot(FAA_b,aes(x=aircraft,fill=as.factor(risky.landing)))+
  geom_bar(position="dodge")

ggplot(FAA_b,aes(x=speed_air,fill=as.factor(risky.landing)))+
  geom_histogram(position="dodge",binwidth=1)
## Warning: Removed 628 rows containing non-finite values (stat_bin).

Executive summary:

  1. The data is unbalanced and has 61 entries with an instance of long landing out of 831
  2. If the aircraft make is Boeing the chances of long landing are higher than that for the aircraft Airbus
  3. For every 1 mph increase in the value of speed air, the chances of long landing increase by 240%.

Compare the two models built for long.landing and risky.landing

Step 11: Summarize difference between two models

  1. The model for long landing has three significant predictors aircraft, speed_air and height whereas the model for risky landing has two significant predictors aircraft and speed_air. The variable height loses its significance in predicting risky landing.
  2. The final model for long landing is based on stepwise selection with AIC criterion whereas that of risky landing is the full model.
  3. The odds ratio for aircraft is much higher for long landing than risky landing

Step 12: Plot the ROC curve for each model (sensitivity vs 1-specificity)

thresh <- seq(0.01,0.5,0.01)
predprob_l <- predict(aic.model.l, type = "response")
predprob_r <- predict(aic.model.r, type = "response")
long.landing_l <- long.landing[!is.na(FAA_b$speed_air) & !is.na(FAA_b$duration)]

sensitivity <- specificity<-rep(NA,length(thresh))

for(j in seq(along=thresh)) {
  pp<-ifelse(predprob_l < thresh[j], "no", "yes")
  xx<-xtabs(~long.landing_l + pp, FAA_b)
  specificity[j] <- xx[1,1]/(xx[1,1] + xx[1,2])
  sensitivity[j] <- xx[2,2]/(xx[2,1] + xx[2,2])
}

par(mfrow= c(1,2))
matplot(thresh,cbind(sensitivity, specificity), type="l", xlab="Threshold", ylab="Proportion", lty=1:2)
plot(1-specificity, sensitivity, type="l");abline(0, 1, lty=2)

thresh <- seq(0.01, 0.5, 0.01)
predprob_l <- predict(aic.model.l, type = "response")
predprob_r <- predict(aic.model.r, type = "response")
risky.landing_r <- risky.landing[!is.na(FAA_b$speed_air) & !is.na(FAA_b$duration)]

sensitivity.r <- specificity.r <- rep(NA,length(thresh))

for(j in seq(along=thresh)) {
  pp <- ifelse(predprob_r < thresh[j], "no", "yes")
  xx <- xtabs(~risky.landing_r + pp, FAA_b)
  specificity.r[j] <- xx[1, 1]/(xx[1, 1] + xx[1, 2])
  sensitivity.r[j] <- xx[2, 2]/(xx[2, 1] + xx[2, 2])
}
par(mfrow=c(1,2))
matplot(thresh,cbind(sensitivity.r,specificity.r),type="l",xlab="Threshold",ylab="Proportion",lty=1:2)
plot(1-specificity.r,sensitivity.r,type="l");abline(0,1,lty=2)

plot(1-specificity,sensitivity, type="l", col="blue")
points(1-specificity.r,sensitivity.r,type="l",col="red")
lines(1-specificity.r,sensitivity.r, col="red",lty=2)

auc(long.landing_l, predprob_l)
## Area under the curve: 0.996
auc(risky.landing_r, predprob_r)
## Area under the curve: 0.9973

The ROC curves for the two final models for long landing and risky landing were observed. The plots indicate the trade-off between sensitivity and specificity i.e. when sensitivity is increased, specificity decreases and vice versa.The more area under the ROC curve, the better the predictive power of the model.We have a very high AUC value for both long landing and risky landing, however, the AUC for risky landing is slightly higher than that for long landing model.

Step 13: for the given set of parameters, predict probability of being a long landing or risky landing**

Parameters:

Aircraft Boeing
duration 200
no_pasg 80
speed_ground 115
speed_air 120
height 40
pitch 4
info <- data.frame(aircraft = character(), duration = numeric(), no_pasg = numeric(), speed_ground = numeric(), speed_air = numeric(), height = numeric(), pitch = numeric(), stringsAsFactors = FALSE)

info <- rbind(
  info, 
  list("boeing", 200, 80, 115, 120, 40, 4))
colnames(info) <- c("aircraft", "duration", "no_pasg", "speed_ground", "speed_air", "height", "pitch")


pred_l <- predict(aic.model.l, info, type = "response", se.fit = T) 
pred_r <- predict(full.model.r, info, type = "response", se.fit = T)
paste("Predict value for long landing = ", pred_l$fit[["1"]])
## [1] "Predict value for long landing =  1"
long.confidence <- c(pred_l$fit-2*pred_l$se.fit[1],pred_l$fit+2*pred_l$se.fit[1])

paste("predict value for risky landing = ", pred_r$fit[["1"]])
## [1] "predict value for risky landing =  0.999788977010519"
c(pred_r$fit-2*pred_r$se.fit[1],pred_r$fit+2*pred_r$se.fit[1])
##         1         1 
## 0.9989074 1.0006706

The probability of long and risky landing was predicted for the given values of the variables. Both the probabilities were very high and estimated to be 1. The 95% confidence interval for long landing is very very narrow and it only includes the single point estimate 1. The 95% confidence interval for risky landing is also very narrow ranging from 0.9999997 to 1.0000002 i.e from 0.9999997 to 1 as probability cannot exceed 1.