Part 2: Model binary response using logistic regression

Create binary response

Step 1: Attach binary variables

This section utilizes the cleaned dataset generated in Part 1. Two binary variables long.landing and risky.landing are created and added to FAA_c as per following rule.

\[long.landing = 1\;if\;distance > 2500; =0\;otherwise\]

\[risky.landing = 1\;if\;distance > 3000; =0\;otherwise\]

long.landing <- ifelse(FAA_c$distance > 2500, 1, 0)
risky.landing <- ifelse(FAA_c$distance > 3000, 1, 0)

FAA_b <- cbind(FAA_c[-8], long.landing, risky.landing)

The continuous distance field is not used and instead the above mentioned binary fields are used.

There are 103 instances of long landing and 61 instances of risky landing.

Identify important parameters for predicting `long.landing` binary variable

(Long Landing) Step 2

Since long.landing is binary, it will only have value 0 or 1. Their distribution is as follows:

ggplot(FAA_b, aes(x = long.landing)) + geom_bar()

Distribution of long.landing variable

pct<-round(table(FAA_b$long.landing)/length(FAA_b$long.landing)*100,1)
labs<-c("No","Yes")
labs<-paste(labs,pct)
labs<-paste(labs,"%",sep="" )
pie(table(FAA_b$long.landing),labels=labs,col=rainbow(length(labs)),
    main="Pie chart of whether it was a long landing")

It is observed that 12.4% of the observations are long landings. Thus, the data for the response variable long.landing is unbalanced.

(Long Landing) Step 3

A single factor regression analysis against response long.landing is performed for each potential risk factor. The code is:

glm_aircraft <- glm(long.landing ~ aircraft, 
                    data = FAA_b, family = "binomial")
glm_duration <- glm(long.landing ~ duration, 
                    data = FAA_b, family = "binomial")
glm_no_pasg <- glm(long.landing ~ no_pasg, 
                   data = FAA_b, family = "binomial")
glm_speed_ground <- glm(long.landing ~ speed_ground, 
                        data = FAA_b, family = "binomial")

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

glm_speed_air <- glm(long.landing ~ speed_air, 
                     data = FAA_b, family = "binomial")
glm_height <- glm(long.landing ~ height, 
                  data = FAA_b, family = "binomial")
glm_pitch <- glm(long.landing ~ pitch, 
                 data = FAA_b, family = "binomial")

The factors are scaled in order to assess their impact on the response variable on the basis of the coefficient estimates.

FAA_b$duration_n <- 
  (FAA_b$duration - mean(FAA_b$duration, na.rm = T))/sd(FAA_b$duration, na.rm = T)
FAA_b$no_pasg_n <- 
  (FAA_b$no_pasg - mean(FAA_b$no_pasg, na.rm = T))/sd(FAA_b$no_pasg, na.rm = T)
FAA_b$speed_ground_n <- 
  (FAA_b$speed_ground - mean(FAA_b$speed_ground, na.rm = T))/sd(FAA_b$speed_ground, na.rm = T)
FAA_b$speed_air_n <- 
  (FAA_b$speed_air - mean(FAA_b$speed_air, na.rm = T))/sd(FAA_b$speed_air, na.rm = T)
FAA_b$height_n <- 
  (FAA_b$height - mean(FAA_b$height, na.rm = T))/sd(FAA_b$height, na.rm = T)
FAA_b$pitch_n <- 
  (FAA_b$pitch - mean(FAA_b$pitch, na.rm = T))/sd(FAA_b$pitch, na.rm = T)

A single factor regression analysis was again performed against response long.landing for each of the scaled predictors. The code is:

glm_aircraft_n <- glm(long.landing ~ aircraft, 
                      data = FAA_b, family = "binomial")
glm_duration_n <- glm(long.landing ~ duration_n, 
                      data = FAA_b, family = "binomial")
glm_no_pasg_n <- glm(long.landing ~ no_pasg_n, 
                     data = FAA_b, family = "binomial")
glm_speed_ground_n <- glm(long.landing ~ speed_ground_n, 
                          data = FAA_b, family = "binomial")
glm_speed_air_n <- glm(long.landing ~ speed_air_n, 
                       data = FAA_b, family = "binomial")
glm_height_n <- glm(long.landing ~ height_n, 
                    data = FAA_b, family = "binomial")
glm_pitch_n <- glm(long.landing ~ pitch_n, 
                   data = FAA_b, family = "binomial")

Variable ranked by P-value
	Variable	Abs. Coefficient	Odds Ratio	Direction	P-Value
4	speed_ground	0.4723458	1.603752	positive	0.0000000
5	speed_air	0.5123218	1.669162	positive	0.0000000
1	aircraft	0.8641199	2.372917	positive	0.0000840
7	pitch	0.4005278	1.492612	positive	0.0466498
6	height	0.0086240	1.008661	positive	0.4218576
3	no_pasg	0.0072564	1.007283	negative	0.6058565
2	duration	0.0010705	1.001071	negative	0.6305122

## [1] "--"

Variable ranked by P-value (Scaled)
	Variable	Abs. Coefficient	Odds Ratio	Direction	P-Value
4	speed_ground	8.8497167	6972.413438	positive	0.0000000
5	speed_air	4.9881068	146.658509	positive	0.0000000
1	aircraft	0.8641199	2.372917	positive	0.0000840
7	pitch	0.2109056	1.234796	positive	0.0466498
6	height	0.0843842	1.088047	positive	0.4218576
3	no_pasg	0.0543600	1.055865	negative	0.6058565
2	duration	0.0517582	1.053121	negative	0.6305122

Based on these criteria speed_ground, speed_air, aircraft and pitch were identified as statistically important factors for analysis.

Note: The odds ratio for long.landing is higher when aircraft make is Boeing.

(Long Landing) Step 4: Visualization

The above factors and their relation with long.landing is shown below. The charts represent various graphical representations of the significant factors - Aircraft, Speed Ground, Speed Air and Pitch.

plot(as.factor(long.landing) ~ speed_ground, data = FAA_b)

plot(as.factor(long.landing) ~ speed_air, data = FAA_b)

plot(as.factor(long.landing) ~ pitch, data = FAA_b)

plot(jitter(long.landing,0.1)~jitter(speed_ground),FAA_b,xlab="Speed Ground",
     ylab="Long Landing",pch=".")

plot(jitter(long.landing,0.1)~jitter(speed_air),FAA_b,xlab="Speed Air",
     ylab="Long Landing",pch=".")

plot(jitter(long.landing,0.1)~jitter(pitch),FAA_b,xlab="Pitch",
     ylab="Long Landing",pch=".")

ggplot(FAA_b,aes(x=aircraft,fill=as.factor(long.landing)))+
  geom_bar(position="dodge")

ggplot(FAA_b,aes(x=speed_ground,fill=as.factor(long.landing)))+
  geom_histogram(position="dodge",binwidth=1)

ggplot(FAA_b,aes(x=speed_air,fill=as.factor(long.landing)))+
  geom_histogram(position="dodge",binwidth=1)

ggplot(FAA_b,aes(x=pitch,fill=as.factor(long.landing)))+
  geom_histogram(position="dodge",binwidth=1)

The significant variables and their relationship with long landing were visualized using a bar chart, scatter plot and histogram.

Based on the above charts and graphs, the following conclusions can be made for each significant variable:

Speed_ground : All the instances of long landing are corresponding to values of speed_ground greater than 90mph
Speed_air : All the instances of long landing are corresponding to values of speed_air greater than 95mph
Aircraft : The ratio long landing / not a long landing is significantly higher for aircraft make Boeing. The absolute number of instances of long landing are also higher for Boeing.
Pitch : There is a slight increase in the number of long landing instances for a higher pitch as compared to a lower pitch value.

(Long Landing) Step 5: Create full model

A full model is created based on output of last two steps after performing a collinearity check for the factors speed_ground and speed_air.

# checking for collinearity
model1 <- glm(long.landing ~ speed_ground, 
              data = FAA_b, family = "binomial")
model2 <- glm(long.landing ~ speed_air, 
              data = FAA_b, family = "binomial")
model3 <- glm(long.landing ~ speed_ground + speed_air, 
              data = FAA_b, family = "binomial")

correlation <- cor(FAA_b$speed_ground, FAA_b$speed_air, use = "complete.obs")

The correlation value between speed_ground and speed_air is 0.9879383.

AIC value for speed_ground: 119.4703208
AIC value for speed_air: 106.3274885
AIC value for speed_ground + speed_air: 108.0278797

The factors speed_groud and speed_air show a very high correlation. Based on the AIC values of the three models, the factor speed_air has the lowest AIC value. However, speed_air has 628 missing values while speed_ground has 0 missing values. Hence, speed_air is removed from further analysis and only speed_ground is included.

The full model is created based on the cleaned dataset using following snippet. Its various metrics are shown below:

full.model.l <- glm(long.landing ~ aircraft + speed_ground + pitch, 
  data = FAA_b, 
  family = "binomial")

s <- summary(full.model.l)
kable(s$coefficients, caption = "Model Coefficients", digits = 3)

Model Coefficients
	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	-67.929	10.484	-6.479	0.000
aircraftboeing	3.043	0.733	4.150	0.000
speed_ground	0.615	0.092	6.694	0.000
pitch	1.066	0.604	1.765	0.078

The variable pitch is not significant for this final model. Aircraft make and speed_ground are the only two variables impacting the response long.landing significantly.

AIC for the model is 89.30904

(Long Landing) Step 6: Forward variable selection using AIC (using `step()` method)

Another model was built using forward stepwise selection with AIC criterion on all the factors.

model0 <- glm(
  long.landing ~ aircraft + duration + no_pasg + speed_ground + speed_air + height + pitch,
  data = FAA_b, 
  family = "binomial")

aic.model.l <- step(model0, trace = 0, direction = "forward")

Coefficients for Forward Variable Selection using AIC
	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	-196.359	56.071	-3.502	0.000
aircraftboeing	8.784	2.623	3.349	0.001
height	0.423	0.143	2.956	0.003
speed_air	1.985	0.708	2.804	0.005
pitch	1.469	1.055	1.392	0.164
no_pasg	-0.074	0.070	-1.050	0.294
speed_ground	-0.226	0.384	-0.587	0.557
duration	0.000	0.010	0.029	0.977

Other parameters of AIC model are:

Null deviance: 270.1991812 on 194 on degrees of freedom
Residual deviance: 32.9094605 on 187 degrees of freedom
AIC: 48.9094605
Number of Fischer scoring iterations: 10
Observations deleted due to missigness: 636

The forward selection model with AIC criterion provided aircraft, speed_air and height as significant factors. height is significant in this model and was not significant in the model built in step 5.
AIC value for this model is 48.9094605 which is much lower than the full model in step 5, suggesting this is a better model.

(Long Landing) Step 7: Forward variable selection using BIC (using `step()` method)

Next, a model was built using forward stepwise selection with BIC criterion on all the factors.

bic.model.l <- step(model0, trace = 0, direction = "forward", criterion = "BIC")

Coefficients for Forward Variable Selection using BIC
	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	-196.359	56.071	-3.502	0.000
aircraftboeing	8.784	2.623	3.349	0.001
height	0.423	0.143	2.956	0.003
speed_air	1.985	0.708	2.804	0.005
pitch	1.469	1.055	1.392	0.164
no_pasg	-0.074	0.070	-1.050	0.294
speed_ground	-0.226	0.384	-0.587	0.557
duration	0.000	0.010	0.029	0.977

Other parameters of BIC model are:

Null deviance: 270.199 on 194 on degrees of freedom
Residual deviance: 32.909 on 187 degrees of freedom
BIC: 48.9094605
Number of Fischer scoring iterations: 10
Observations deleted due to missigness: 636

The forward selection model with BIC criterion provided aircraft, speed_air and height as significant factors. The first two factors are same as the ones obtained in the model in Step 5. However, height is significant in this model and was not significant in the model built in step 5. This is the same model as obtained in step 6
AIC value for this model is 48.9094605 which is much lower than the model in step 5, suggesting this is a better model.

Step 8: Risk factors for long landing and how they impact its occurrence.

Given a limited opportunity to make my case, I would prefer following information and approach:

Model: The model has aircraft, speed_air and height as the significant predictors for the response variable long landing.

Table:

ggplot(FAA_b,aes(x=aircraft,fill=as.factor(long.landing)))+
  geom_bar(position="dodge")

ggplot(FAA_b,aes(x=speed_air,fill=as.factor(long.landing)))+
  geom_histogram(position="dodge",binwidth=1)

## Warning: Removed 628 rows containing non-finite values (stat_bin).

ggplot(FAA_b,aes(x=height,fill=as.factor(long.landing)))+
  geom_histogram(position="dodge",binwidth=1)

Executive summary:

The data is unbalanced and has 103 entries with an instance of long landing out of 831
If the aircraft make is Boeing the chances of long landing are drastically higher than that for the aircraft Airbus
For every 1 mph increase in the value of speed air, the chances of long landing increase by 628%.
For every 1 m increase in the value of height, the chances of long landing increase by 53%

Identify important parameters using `risky.landing`

(Risky Landing) Step 2

Since risky.landing is binary, it will only have value 0 or 1. Their distribution is as follows:

ggplot(FAA_b, aes(x = risky.landing)) + geom_bar()

Distribution of long.landing variable

pct<-round(table(FAA_b$risky.landing)/length(FAA_b$risky.landing)*100,1)
labs<-c("No","Yes")
labs<-paste(labs,pct)
labs<-paste(labs,"%",sep="" )
pie(table(FAA_b$risky.landing),labels=labs,col=rainbow(length(labs)),
    main="Pie chart of whether it was a risky landing")

It is observed that 7.5% of the observations are risky landings. Thus, the data for the response variable risky.landing is unbalanced.

(Risky Landing) Step 3

A single factor regression analysis against response risky.landing is performed for each potential risk factor. The code is:

r_glm_aircraft <- glm(risky.landing ~ aircraft, 
                    data = FAA_b, family = "binomial")
r_glm_duration <- glm(risky.landing ~ duration, 
                    data = FAA_b, family = "binomial")
r_glm_no_pasg <- glm(risky.landing ~ no_pasg, 
                   data = FAA_b, family = "binomial")
r_glm_speed_ground <- glm(risky.landing ~ speed_ground, 
                        data = FAA_b, family = "binomial")

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

r_glm_speed_air <- glm(risky.landing ~ speed_air, 
                     data = FAA_b, family = "binomial")
r_glm_height <- glm(risky.landing ~ height, 
                  data = FAA_b, family = "binomial")
r_glm_pitch <- glm(risky.landing ~ pitch, 
                 data = FAA_b, family = "binomial")

The factors were already scaled in order to assess their impact on the response variable on the basis of the coefficient estimates.

A single factor regression analysis was again performed against response risky.landing for each of the scaled predictors.

r_glm_aircraft_n <- glm(risky.landing ~ aircraft, 
                      data = FAA_b, family = "binomial")
r_glm_duration_n <- glm(risky.landing ~ duration_n, 
                      data = FAA_b, family = "binomial")
r_glm_no_pasg_n <- glm(risky.landing ~ no_pasg_n, 
                     data = FAA_b, family = "binomial")
r_glm_speed_ground_n <- glm(risky.landing ~ speed_ground_n, 
                          data = FAA_b, family = "binomial")

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

r_glm_speed_air_n <- glm(risky.landing ~ speed_air_n, 
                       data = FAA_b, family = "binomial")
r_glm_height_n <- glm(risky.landing ~ height_n, 
                    data = FAA_b, family = "binomial")
r_glm_pitch_n <- glm(risky.landing ~ pitch_n, 
                   data = FAA_b, family = "binomial")

Variable ranked by P-value
	Variable	Abs. Coefficient	Odds Ratio	Direction	P-Value
4	speed_ground	0.6142187	1.848212	positive	0.0000001
5	speed_air	0.8704019	2.387870	positive	0.0000037
1	aircraft	1.0017753	2.723112	positive	0.0004561
7	pitch	0.3710720	1.449287	positive	0.1432961
3	no_pasg	0.0253793	1.025704	negative	0.1536237
2	duration	0.0011518	1.001152	negative	0.6801987
6	height	0.0022186	1.002221	negative	0.8705917

## [1] "--"

Variable ranked by P-value (Scaled)
	Variable	Abs. Coefficient	Odds Ratio	Direction	P-Value
4	speed_ground	11.5078031	99489.071286	positive	0.0000001
5	speed_air	8.4744743	4790.903747	positive	0.0000037
1	aircraft	1.0017753	2.723112	positive	0.0004561
7	pitch	0.1953950	1.215791	positive	0.1432961
3	no_pasg	0.1901247	1.209400	negative	0.1536237
2	duration	0.0556912	1.057271	negative	0.6801987
6	height	0.0217086	1.021946	negative	0.8705917

The factors are arranged in order of their importance based on the p-values. The p-value should be less than 0.05 for a 95% level of confidence used while testing the null hypothesis that the factor does not impact the response variable. The smaller the p-value, the more significant the factor. Based on these criteria speed_ground, speed_air and aircraft were identified as statistically important factors for analysis.

Note: The odds ratio for risky.landing is higher when aircraft make is Boeing.

(Risky Landing) Step 4: Visualization

The above factors and their relation with risky.landing is shown below. The charts represent various graphical representations of the significant factors - Aircraft, Ground Speed, Air Speed

plot(as.factor(risky.landing) ~ speed_ground, data = FAA_b)

plot(as.factor(risky.landing) ~ speed_air, data = FAA_b)

plot(jitter(risky.landing,0.1)~jitter(speed_ground),FAA_b,xlab="Speed Ground",
     ylab="Risky Landing",pch=".")

plot(jitter(risky.landing,0.1)~jitter(speed_air),FAA_b,xlab="Speed Air",
     ylab="Risky Landing",pch=".")

ggplot(FAA_b,aes(x=aircraft,fill=as.factor(risky.landing)))+
  geom_bar(position="dodge")

ggplot(FAA_b,aes(x=speed_ground,fill=as.factor(risky.landing)))+
  geom_histogram(position="dodge",binwidth=1)

ggplot(FAA_b,aes(x=speed_air,fill=as.factor(risky.landing)))+
  geom_histogram(position="dodge",binwidth=1)

## Warning: Removed 628 rows containing non-finite values (stat_bin).

The significant variables and their relationship with risky landing were visualized using a bar chart, scatter plot and histogram.

Based on the above charts and graphs, the following conclusions can be made for each significant variable:

Speed_ground : All the instances of risky landing are corresponding to values of speed_ground greater that approximately 100mph
Speed_air : All the instances of risky landing are corresponding to values of speed_air greater than 100mph
Aircraft : The ratio long landing / not a risky landing is significantly higher for aircraft make Boeing. The absolute number of instances of long landing are also higher for Boeing.

(Risky Landing) Step 5: Create full model

A full model is created based on output of last two steps after performing a collinearity check for the factors speed_ground and speed_air.

# checking for collinearity
r_model1 <- glm(risky.landing ~ speed_ground, 
              data = FAA_b, family = "binomial")
r_model2 <- glm(risky.landing ~ speed_air, 
              data = FAA_b, family = "binomial")
r_model3 <- glm(risky.landing ~ speed_ground + speed_air,
              data = FAA_b, family = "binomial")

correlation <- cor(FAA_b$speed_ground, FAA_b$speed_air, use = "complete.obs")

The correlation value between speed_ground and speed_air is 0.9879383.

AIC value for speed_ground: 62.9306044
AIC value for speed_air: 48.5802507
AIC value for speed_ground + speed_air: 49.9528945

The factors speed_ground and speed_air show a very high correlation. Based on the AIC values of the three models, the factor speed_air has the lowest AIC value. However, speed_air has 628 missing values while speed_ground has 0 missing values. Hence, speed_air is removed from further analysis and only speed_ground is included.

The full model is created based on the cleaned dataset using following snippet:

full.model.r <- glm(risky.landing ~ aircraft + speed_ground, 
  data = FAA_b, 
  family = "binomial")

Model Coefficients
	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	-102.077	24.775	-4.120	0.000
aircraftboeing	4.019	1.249	3.217	0.001
speed_ground	0.926	0.225	4.121	0.000

Aircraft make and ground speed are impacting the response risky.landing significantly.

AIC of model is 46.0965964

(Risky Landing) Step 6: Forward variable selection using AIC (using `step()` method)

The model is creating using following snippet:

model0_r <- glm(
  risky.landing ~ aircraft + duration + no_pasg + speed_ground + speed_air + height + pitch,
  data = FAA_b, 
  family = "binomial")

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

aic.model.r <- step(model0_r, trace = 0, direction = "forward")

Coefficients for Forward Variable Selection using AIC
	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	-149.419	48.425	-3.086	0.002
speed_air	1.617	0.654	2.472	0.013
aircraftboeing	7.330	3.002	2.442	0.015
no_pasg	-0.120	0.096	-1.253	0.210
pitch	-1.316	1.430	-0.920	0.357
height	0.045	0.058	0.786	0.432
speed_ground	-0.164	0.498	-0.328	0.743
duration	0.002	0.016	0.125	0.901

Other parameters of AIC model are:

Null deviance: 240.7242902 on 194 on degrees of freedom
Residual deviance: 22.1439647 on 187 degrees of freedom
AIC: 38.1439647
Number of Fischer scoring iterations: 10
Observations deleted due to missigness: 636

The forward selection model with AIC criterion also provided aircraft and speed_air as significant factors. These are same as the variables selected in step 5
AIC value for this model is 38.1439647 which is much lower than the model in step 5 (46.0965964), suggesting this is a better model.

(Risky Landing) Step 7: Forward variable selection using BIC (using `step()` method)

bic.model.r <- step(model0_r, trace = 0, direction = "forward", criterion = "BIC")

Coefficients for Forward Variable Selection using BIC
	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	-149.419	48.425	-3.086	0.002
speed_air	1.617	0.654	2.472	0.013
aircraftboeing	7.330	3.002	2.442	0.015
no_pasg	-0.120	0.096	-1.253	0.210
pitch	-1.316	1.430	-0.920	0.357
height	0.045	0.058	0.786	0.432
speed_ground	-0.164	0.498	-0.328	0.743
duration	0.002	0.016	0.125	0.901

Other parameters of BIC model are:

Null deviance: 240.724 on 194 on degrees of freedom
Residual deviance: 22.144 on 187 degrees of freedom
AIC: 38.1439647
Number of Fischer scoring iterations: 10
Observations deleted due to missigness: 636

The forward selection model with BIC criterion also provides the same model as in step 6. Aircraft and speed_air are significant factors. These are same as the variables selected in step 5. AIC value for this model is 38.1439647 which is much lower than the model in step 5 (46.0965964), suggesting this is a better model.

In the full model, aircraft and speed_ground was used. As per AIC as well as BIC, aircraft and speed_air are significant factors.

(Risky Landing) Step 10: Risk factors for risky landing and how they impact its occurrence.**

Given a limited opportunity to make my case, I would prefer following information and approach:

Model: The model has aircraft and speed_air as the significant predictors for the response variable.

Table:

ggplot(FAA_b,aes(x=aircraft,fill=as.factor(risky.landing)))+
  geom_bar(position="dodge")

ggplot(FAA_b,aes(x=speed_air,fill=as.factor(risky.landing)))+
  geom_histogram(position="dodge",binwidth=1)

## Warning: Removed 628 rows containing non-finite values (stat_bin).

Executive summary:

The data is unbalanced and has 61 entries with an instance of long landing out of 831
If the aircraft make is Boeing the chances of long landing are higher than that for the aircraft Airbus
For every 1 mph increase in the value of speed air, the chances of long landing increase by 240%.

Compare the two models built for `long.landing` and `risky.landing`

Step 11: Summarize difference between two models

The model for long landing has three significant predictors aircraft, speed_air and height whereas the model for risky landing has two significant predictors aircraft and speed_air. The variable height loses its significance in predicting risky landing.
The final model for long landing is based on stepwise selection with AIC criterion whereas that of risky landing is the full model.
The odds ratio for aircraft is much higher for long landing than risky landing

Step 12: Plot the ROC curve for each model (sensitivity vs 1-specificity)

thresh <- seq(0.01,0.5,0.01)
predprob_l <- predict(aic.model.l, type = "response")
predprob_r <- predict(aic.model.r, type = "response")
long.landing_l <- long.landing[!is.na(FAA_b$speed_air) & !is.na(FAA_b$duration)]

sensitivity <- specificity<-rep(NA,length(thresh))

for(j in seq(along=thresh)) {
  pp<-ifelse(predprob_l < thresh[j], "no", "yes")
  xx<-xtabs(~long.landing_l + pp, FAA_b)
  specificity[j] <- xx[1,1]/(xx[1,1] + xx[1,2])
  sensitivity[j] <- xx[2,2]/(xx[2,1] + xx[2,2])
}

par(mfrow= c(1,2))
matplot(thresh,cbind(sensitivity, specificity), type="l", xlab="Threshold", ylab="Proportion", lty=1:2)
plot(1-specificity, sensitivity, type="l");abline(0, 1, lty=2)

thresh <- seq(0.01, 0.5, 0.01)
predprob_l <- predict(aic.model.l, type = "response")
predprob_r <- predict(aic.model.r, type = "response")
risky.landing_r <- risky.landing[!is.na(FAA_b$speed_air) & !is.na(FAA_b$duration)]

sensitivity.r <- specificity.r <- rep(NA,length(thresh))

for(j in seq(along=thresh)) {
  pp <- ifelse(predprob_r < thresh[j], "no", "yes")
  xx <- xtabs(~risky.landing_r + pp, FAA_b)
  specificity.r[j] <- xx[1, 1]/(xx[1, 1] + xx[1, 2])
  sensitivity.r[j] <- xx[2, 2]/(xx[2, 1] + xx[2, 2])
}
par(mfrow=c(1,2))
matplot(thresh,cbind(sensitivity.r,specificity.r),type="l",xlab="Threshold",ylab="Proportion",lty=1:2)
plot(1-specificity.r,sensitivity.r,type="l");abline(0,1,lty=2)

plot(1-specificity,sensitivity, type="l", col="blue")
points(1-specificity.r,sensitivity.r,type="l",col="red")
lines(1-specificity.r,sensitivity.r, col="red",lty=2)

auc(long.landing_l, predprob_l)

## Area under the curve: 0.996

auc(risky.landing_r, predprob_r)

## Area under the curve: 0.9973

The ROC curves for the two final models for long landing and risky landing were observed. The plots indicate the trade-off between sensitivity and specificity i.e. when sensitivity is increased, specificity decreases and vice versa.The more area under the ROC curve, the better the predictive power of the model.We have a very high AUC value for both long landing and risky landing, however, the AUC for risky landing is slightly higher than that for long landing model.

Step 13: for the given set of parameters, predict probability of being a long landing or risky landing**

Parameters:

Aircraft	Boeing
duration	200
no_pasg	80
speed_ground	115
speed_air	120
height	40
pitch	4

info <- data.frame(aircraft = character(), duration = numeric(), no_pasg = numeric(), speed_ground = numeric(), speed_air = numeric(), height = numeric(), pitch = numeric(), stringsAsFactors = FALSE)

info <- rbind(
  info, 
  list("boeing", 200, 80, 115, 120, 40, 4))
colnames(info) <- c("aircraft", "duration", "no_pasg", "speed_ground", "speed_air", "height", "pitch")


pred_l <- predict(aic.model.l, info, type = "response", se.fit = T) 
pred_r <- predict(full.model.r, info, type = "response", se.fit = T)

paste("Predict value for long landing = ", pred_l$fit[["1"]])

## [1] "Predict value for long landing =  1"

long.confidence <- c(pred_l$fit-2*pred_l$se.fit[1],pred_l$fit+2*pred_l$se.fit[1])

paste("predict value for risky landing = ", pred_r$fit[["1"]])

## [1] "predict value for risky landing =  0.999788977010519"

c(pred_r$fit-2*pred_r$se.fit[1],pred_r$fit+2*pred_r$se.fit[1])

##         1         1 
## 0.9989074 1.0006706

The probability of long and risky landing was predicted for the given values of the variables. Both the probabilities were very high and estimated to be 1. The 95% confidence interval for long landing is very very narrow and it only includes the single point estimate 1. The 95% confidence interval for risky landing is also very narrow ranging from 0.9999997 to 1.0000002 i.e from 0.9999997 to 1 as probability cannot exceed 1.

Compare models with different link functions

Step 14

The response variable risky.landing was predicted using logit, probit and cloglog links. The results are compared below.

model.logit.r <- glm(
  risky.landing ~ aircraft + speed_air, 
  data = FAA_b, 
  family = binomial(link = logit))

model.probit.r <- glm(
  risky.landing ~ aircraft + speed_air, 
  data = FAA_b, 
  family = binomial(link = probit))

model.cloglog.r <- glm(
  risky.landing ~ aircraft + speed_air, 
  data = FAA_b, 
  family = binomial(link = cloglog))

logit.summary <- summary(model.logit.r)
probit.summary <- summary(model.probit.r)
cloglog.summary <- summary(model.cloglog.r)

Deviance Residuals

Logit Link

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -1.6729040 -0.0081175 -0.0006824 -0.0024089  0.0000503  2.4773894

Probit Link

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -1.6957399 -0.0001181  0.0000000 -0.0056729  0.0000000  2.3889898

CLog-Log Link

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -1.529384 -0.020475 -0.002957 -0.022744  0.000000  2.470875

Coefficients

Coefficients for Logit Link
	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	-134.0859	33.3811	-4.0168	0.0001
aircraftboeing	4.5648	1.5081	3.0268	0.0025
speed_air	1.2240	0.3052	4.0101	0.0001

Coefficients for Probit Link
	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	-74.2249	16.7591	-4.4289	0e+00
aircraftboeing	2.6447	0.7916	3.3409	8e-04
speed_air	0.6773	0.1533	4.4182	0e+00

Coefficients for CLog-Log Link
	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	-103.6808	25.5564	-4.0569	0.0000
aircraftboeing	3.2606	0.9964	3.2725	0.0011
speed_air	0.9423	0.2331	4.0427	0.0001

Other info

Comparison between various link functions
	Null deviance	Residual deviance	AIC	No. of Fischer scoring iterations
Logit Link	248.18 on 202 df	26.296 on 200 df	32	9
Probit Link	248.18 on 202 df	26.138 on 200 df	32	11
CLog-Log Link	248.18 on 202 df	24.363 on 200 df	30	12

The model for risky landing was built using logit, probit and complementary log-log link. The following conclusions were made on comparing the three models:

All the three models are very similar to each other in terms of Null deviance
The residual deviance is slightly better for clogog model.
The cloglog model has the lowest AIC value
Logit model has the highest coefficient values and Probit model has the lowest
cloglog is a sligtly better linkage parameter for modeling risky landing.

Step 15

pred_logit <- predict(model.logit.r, type = "response")
pred_probit <- predict(model.probit.r, type = "response")
pred_cloglog <- predict(model.cloglog.r, type = "response")

ROC_logit <- roc(FAA_b$risky.landing[!is.na(FAA_b$speed_air)], pred_logit)
ROC_probit <- roc(FAA_b$risky.landing[!is.na(FAA_b$speed_air)], pred_probit)
ROC_cloglog <- roc(FAA_b$risky.landing[!is.na(FAA_b$speed_air)], pred_cloglog)

plot(ROC_logit)
lines(ROC_probit, col= 'blue')
lines(ROC_cloglog, col= "red")

auc(FAA_b$risky.landing[!is.na(FAA_b$speed_air)], pred_logit)

## Area under the curve: 0.9964

auc(FAA_b$risky.landing[!is.na(FAA_b$speed_air)], pred_probit)

## Area under the curve: 0.9963

auc(FAA_b$risky.landing[!is.na(FAA_b$speed_air)], pred_cloglog)

## Area under the curve: 0.9964

All the three ROC curves are very closely overlapping each other. As it is difficult to compare the ROC curves, the AUC values were used to compare the models. Based on the AUC values, logit and cloglog model have the same area under the curve which is very slightly higher than that of the probit model.

Step 16

pred_logit.index <- sort(as.numeric(names(tail(sort(pred_logit), 5))))
pred_logit.sort <- FAA_b[pred_logit.index,1:7]
kable(pred_logit.sort, caption = "Predict Full Model", digits = 4)

Predict Full Model
	aircraft	duration	no_pasg	speed_ground	speed_air	height	pitch
64	boeing	161.8925	72	129.2649	128.4177	33.9490	4.1400
176	boeing	197.5464	68	126.6692	127.9641	23.7642	2.9932
307	boeing	154.5246	67	129.3072	127.5933	23.9785	5.1547
362	boeing	63.3295	52	132.7847	132.9115	18.1770	4.1107
408	airbus	131.7311	60	131.0352	131.3379	28.2780	3.6602

pred_probit.index <- sort(as.numeric(names(tail(sort(pred_probit), 5))))
pred_probit.sort <- FAA_b[pred_probit.index,1:7]
kable(pred_probit.sort, caption = "Probit Model", digits = 4)

Probit Model
	aircraft	duration	no_pasg	speed_ground	speed_air	height	pitch
387	boeing	153.8345	61	126.8393	126.1186	20.5478	4.3346
408	airbus	131.7311	60	131.0352	131.3379	28.2780	3.6602
643	airbus	137.5857	66	126.2443	127.9371	35.1757	2.7019
751	airbus	175.5144	49	125.2123	125.1385	22.5248	4.3658
769	airbus	98.5003	66	123.3105	124.3908	22.3272	4.2767

pred_cloglog.index <- sort(as.numeric(names(tail(sort(pred_cloglog), 5))))
pred_cloglog.sort <- FAA_b[pred_cloglog.index,1:7]
kable(pred_cloglog.sort, caption = "cloglog Model", digits = 4)

cloglog Model
	aircraft	duration	no_pasg	speed_ground	speed_air	height	pitch
669	airbus	140.4531	75	120.4189	118.4847	31.2634	2.7967
751	airbus	175.5144	49	125.2123	125.1385	22.5248	4.3658
765	airbus	220.0571	61	120.5579	118.2882	15.6657	4.1113
769	airbus	98.5003	66	123.3105	124.3908	22.3272	4.2767
821	airbus	NA	58	113.4274	114.8444	40.1011	4.2484

all.3.df <- cbind(pred_logit.index, pred_probit.index, pred_cloglog.index);
colnames(all.3.df) <- c("pred_logit", "pred_probit", "pred_cloglog")
kable(all.3.df, caption="Top 5 predictions")

Top 5 predictions
pred_logit	pred_probit	pred_cloglog
64	387	669
176	408	751
307	643	765
362	751	769
408	769	821

pred_probit and pred_cloglog share two values in top 5 (Index 751 and 769). Apart from these two, there are no common values across the 3 columns.

Step 17

model.logit.l <- glm(
    long.landing ~ aircraft + speed_air + height, 
    data = FAA_b, 
    family = binomial(link = "logit"))
model.probit.l <- glm(
    long.landing ~ aircraft + speed_air + height, 
    data = FAA_b, 
    family = binomial(link = "probit"))
model.cloglog.l <- glm(
    long.landing ~ aircraft + speed_air + height, 
    data = FAA_b, 
    family = binomial(link = "cloglog"))

pred_l_logit <- predict(model.logit.l, info, type = "response", se.fit = T) 
pred_l_probit <- predict(model.probit.l, info, type = "response", se.fit = T)
pred_l_cloglog <- predict(model.cloglog.l, info, type = "response", se.fit = T) 

pred_r_logit <- predict(full.model.r, info, type = "response", se.fit = T)
pred_r_probit <- predict(model.probit.r, info, type = "response", se.fit = T)
pred_r_cloglog <- predict(model.cloglog.r, info, type = "response", se.fit = T)

pred_l <- list(pred_l_logit, pred_l_probit, pred_l_cloglog)
pred_r <- list(pred_r_logit, pred_r_probit, pred_r_cloglog)

Estimate for Long Landing
	Estimated Probability	CI Lower Limit	CI Upper Limit
Logit Link	1	1	1
Probit Link	1	1	1
CLog-Log Link	1	1	1

Estimate for Risky Landing
	Estimated Probability	CI Lower Limit	CI Upper Limit
Logit Link	0.999789	0.9989074	1.000671
Probit Link	1.000000	1.0000000	1.000000
CLog-Log Link	1.000000	1.0000000	1.000000

The predicted probablity for both long landing and risky landing for logit, probit and cloglog link models is 1. These models predict very narrow confidence intervals. The 95% confidence interval for long landing for all three linkges is very very narrow and it only includes the single point estimate 1. The 95% confidence interval for risky landing is also very narrow ranging from 0.9989074 to 1.000671 i.e from 0.9989074 to 1 as probability cannot exceed 1 for logit link and includes only the point estimate for probit and cloglog link. Thus, it can be concluded that for the given values of the variables the landing will be long and risky as seen from the results of all the models.

Flight Landing Distance Analysis

Shruti A. Jibhakate

February 18, 2019

Part 1: Data cleaning

Part 2: Model binary response using logistic regression

Create binary response

Step 1: Attach binary variables

Identify important parameters for predicting `long.landing` binary variable

(Long Landing) Step 2

(Long Landing) Step 3

(Long Landing) Step 4: Visualization

(Long Landing) Step 5: Create full model

(Long Landing) Step 6: Forward variable selection using AIC (using `step()` method)

(Long Landing) Step 7: Forward variable selection using BIC (using `step()` method)

Step 8: Risk factors for long landing and how they impact its occurrence.

Identify important parameters using `risky.landing`

(Risky Landing) Step 2

(Risky Landing) Step 3

(Risky Landing) Step 4: Visualization

(Risky Landing) Step 5: Create full model

(Risky Landing) Step 6: Forward variable selection using AIC (using `step()` method)

(Risky Landing) Step 7: Forward variable selection using BIC (using `step()` method)

(Risky Landing) Step 10: Risk factors for risky landing and how they impact its occurrence.**

Compare the two models built for `long.landing` and `risky.landing`

Step 11: Summarize difference between two models

Step 12: Plot the ROC curve for each model (sensitivity vs 1-specificity)

Step 13: for the given set of parameters, predict probability of being a long landing or risky landing**

Compare models with different link functions

Step 14

Step 15

Step 16

Step 17

Flight Landing Distance Analysis

Shruti A. Jibhakate

February 18, 2019

Part 1: Data cleaning

Part 2: Model binary response using logistic regression

Create binary response

Step 1: Attach binary variables

Identify important parameters for predicting long.landing binary variable

(Long Landing) Step 2

(Long Landing) Step 3

(Long Landing) Step 4: Visualization

(Long Landing) Step 5: Create full model

(Long Landing) Step 6: Forward variable selection using AIC (using step() method)

(Long Landing) Step 7: Forward variable selection using BIC (using step() method)

Step 8: Risk factors for long landing and how they impact its occurrence.

Identify important parameters using risky.landing

(Risky Landing) Step 2

(Risky Landing) Step 3

(Risky Landing) Step 4: Visualization

(Risky Landing) Step 5: Create full model

(Risky Landing) Step 6: Forward variable selection using AIC (using step() method)

(Risky Landing) Step 7: Forward variable selection using BIC (using step() method)

(Risky Landing) Step 10: Risk factors for risky landing and how they impact its occurrence.**

Compare the two models built for long.landing and risky.landing

Step 11: Summarize difference between two models

Step 12: Plot the ROC curve for each model (sensitivity vs 1-specificity)

Step 13: for the given set of parameters, predict probability of being a long landing or risky landing**

Compare models with different link functions

Step 14

Step 15

Step 16

Step 17

Identify important parameters for predicting `long.landing` binary variable

(Long Landing) Step 6: Forward variable selection using AIC (using `step()` method)

(Long Landing) Step 7: Forward variable selection using BIC (using `step()` method)

Identify important parameters using `risky.landing`

(Risky Landing) Step 6: Forward variable selection using AIC (using `step()` method)

(Risky Landing) Step 7: Forward variable selection using BIC (using `step()` method)

Compare the two models built for `long.landing` and `risky.landing`