The dataset was loaded and cleaned as part of previous assignment. The cleaned data is stored in FAA_c data frame. A minimal version of the code is as follows:
# Step 1
FAA1 <- read_excel("FAA1.xls", sheet = "FAA1")
FAA2 <- read_excel("FAA2.xls", sheet = "FAA2")
# Step 3
FAA_COMBO <- union_all(FAA1, FAA2)
FAA <- FAA_COMBO[!duplicated(FAA_COMBO[, c(1,3,4,5,6,7,8)]),]
# Step 6
### filtering abnormal values of duration
FAA_c <- FAA[(FAA$duration > 40 | is.na(FAA$duration)),]
### filtering abnormal values of speed_ground
FAA_c <- FAA_c[(FAA_c$speed_ground > 30 & FAA_c$speed_ground < 140),]
### filtering abnormal values of speed_air
FAA_c <- FAA_c[((FAA_c$speed_air > 30 & FAA_c$speed_air < 140) | is.na(FAA_c$speed_air)),]
### filtering abnormal values of height
FAA_c <- FAA_c[FAA_c$height >= 6,]
### filtering abnormal values of distance
FAA_c <- FAA_c[FAA_c$distance < 6000,]
The cleaned dataset is stored as FAA_c. It has 831 observations and 8 variables.
This section utilizes the cleaned dataset generated in Part 1. Two binary variables long.landing and risky.landing are created and added to FAA_c as per following rule.
\[long.landing = 1\;if\;distance > 2500; =0\;otherwise\]
\[risky.landing = 1\;if\;distance > 3000; =0\;otherwise\]
long.landing <- ifelse(FAA_c$distance > 2500, 1, 0)
risky.landing <- ifelse(FAA_c$distance > 3000, 1, 0)
FAA_b <- cbind(FAA_c[-8], long.landing, risky.landing)
The continuous distance field is not used and instead the above mentioned binary fields are used.
There are 103 instances of long landing and 61 instances of risky landing.
long.landing binary variableSince long.landing is binary, it will only have value 0 or 1. Their distribution is as follows:
ggplot(FAA_b, aes(x = long.landing)) + geom_bar()
Distribution of long.landing variable
pct<-round(table(FAA_b$long.landing)/length(FAA_b$long.landing)*100,1)
labs<-c("No","Yes")
labs<-paste(labs,pct)
labs<-paste(labs,"%",sep="" )
pie(table(FAA_b$long.landing),labels=labs,col=rainbow(length(labs)),
main="Pie chart of whether it was a long landing")
It is observed that 12.4% of the observations are long landings. Thus, the data for the response variable
long.landing is unbalanced.
A single factor regression analysis against response long.landing is performed for each potential risk factor. The code is:
glm_aircraft <- glm(long.landing ~ aircraft,
data = FAA_b, family = "binomial")
glm_duration <- glm(long.landing ~ duration,
data = FAA_b, family = "binomial")
glm_no_pasg <- glm(long.landing ~ no_pasg,
data = FAA_b, family = "binomial")
glm_speed_ground <- glm(long.landing ~ speed_ground,
data = FAA_b, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
glm_speed_air <- glm(long.landing ~ speed_air,
data = FAA_b, family = "binomial")
glm_height <- glm(long.landing ~ height,
data = FAA_b, family = "binomial")
glm_pitch <- glm(long.landing ~ pitch,
data = FAA_b, family = "binomial")
The factors are scaled in order to assess their impact on the response variable on the basis of the coefficient estimates.
FAA_b$duration_n <-
(FAA_b$duration - mean(FAA_b$duration, na.rm = T))/sd(FAA_b$duration, na.rm = T)
FAA_b$no_pasg_n <-
(FAA_b$no_pasg - mean(FAA_b$no_pasg, na.rm = T))/sd(FAA_b$no_pasg, na.rm = T)
FAA_b$speed_ground_n <-
(FAA_b$speed_ground - mean(FAA_b$speed_ground, na.rm = T))/sd(FAA_b$speed_ground, na.rm = T)
FAA_b$speed_air_n <-
(FAA_b$speed_air - mean(FAA_b$speed_air, na.rm = T))/sd(FAA_b$speed_air, na.rm = T)
FAA_b$height_n <-
(FAA_b$height - mean(FAA_b$height, na.rm = T))/sd(FAA_b$height, na.rm = T)
FAA_b$pitch_n <-
(FAA_b$pitch - mean(FAA_b$pitch, na.rm = T))/sd(FAA_b$pitch, na.rm = T)
A single factor regression analysis was again performed against response long.landing for each of the scaled predictors. The code is:
glm_aircraft_n <- glm(long.landing ~ aircraft,
data = FAA_b, family = "binomial")
glm_duration_n <- glm(long.landing ~ duration_n,
data = FAA_b, family = "binomial")
glm_no_pasg_n <- glm(long.landing ~ no_pasg_n,
data = FAA_b, family = "binomial")
glm_speed_ground_n <- glm(long.landing ~ speed_ground_n,
data = FAA_b, family = "binomial")
glm_speed_air_n <- glm(long.landing ~ speed_air_n,
data = FAA_b, family = "binomial")
glm_height_n <- glm(long.landing ~ height_n,
data = FAA_b, family = "binomial")
glm_pitch_n <- glm(long.landing ~ pitch_n,
data = FAA_b, family = "binomial")
| Variable | Abs. Coefficient | Odds Ratio | Direction | P-Value | |
|---|---|---|---|---|---|
| 4 | speed_ground | 0.4723458 | 1.603752 | positive | 0.0000000 |
| 5 | speed_air | 0.5123218 | 1.669162 | positive | 0.0000000 |
| 1 | aircraft | 0.8641199 | 2.372917 | positive | 0.0000840 |
| 7 | pitch | 0.4005278 | 1.492612 | positive | 0.0466498 |
| 6 | height | 0.0086240 | 1.008661 | positive | 0.4218576 |
| 3 | no_pasg | 0.0072564 | 1.007283 | negative | 0.6058565 |
| 2 | duration | 0.0010705 | 1.001071 | negative | 0.6305122 |
## [1] "--"
| Variable | Abs. Coefficient | Odds Ratio | Direction | P-Value | |
|---|---|---|---|---|---|
| 4 | speed_ground | 8.8497167 | 6972.413438 | positive | 0.0000000 |
| 5 | speed_air | 4.9881068 | 146.658509 | positive | 0.0000000 |
| 1 | aircraft | 0.8641199 | 2.372917 | positive | 0.0000840 |
| 7 | pitch | 0.2109056 | 1.234796 | positive | 0.0466498 |
| 6 | height | 0.0843842 | 1.088047 | positive | 0.4218576 |
| 3 | no_pasg | 0.0543600 | 1.055865 | negative | 0.6058565 |
| 2 | duration | 0.0517582 | 1.053121 | negative | 0.6305122 |
The factors are arranged in order of their importance based on the p-values. The p-value should be less than 0.05 for a 95% level of confidence used while testing the null hypothesis that the factor does not impact the response variable. The smaller the p-value, the more significant the factor.
Based on these criteria speed_ground, speed_air, aircraft and pitch were identified as statistically important factors for analysis.
Note: The odds ratio for long.landing is higher when aircraft make is Boeing.
The above factors and their relation with long.landing is shown below. The charts represent various graphical representations of the significant factors - Aircraft, Speed Ground, Speed Air and Pitch.
plot(as.factor(long.landing) ~ speed_ground, data = FAA_b)
plot(as.factor(long.landing) ~ speed_air, data = FAA_b)
plot(as.factor(long.landing) ~ pitch, data = FAA_b)
plot(jitter(long.landing,0.1)~jitter(speed_ground),FAA_b,xlab="Speed Ground",
ylab="Long Landing",pch=".")
plot(jitter(long.landing,0.1)~jitter(speed_air),FAA_b,xlab="Speed Air",
ylab="Long Landing",pch=".")
plot(jitter(long.landing,0.1)~jitter(pitch),FAA_b,xlab="Pitch",
ylab="Long Landing",pch=".")
ggplot(FAA_b,aes(x=aircraft,fill=as.factor(long.landing)))+
geom_bar(position="dodge")
ggplot(FAA_b,aes(x=speed_ground,fill=as.factor(long.landing)))+
geom_histogram(position="dodge",binwidth=1)
ggplot(FAA_b,aes(x=speed_air,fill=as.factor(long.landing)))+
geom_histogram(position="dodge",binwidth=1)
ggplot(FAA_b,aes(x=pitch,fill=as.factor(long.landing)))+
geom_histogram(position="dodge",binwidth=1)
The significant variables and their relationship with long landing were visualized using a bar chart, scatter plot and histogram.
Based on the above charts and graphs, the following conclusions can be made for each significant variable:
A full model is created based on output of last two steps after performing a collinearity check for the factors speed_ground and speed_air.
# checking for collinearity
model1 <- glm(long.landing ~ speed_ground,
data = FAA_b, family = "binomial")
model2 <- glm(long.landing ~ speed_air,
data = FAA_b, family = "binomial")
model3 <- glm(long.landing ~ speed_ground + speed_air,
data = FAA_b, family = "binomial")
correlation <- cor(FAA_b$speed_ground, FAA_b$speed_air, use = "complete.obs")
The correlation value between
speed_groundandspeed_airis 0.9879383.
speed_ground: 119.4703208speed_air: 106.3274885speed_ground + speed_air: 108.0278797The factors speed_groud and speed_air show a very high correlation. Based on the AIC values of the three models, the factor speed_air has the lowest AIC value. However, speed_air has 628 missing values while speed_ground has 0 missing values. Hence, speed_air is removed from further analysis and only speed_ground is included.
The full model is created based on the cleaned dataset using following snippet. Its various metrics are shown below:
full.model.l <- glm(long.landing ~ aircraft + speed_ground + pitch,
data = FAA_b,
family = "binomial")
s <- summary(full.model.l)
kable(s$coefficients, caption = "Model Coefficients", digits = 3)
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -67.929 | 10.484 | -6.479 | 0.000 |
| aircraftboeing | 3.043 | 0.733 | 4.150 | 0.000 |
| speed_ground | 0.615 | 0.092 | 6.694 | 0.000 |
| pitch | 1.066 | 0.604 | 1.765 | 0.078 |
The variable pitch is not significant for this final model. Aircraft make and speed_ground are the only two variables impacting the response long.landing significantly.
AIC for the model is 89.30904
step() method)Another model was built using forward stepwise selection with AIC criterion on all the factors.
model0 <- glm(
long.landing ~ aircraft + duration + no_pasg + speed_ground + speed_air + height + pitch,
data = FAA_b,
family = "binomial")
aic.model.l <- step(model0, trace = 0, direction = "forward")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -196.359 | 56.071 | -3.502 | 0.000 |
| aircraftboeing | 8.784 | 2.623 | 3.349 | 0.001 |
| height | 0.423 | 0.143 | 2.956 | 0.003 |
| speed_air | 1.985 | 0.708 | 2.804 | 0.005 |
| pitch | 1.469 | 1.055 | 1.392 | 0.164 |
| no_pasg | -0.074 | 0.070 | -1.050 | 0.294 |
| speed_ground | -0.226 | 0.384 | -0.587 | 0.557 |
| duration | 0.000 | 0.010 | 0.029 | 0.977 |
Other parameters of AIC model are:
The forward selection model with AIC criterion provided aircraft, speed_air and height as significant factors. height is significant in this model and was not significant in the model built in step 5.
AIC value for this model is 48.9094605 which is much lower than the full model in step 5, suggesting this is a better model.
step() method)Next, a model was built using forward stepwise selection with BIC criterion on all the factors.
bic.model.l <- step(model0, trace = 0, direction = "forward", criterion = "BIC")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -196.359 | 56.071 | -3.502 | 0.000 |
| aircraftboeing | 8.784 | 2.623 | 3.349 | 0.001 |
| height | 0.423 | 0.143 | 2.956 | 0.003 |
| speed_air | 1.985 | 0.708 | 2.804 | 0.005 |
| pitch | 1.469 | 1.055 | 1.392 | 0.164 |
| no_pasg | -0.074 | 0.070 | -1.050 | 0.294 |
| speed_ground | -0.226 | 0.384 | -0.587 | 0.557 |
| duration | 0.000 | 0.010 | 0.029 | 0.977 |
Other parameters of BIC model are:
The forward selection model with BIC criterion provided aircraft, speed_air and height as significant factors. The first two factors are same as the ones obtained in the model in Step 5. However, height is significant in this model and was not significant in the model built in step 5. This is the same model as obtained in step 6
AIC value for this model is 48.9094605 which is much lower than the model in step 5, suggesting this is a better model.
Given a limited opportunity to make my case, I would prefer following information and approach:
Model: The model has aircraft, speed_air and height as the significant predictors for the response variable long landing.
Table:
ggplot(FAA_b,aes(x=aircraft,fill=as.factor(long.landing)))+
geom_bar(position="dodge")
ggplot(FAA_b,aes(x=speed_air,fill=as.factor(long.landing)))+
geom_histogram(position="dodge",binwidth=1)
## Warning: Removed 628 rows containing non-finite values (stat_bin).
ggplot(FAA_b,aes(x=height,fill=as.factor(long.landing)))+
geom_histogram(position="dodge",binwidth=1)
Executive summary:
risky.landingSince risky.landing is binary, it will only have value 0 or 1. Their distribution is as follows:
ggplot(FAA_b, aes(x = risky.landing)) + geom_bar()
Distribution of long.landing variable
pct<-round(table(FAA_b$risky.landing)/length(FAA_b$risky.landing)*100,1)
labs<-c("No","Yes")
labs<-paste(labs,pct)
labs<-paste(labs,"%",sep="" )
pie(table(FAA_b$risky.landing),labels=labs,col=rainbow(length(labs)),
main="Pie chart of whether it was a risky landing")
It is observed that 7.5% of the observations are risky landings. Thus, the data for the response variable
risky.landing is unbalanced.
A single factor regression analysis against response risky.landing is performed for each potential risk factor. The code is:
r_glm_aircraft <- glm(risky.landing ~ aircraft,
data = FAA_b, family = "binomial")
r_glm_duration <- glm(risky.landing ~ duration,
data = FAA_b, family = "binomial")
r_glm_no_pasg <- glm(risky.landing ~ no_pasg,
data = FAA_b, family = "binomial")
r_glm_speed_ground <- glm(risky.landing ~ speed_ground,
data = FAA_b, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
r_glm_speed_air <- glm(risky.landing ~ speed_air,
data = FAA_b, family = "binomial")
r_glm_height <- glm(risky.landing ~ height,
data = FAA_b, family = "binomial")
r_glm_pitch <- glm(risky.landing ~ pitch,
data = FAA_b, family = "binomial")
The factors were already scaled in order to assess their impact on the response variable on the basis of the coefficient estimates.
A single factor regression analysis was again performed against response risky.landing for each of the scaled predictors.
r_glm_aircraft_n <- glm(risky.landing ~ aircraft,
data = FAA_b, family = "binomial")
r_glm_duration_n <- glm(risky.landing ~ duration_n,
data = FAA_b, family = "binomial")
r_glm_no_pasg_n <- glm(risky.landing ~ no_pasg_n,
data = FAA_b, family = "binomial")
r_glm_speed_ground_n <- glm(risky.landing ~ speed_ground_n,
data = FAA_b, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
r_glm_speed_air_n <- glm(risky.landing ~ speed_air_n,
data = FAA_b, family = "binomial")
r_glm_height_n <- glm(risky.landing ~ height_n,
data = FAA_b, family = "binomial")
r_glm_pitch_n <- glm(risky.landing ~ pitch_n,
data = FAA_b, family = "binomial")
| Variable | Abs. Coefficient | Odds Ratio | Direction | P-Value | |
|---|---|---|---|---|---|
| 4 | speed_ground | 0.6142187 | 1.848212 | positive | 0.0000001 |
| 5 | speed_air | 0.8704019 | 2.387870 | positive | 0.0000037 |
| 1 | aircraft | 1.0017753 | 2.723112 | positive | 0.0004561 |
| 7 | pitch | 0.3710720 | 1.449287 | positive | 0.1432961 |
| 3 | no_pasg | 0.0253793 | 1.025704 | negative | 0.1536237 |
| 2 | duration | 0.0011518 | 1.001152 | negative | 0.6801987 |
| 6 | height | 0.0022186 | 1.002221 | negative | 0.8705917 |
## [1] "--"
| Variable | Abs. Coefficient | Odds Ratio | Direction | P-Value | |
|---|---|---|---|---|---|
| 4 | speed_ground | 11.5078031 | 99489.071286 | positive | 0.0000001 |
| 5 | speed_air | 8.4744743 | 4790.903747 | positive | 0.0000037 |
| 1 | aircraft | 1.0017753 | 2.723112 | positive | 0.0004561 |
| 7 | pitch | 0.1953950 | 1.215791 | positive | 0.1432961 |
| 3 | no_pasg | 0.1901247 | 1.209400 | negative | 0.1536237 |
| 2 | duration | 0.0556912 | 1.057271 | negative | 0.6801987 |
| 6 | height | 0.0217086 | 1.021946 | negative | 0.8705917 |
The factors are arranged in order of their importance based on the p-values. The p-value should be less than 0.05 for a 95% level of confidence used while testing the null hypothesis that the factor does not impact the response variable. The smaller the p-value, the more significant the factor. Based on these criteria speed_ground, speed_air and aircraft were identified as statistically important factors for analysis.
Note: The odds ratio for risky.landing is higher when aircraft make is Boeing.
The above factors and their relation with risky.landing is shown below. The charts represent various graphical representations of the significant factors - Aircraft, Ground Speed, Air Speed
plot(as.factor(risky.landing) ~ speed_ground, data = FAA_b)
plot(as.factor(risky.landing) ~ speed_air, data = FAA_b)
plot(jitter(risky.landing,0.1)~jitter(speed_ground),FAA_b,xlab="Speed Ground",
ylab="Risky Landing",pch=".")
plot(jitter(risky.landing,0.1)~jitter(speed_air),FAA_b,xlab="Speed Air",
ylab="Risky Landing",pch=".")
ggplot(FAA_b,aes(x=aircraft,fill=as.factor(risky.landing)))+
geom_bar(position="dodge")
ggplot(FAA_b,aes(x=speed_ground,fill=as.factor(risky.landing)))+
geom_histogram(position="dodge",binwidth=1)
ggplot(FAA_b,aes(x=speed_air,fill=as.factor(risky.landing)))+
geom_histogram(position="dodge",binwidth=1)
## Warning: Removed 628 rows containing non-finite values (stat_bin).
The significant variables and their relationship with risky landing were visualized using a bar chart, scatter plot and histogram.
Based on the above charts and graphs, the following conclusions can be made for each significant variable:
A full model is created based on output of last two steps after performing a collinearity check for the factors speed_ground and speed_air.
# checking for collinearity
r_model1 <- glm(risky.landing ~ speed_ground,
data = FAA_b, family = "binomial")
r_model2 <- glm(risky.landing ~ speed_air,
data = FAA_b, family = "binomial")
r_model3 <- glm(risky.landing ~ speed_ground + speed_air,
data = FAA_b, family = "binomial")
correlation <- cor(FAA_b$speed_ground, FAA_b$speed_air, use = "complete.obs")
The correlation value between
speed_groundandspeed_airis 0.9879383.
speed_ground: 62.9306044speed_air: 48.5802507speed_ground + speed_air: 49.9528945The factors speed_ground and speed_air show a very high correlation. Based on the AIC values of the three models, the factor speed_air has the lowest AIC value. However, speed_air has 628 missing values while speed_ground has 0 missing values. Hence, speed_air is removed from further analysis and only speed_ground is included.
The full model is created based on the cleaned dataset using following snippet:
full.model.r <- glm(risky.landing ~ aircraft + speed_ground,
data = FAA_b,
family = "binomial")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -102.077 | 24.775 | -4.120 | 0.000 |
| aircraftboeing | 4.019 | 1.249 | 3.217 | 0.001 |
| speed_ground | 0.926 | 0.225 | 4.121 | 0.000 |
Aircraft make and ground speed are impacting the response
risky.landingsignificantly.AIC of model is 46.0965964
step() method)The model is creating using following snippet:
model0_r <- glm(
risky.landing ~ aircraft + duration + no_pasg + speed_ground + speed_air + height + pitch,
data = FAA_b,
family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
aic.model.r <- step(model0_r, trace = 0, direction = "forward")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -149.419 | 48.425 | -3.086 | 0.002 |
| speed_air | 1.617 | 0.654 | 2.472 | 0.013 |
| aircraftboeing | 7.330 | 3.002 | 2.442 | 0.015 |
| no_pasg | -0.120 | 0.096 | -1.253 | 0.210 |
| pitch | -1.316 | 1.430 | -0.920 | 0.357 |
| height | 0.045 | 0.058 | 0.786 | 0.432 |
| speed_ground | -0.164 | 0.498 | -0.328 | 0.743 |
| duration | 0.002 | 0.016 | 0.125 | 0.901 |
Other parameters of AIC model are:
The forward selection model with AIC criterion also provided aircraft and speed_air as significant factors. These are same as the variables selected in step 5
AIC value for this model is 38.1439647 which is much lower than the model in step 5 (46.0965964), suggesting this is a better model.
step() method)bic.model.r <- step(model0_r, trace = 0, direction = "forward", criterion = "BIC")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -149.419 | 48.425 | -3.086 | 0.002 |
| speed_air | 1.617 | 0.654 | 2.472 | 0.013 |
| aircraftboeing | 7.330 | 3.002 | 2.442 | 0.015 |
| no_pasg | -0.120 | 0.096 | -1.253 | 0.210 |
| pitch | -1.316 | 1.430 | -0.920 | 0.357 |
| height | 0.045 | 0.058 | 0.786 | 0.432 |
| speed_ground | -0.164 | 0.498 | -0.328 | 0.743 |
| duration | 0.002 | 0.016 | 0.125 | 0.901 |
Other parameters of BIC model are:
The forward selection model with BIC criterion also provides the same model as in step 6. Aircraft and speed_air are significant factors. These are same as the variables selected in step 5. AIC value for this model is 38.1439647 which is much lower than the model in step 5 (46.0965964), suggesting this is a better model.
In the full model, aircraft and speed_ground was used. As per AIC as well as BIC, aircraft and speed_air are significant factors.
Given a limited opportunity to make my case, I would prefer following information and approach:
Model: The model has aircraft and speed_air as the significant predictors for the response variable.
Table:
ggplot(FAA_b,aes(x=aircraft,fill=as.factor(risky.landing)))+
geom_bar(position="dodge")
ggplot(FAA_b,aes(x=speed_air,fill=as.factor(risky.landing)))+
geom_histogram(position="dodge",binwidth=1)
## Warning: Removed 628 rows containing non-finite values (stat_bin).
Executive summary:
long.landing and risky.landingthresh <- seq(0.01,0.5,0.01)
predprob_l <- predict(aic.model.l, type = "response")
predprob_r <- predict(aic.model.r, type = "response")
long.landing_l <- long.landing[!is.na(FAA_b$speed_air) & !is.na(FAA_b$duration)]
sensitivity <- specificity<-rep(NA,length(thresh))
for(j in seq(along=thresh)) {
pp<-ifelse(predprob_l < thresh[j], "no", "yes")
xx<-xtabs(~long.landing_l + pp, FAA_b)
specificity[j] <- xx[1,1]/(xx[1,1] + xx[1,2])
sensitivity[j] <- xx[2,2]/(xx[2,1] + xx[2,2])
}
par(mfrow= c(1,2))
matplot(thresh,cbind(sensitivity, specificity), type="l", xlab="Threshold", ylab="Proportion", lty=1:2)
plot(1-specificity, sensitivity, type="l");abline(0, 1, lty=2)
thresh <- seq(0.01, 0.5, 0.01)
predprob_l <- predict(aic.model.l, type = "response")
predprob_r <- predict(aic.model.r, type = "response")
risky.landing_r <- risky.landing[!is.na(FAA_b$speed_air) & !is.na(FAA_b$duration)]
sensitivity.r <- specificity.r <- rep(NA,length(thresh))
for(j in seq(along=thresh)) {
pp <- ifelse(predprob_r < thresh[j], "no", "yes")
xx <- xtabs(~risky.landing_r + pp, FAA_b)
specificity.r[j] <- xx[1, 1]/(xx[1, 1] + xx[1, 2])
sensitivity.r[j] <- xx[2, 2]/(xx[2, 1] + xx[2, 2])
}
par(mfrow=c(1,2))
matplot(thresh,cbind(sensitivity.r,specificity.r),type="l",xlab="Threshold",ylab="Proportion",lty=1:2)
plot(1-specificity.r,sensitivity.r,type="l");abline(0,1,lty=2)
plot(1-specificity,sensitivity, type="l", col="blue")
points(1-specificity.r,sensitivity.r,type="l",col="red")
lines(1-specificity.r,sensitivity.r, col="red",lty=2)
auc(long.landing_l, predprob_l)
## Area under the curve: 0.996
auc(risky.landing_r, predprob_r)
## Area under the curve: 0.9973
The ROC curves for the two final models for long landing and risky landing were observed. The plots indicate the trade-off between sensitivity and specificity i.e. when sensitivity is increased, specificity decreases and vice versa.The more area under the ROC curve, the better the predictive power of the model.We have a very high AUC value for both long landing and risky landing, however, the AUC for risky landing is slightly higher than that for long landing model.
Parameters:
| Aircraft | Boeing |
|---|---|
| duration | 200 |
| no_pasg | 80 |
| speed_ground | 115 |
| speed_air | 120 |
| height | 40 |
| pitch | 4 |
info <- data.frame(aircraft = character(), duration = numeric(), no_pasg = numeric(), speed_ground = numeric(), speed_air = numeric(), height = numeric(), pitch = numeric(), stringsAsFactors = FALSE)
info <- rbind(
info,
list("boeing", 200, 80, 115, 120, 40, 4))
colnames(info) <- c("aircraft", "duration", "no_pasg", "speed_ground", "speed_air", "height", "pitch")
pred_l <- predict(aic.model.l, info, type = "response", se.fit = T)
pred_r <- predict(full.model.r, info, type = "response", se.fit = T)
paste("Predict value for long landing = ", pred_l$fit[["1"]])
## [1] "Predict value for long landing = 1"
long.confidence <- c(pred_l$fit-2*pred_l$se.fit[1],pred_l$fit+2*pred_l$se.fit[1])
paste("predict value for risky landing = ", pred_r$fit[["1"]])
## [1] "predict value for risky landing = 0.999788977010519"
c(pred_r$fit-2*pred_r$se.fit[1],pred_r$fit+2*pred_r$se.fit[1])
## 1 1
## 0.9989074 1.0006706
The probability of long and risky landing was predicted for the given values of the variables. Both the probabilities were very high and estimated to be 1. The 95% confidence interval for long landing is very very narrow and it only includes the single point estimate 1. The 95% confidence interval for risky landing is also very narrow ranging from 0.9999997 to 1.0000002 i.e from 0.9999997 to 1 as probability cannot exceed 1.
The response variable risky.landing was predicted using logit, probit and cloglog links. The results are compared below.
model.logit.r <- glm(
risky.landing ~ aircraft + speed_air,
data = FAA_b,
family = binomial(link = logit))
model.probit.r <- glm(
risky.landing ~ aircraft + speed_air,
data = FAA_b,
family = binomial(link = probit))
model.cloglog.r <- glm(
risky.landing ~ aircraft + speed_air,
data = FAA_b,
family = binomial(link = cloglog))
logit.summary <- summary(model.logit.r)
probit.summary <- summary(model.probit.r)
cloglog.summary <- summary(model.cloglog.r)
Deviance Residuals
Logit Link
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.6729040 -0.0081175 -0.0006824 -0.0024089 0.0000503 2.4773894
Probit Link
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.6957399 -0.0001181 0.0000000 -0.0056729 0.0000000 2.3889898
CLog-Log Link
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.529384 -0.020475 -0.002957 -0.022744 0.000000 2.470875
Coefficients
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -134.0859 | 33.3811 | -4.0168 | 0.0001 |
| aircraftboeing | 4.5648 | 1.5081 | 3.0268 | 0.0025 |
| speed_air | 1.2240 | 0.3052 | 4.0101 | 0.0001 |
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -74.2249 | 16.7591 | -4.4289 | 0e+00 |
| aircraftboeing | 2.6447 | 0.7916 | 3.3409 | 8e-04 |
| speed_air | 0.6773 | 0.1533 | 4.4182 | 0e+00 |
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -103.6808 | 25.5564 | -4.0569 | 0.0000 |
| aircraftboeing | 3.2606 | 0.9964 | 3.2725 | 0.0011 |
| speed_air | 0.9423 | 0.2331 | 4.0427 | 0.0001 |
Other info
| Null deviance | Residual deviance | AIC | No. of Fischer scoring iterations | |
|---|---|---|---|---|
| Logit Link | 248.18 on 202 df | 26.296 on 200 df | 32 | 9 |
| Probit Link | 248.18 on 202 df | 26.138 on 200 df | 32 | 11 |
| CLog-Log Link | 248.18 on 202 df | 24.363 on 200 df | 30 | 12 |
The model for risky landing was built using logit, probit and complementary log-log link. The following conclusions were made on comparing the three models:
pred_logit <- predict(model.logit.r, type = "response")
pred_probit <- predict(model.probit.r, type = "response")
pred_cloglog <- predict(model.cloglog.r, type = "response")
ROC_logit <- roc(FAA_b$risky.landing[!is.na(FAA_b$speed_air)], pred_logit)
ROC_probit <- roc(FAA_b$risky.landing[!is.na(FAA_b$speed_air)], pred_probit)
ROC_cloglog <- roc(FAA_b$risky.landing[!is.na(FAA_b$speed_air)], pred_cloglog)
plot(ROC_logit)
lines(ROC_probit, col= 'blue')
lines(ROC_cloglog, col= "red")
auc(FAA_b$risky.landing[!is.na(FAA_b$speed_air)], pred_logit)
## Area under the curve: 0.9964
auc(FAA_b$risky.landing[!is.na(FAA_b$speed_air)], pred_probit)
## Area under the curve: 0.9963
auc(FAA_b$risky.landing[!is.na(FAA_b$speed_air)], pred_cloglog)
## Area under the curve: 0.9964
All the three ROC curves are very closely overlapping each other. As it is difficult to compare the ROC curves, the AUC values were used to compare the models. Based on the AUC values, logit and cloglog model have the same area under the curve which is very slightly higher than that of the probit model.
pred_logit.index <- sort(as.numeric(names(tail(sort(pred_logit), 5))))
pred_logit.sort <- FAA_b[pred_logit.index,1:7]
kable(pred_logit.sort, caption = "Predict Full Model", digits = 4)
| aircraft | duration | no_pasg | speed_ground | speed_air | height | pitch | |
|---|---|---|---|---|---|---|---|
| 64 | boeing | 161.8925 | 72 | 129.2649 | 128.4177 | 33.9490 | 4.1400 |
| 176 | boeing | 197.5464 | 68 | 126.6692 | 127.9641 | 23.7642 | 2.9932 |
| 307 | boeing | 154.5246 | 67 | 129.3072 | 127.5933 | 23.9785 | 5.1547 |
| 362 | boeing | 63.3295 | 52 | 132.7847 | 132.9115 | 18.1770 | 4.1107 |
| 408 | airbus | 131.7311 | 60 | 131.0352 | 131.3379 | 28.2780 | 3.6602 |
pred_probit.index <- sort(as.numeric(names(tail(sort(pred_probit), 5))))
pred_probit.sort <- FAA_b[pred_probit.index,1:7]
kable(pred_probit.sort, caption = "Probit Model", digits = 4)
| aircraft | duration | no_pasg | speed_ground | speed_air | height | pitch | |
|---|---|---|---|---|---|---|---|
| 387 | boeing | 153.8345 | 61 | 126.8393 | 126.1186 | 20.5478 | 4.3346 |
| 408 | airbus | 131.7311 | 60 | 131.0352 | 131.3379 | 28.2780 | 3.6602 |
| 643 | airbus | 137.5857 | 66 | 126.2443 | 127.9371 | 35.1757 | 2.7019 |
| 751 | airbus | 175.5144 | 49 | 125.2123 | 125.1385 | 22.5248 | 4.3658 |
| 769 | airbus | 98.5003 | 66 | 123.3105 | 124.3908 | 22.3272 | 4.2767 |
pred_cloglog.index <- sort(as.numeric(names(tail(sort(pred_cloglog), 5))))
pred_cloglog.sort <- FAA_b[pred_cloglog.index,1:7]
kable(pred_cloglog.sort, caption = "cloglog Model", digits = 4)
| aircraft | duration | no_pasg | speed_ground | speed_air | height | pitch | |
|---|---|---|---|---|---|---|---|
| 669 | airbus | 140.4531 | 75 | 120.4189 | 118.4847 | 31.2634 | 2.7967 |
| 751 | airbus | 175.5144 | 49 | 125.2123 | 125.1385 | 22.5248 | 4.3658 |
| 765 | airbus | 220.0571 | 61 | 120.5579 | 118.2882 | 15.6657 | 4.1113 |
| 769 | airbus | 98.5003 | 66 | 123.3105 | 124.3908 | 22.3272 | 4.2767 |
| 821 | airbus | NA | 58 | 113.4274 | 114.8444 | 40.1011 | 4.2484 |
all.3.df <- cbind(pred_logit.index, pred_probit.index, pred_cloglog.index);
colnames(all.3.df) <- c("pred_logit", "pred_probit", "pred_cloglog")
kable(all.3.df, caption="Top 5 predictions")
| pred_logit | pred_probit | pred_cloglog |
|---|---|---|
| 64 | 387 | 669 |
| 176 | 408 | 751 |
| 307 | 643 | 765 |
| 362 | 751 | 769 |
| 408 | 769 | 821 |
pred_probit and pred_cloglog share two values in top 5 (Index 751 and 769). Apart from these two, there are no common values across the 3 columns.
model.logit.l <- glm(
long.landing ~ aircraft + speed_air + height,
data = FAA_b,
family = binomial(link = "logit"))
model.probit.l <- glm(
long.landing ~ aircraft + speed_air + height,
data = FAA_b,
family = binomial(link = "probit"))
model.cloglog.l <- glm(
long.landing ~ aircraft + speed_air + height,
data = FAA_b,
family = binomial(link = "cloglog"))
pred_l_logit <- predict(model.logit.l, info, type = "response", se.fit = T)
pred_l_probit <- predict(model.probit.l, info, type = "response", se.fit = T)
pred_l_cloglog <- predict(model.cloglog.l, info, type = "response", se.fit = T)
pred_r_logit <- predict(full.model.r, info, type = "response", se.fit = T)
pred_r_probit <- predict(model.probit.r, info, type = "response", se.fit = T)
pred_r_cloglog <- predict(model.cloglog.r, info, type = "response", se.fit = T)
pred_l <- list(pred_l_logit, pred_l_probit, pred_l_cloglog)
pred_r <- list(pred_r_logit, pred_r_probit, pred_r_cloglog)
| Estimated Probability | CI Lower Limit | CI Upper Limit | |
|---|---|---|---|
| Logit Link | 1 | 1 | 1 |
| Probit Link | 1 | 1 | 1 |
| CLog-Log Link | 1 | 1 | 1 |
| Estimated Probability | CI Lower Limit | CI Upper Limit | |
|---|---|---|---|
| Logit Link | 0.999789 | 0.9989074 | 1.000671 |
| Probit Link | 1.000000 | 1.0000000 | 1.000000 |
| CLog-Log Link | 1.000000 | 1.0000000 | 1.000000 |
The predicted probablity for both long landing and risky landing for logit, probit and cloglog link models is 1. These models predict very narrow confidence intervals. The 95% confidence interval for long landing for all three linkges is very very narrow and it only includes the single point estimate 1. The 95% confidence interval for risky landing is also very narrow ranging from 0.9989074 to 1.000671 i.e from 0.9989074 to 1 as probability cannot exceed 1 for logit link and includes only the point estimate for probit and cloglog link. Thus, it can be concluded that for the given values of the variables the landing will be long and risky as seen from the results of all the models.