Name: Shravya Challa
UCID: M13940819
Background: Flight landing.
Motivation: To reduce the risk of landing overrun.
Goal: To study what factors and how they would impact the landing distance of a commercial flight.
Data: Landing data (landing distance and other parameters) from 950 commercial flights (not real data set but simulated from statistical models).
Variable dictionary:
Aircraft: The make of an aircraft (Boeing or Airbus).
Duration (in minutes): Flight duration between taking off and landing. The duration of a normal flight should always be greater than 40min.
No_pasg: The number of passengers in a flight.
Speed_ground (in miles per hour): The ground speed of an aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal.
Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal.
Height (in meters): The height of an aircraft when it is passing over the threshold of the runway. The landing aircraft is required to be at least 6 meters high at the threshold of the runway.
Pitch (in degrees): Pitch angle of an aircraft when it is passing over the threshold of the runway.
Distance (in feet): The landing distance of an aircraft. More specifically, it refers to the distance between the threshold of the runway and the point where the aircraft can be fully stopped. The length of the airport runway is typically less than 6000 feet.
library(readxl)
library(tidyr)
library(ggplot2)
library(dplyr)
library(MASS)
library(broom)
library(pROC)
Initial exploration of data
Step1: Read two files FAA1.xls and FAA2.xls into R system.
Using read_excel from readxl library to read the files
faa1 <- read_excel("C:/Users/shrav/OneDrive/D/courses/Statistical model/FAA1.xls")
faa2 <- read_excel("C:/Users/shrav/OneDrive/D/courses/Statistical model/FAA2.xls")
Step2: Check the structure of 2 data sets
str(faa1)
str(faa2)
Step3: Merge the 2 datasets and check for duplicates
# merge datasets
faa2['duration'] <- NA
faa <- rbind(faa1, faa2)
# remove duplicates
idx <- duplicated(faa[,-2])
faa <- faa[!idx,]
Step4: Check the structure of the dataset and summary statistics
str(faa)
## tibble[,8] [850 x 8] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:850] "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num [1:850] 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num [1:850] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:850] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:850] 109 103 NA NA NA ...
## $ height : num [1:850] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:850] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:850] 3370 2988 1145 1664 1050 ...
summary(faa)
## aircraft duration no_pasg speed_ground
## Length:850 Min. : 14.76 Min. :29.0 Min. : 27.74
## Class :character 1st Qu.:119.49 1st Qu.:55.0 1st Qu.: 65.90
## Mode :character Median :153.95 Median :60.0 Median : 79.64
## Mean :154.01 Mean :60.1 Mean : 79.45
## 3rd Qu.:188.91 3rd Qu.:65.0 3rd Qu.: 92.06
## Max. :305.62 Max. :87.0 Max. :141.22
## NA's :50
## speed_air height pitch distance
## Min. : 90.00 Min. :-3.546 Min. :2.284 Min. : 34.08
## 1st Qu.: 96.25 1st Qu.:23.314 1st Qu.:3.642 1st Qu.: 883.79
## Median :101.15 Median :30.093 Median :4.008 Median :1258.09
## Mean :103.80 Mean :30.144 Mean :4.009 Mean :1526.02
## 3rd Qu.:109.40 3rd Qu.:36.993 3rd Qu.:4.377 3rd Qu.:1936.95
## Max. :141.72 Max. :59.946 Max. :5.927 Max. :6533.05
## NA's :642
Step5: Initial observations to the FAA Agent
Data cleaning and further exploration
Step6: Find abnormal values in the dataset and remove them
faa4 <- faa
faa4 <- faa4[!(faa4$speed_ground < 30 | faa4$speed_ground > 140),]
idx2 <- which(faa4$duration < 40)
faa4 <- faa4[-idx2,]
idx3 <- which(faa4$speed_air < 30 | faa4$speed_air > 140)
faa4 <- faa4[!(faa4$height <6),]
faa4 <- faa4[!(faa4$distance > 6000),]
Step7: Structure of the dataset, summary statistics
str(faa4)
## tibble[,8] [831 x 8] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:831] "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num [1:831] 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num [1:831] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:831] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:831] 109 103 NA NA NA ...
## $ height : num [1:831] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:831] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:831] 3370 2988 1145 1664 1050 ...
summary(faa4)
## aircraft duration no_pasg speed_ground
## Length:831 Min. : 41.95 Min. :29.00 Min. : 33.57
## Class :character 1st Qu.:119.63 1st Qu.:55.00 1st Qu.: 66.20
## Mode :character Median :154.28 Median :60.00 Median : 79.79
## Mean :154.78 Mean :60.06 Mean : 79.54
## 3rd Qu.:189.66 3rd Qu.:65.00 3rd Qu.: 91.91
## Max. :305.62 Max. :87.00 Max. :132.78
## NA's :50
## speed_air height pitch distance
## Min. : 90.00 Min. : 6.228 Min. :2.284 Min. : 41.72
## 1st Qu.: 96.23 1st Qu.:23.530 1st Qu.:3.640 1st Qu.: 893.28
## Median :101.12 Median :30.167 Median :4.001 Median :1262.15
## Mean :103.48 Mean :30.458 Mean :4.005 Mean :1522.48
## 3rd Qu.:109.36 3rd Qu.:37.004 3rd Qu.:4.370 3rd Qu.:1936.63
## Max. :132.91 Max. :59.946 Max. :5.927 Max. :5381.96
## NA's :628
Step8: Histograms for all variables
faa4[,-1] %>%
gather() %>%
ggplot(aes(value)) + # ggplot2 histogram & density
facet_wrap(~ key, scales = "free") +
geom_histogram(aes(y = stat(density))) +
geom_density(col = "blue", size = 1)
Step9: Informed findings to the FAA agent
Initial analysis to identify important factors influencing Landing distance
Step10: Computing correlation between landing distance and every other variable and storing the values in a table
cor_val <- cor(faa4[,c(-1, -8)], faa4$distance, use = "pairwise.complete.obs")
table1 <- data.frame(cor_val)
table1 <- table1 %>%
mutate(sign = case_when(
cor_val >= 0 ~ "Positive",
cor_val < 0 ~ "Negative"
))
table1$cor_val <- abs(table1$cor_val)
table1 <- table1 %>% arrange(desc(cor_val))
table1
## cor_val sign
## speed_air 0.94209714 Positive
## speed_ground 0.86624383 Positive
## height 0.09941121 Positive
## pitch 0.08702846 Positive
## duration 0.05138252 Negative
## no_pasg 0.01775663 Negative
Step11: Scatter plots with distance
faa4[,-1] %>%
gather(-distance, key = "var", value = "value") %>%
ggplot(aes(x = value, y = distance)) +
geom_point() +
facet_wrap(~ var, scales = "free") +
theme_bw()
Step12: Convert aircraft to numeric 0/1 and repeat steps 10 and 11
faa4 <- faa4 %>%
mutate(aircraft1 = case_when(
aircraft == "boeing" ~ 1,
aircraft == "airbus" ~ 0
))
cor_val <- cor(faa4[,c(-1, -8)], faa4$distance, use = "pairwise.complete.obs")
table1 <- data.frame(cor_val)
table1 <- table1 %>%
mutate(sign = case_when(
cor_val >= 0 ~ "Positive",
cor_val < 0 ~ "Negative"
))
table1$cor_val <- abs(table1$cor_val)
table1 <- table1 %>% arrange(desc(cor_val))
table1
## cor_val sign
## speed_air 0.94209714 Positive
## speed_ground 0.86624383 Positive
## aircraft1 0.23814452 Positive
## height 0.09941121 Positive
## pitch 0.08702846 Positive
## duration 0.05138252 Negative
## no_pasg 0.01775663 Negative
faa4[,-1] %>%
gather(-distance, key = "var", value = "value") %>%
ggplot(aes(x = value, y = distance)) +
geom_point() +
facet_wrap(~ var, scales = "free") +
theme_bw()
Regression using single factor each time
Step13: Regress Y on each of the X variables. Store p-value and direction of regression co-efficient in table2
col10 <- names(faa4)[-c(1,8)]
lm.test <- vector("list", length(col10))
table2 <- data.frame(matrix(ncol=3,nrow=0))
colnames(table2) <- c("variable names", "p_value", "Direction")
for(i in seq_along(col10)){
lm.test[[i]] <- lm(reformulate(col10[i], "distance"), data = faa4)
coef <- tidy(lm.test[[i]])[2,2]
val <- case_when(
coef >= 0 ~ "Positive",
coef < 0 ~ "Negative"
)
table2[nrow(table2)+1,] = c(col10[i], as.numeric(glance(lm.test[[i]])$p.value), val)
}
table2$p_value <- as.numeric(table2$p_value)
table2 <- table2 %>% arrange(p_value)
table2
## variable names p_value Direction
## 1 speed_ground 4.766371e-252 Positive
## 2 speed_air 2.500461e-97 Positive
## 3 aircraft1 3.526194e-12 Positive
## 4 height 4.123860e-03 Positive
## 5 pitch 1.208124e-02 Positive
## 6 duration 1.514002e-01 Negative
## 7 no_pasg 6.092520e-01 Negative
Step14: Standardize the regressors, repeat step13 and store the coefficients in table3
col10 <- names(faa4)[-c(1,8)]
lm.test2 <- vector("list", length(col10))
table3 <- data.frame(matrix(ncol=3,nrow=0))
colnames(table3) <- c("variable names", "coef", "Direction")
for(i in seq_along(col10)){
data1 <- faa4[,col10[i]]
colnames(data1)[1] <- "x1"
faa4["data"] <- (data1$x1 - mean(data1$x1, na.rm = T))/sd(data1$x1, na.rm = T)
lm.test2[[i]] <- lm(reformulate("data", "distance"), data = faa4)
coef <- tidy(lm.test2[[i]])[2,2]
val <- case_when(
coef >= 0 ~ "Positive",
coef < 0 ~ "Negative"
)
table3[nrow(table3)+1,] = c(col10[i], abs(tidy(lm.test2[[i]])[2,2]), val)
}
table3 <- table3 %>% arrange(desc(coef))
table3
## variable names coef Direction
## 1 speed_ground 776.44740 Positive
## 2 speed_air 774.34654 Positive
## 3 aircraft1 213.45802 Positive
## 4 height 89.10606 Positive
## 5 pitch 78.00693 Positive
## 6 duration 46.48013 Negative
## 7 no_pasg 15.91595 Negative
Step15: Comparing tables 1,2,3 and ranking the variables based on significance
table0 <- data.frame(variable_names = table2$`variable names`, rank = rank(table2$p_value))
table0
## variable_names rank
## 1 speed_ground 1
## 2 speed_air 2
## 3 aircraft1 3
## 4 height 4
## 5 pitch 5
## 6 duration 6
## 7 no_pasg 7
Check collinearity
Step16: compare regression coefficients
Model 1: LD ~ Speed_ground
Model 2: LD ~ Speed_air
Model 3: LD ~ Speed_ground + Speed_air
model1 <- lm(distance~speed_ground, data=faa4)
model2 <- lm(distance~speed_air, data=faa4)
model3 <- lm(distance~speed_ground+speed_air, data=faa4)
summary(model1)
##
## Call:
## lm(formula = distance ~ speed_ground, data = faa4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -897.09 -319.16 -72.09 210.83 1798.88
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1773.9407 67.8388 -26.15 <2e-16 ***
## speed_ground 41.4422 0.8302 49.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 448.1 on 829 degrees of freedom
## Multiple R-squared: 0.7504, Adjusted R-squared: 0.7501
## F-statistic: 2492 on 1 and 829 DF, p-value: < 2.2e-16
summary(model2)
##
## Call:
## lm(formula = distance ~ speed_air, data = faa4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -776.21 -196.39 8.72 209.17 624.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5455.709 207.547 -26.29 <2e-16 ***
## speed_air 79.532 1.997 39.83 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.3 on 201 degrees of freedom
## (628 observations deleted due to missingness)
## Multiple R-squared: 0.8875, Adjusted R-squared: 0.887
## F-statistic: 1586 on 1 and 201 DF, p-value: < 2.2e-16
summary(model3)
##
## Call:
## lm(formula = distance ~ speed_ground + speed_air, data = faa4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -819.74 -202.02 3.52 211.25 636.25
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5462.28 207.48 -26.327 < 2e-16 ***
## speed_ground -14.37 12.68 -1.133 0.258
## speed_air 93.96 12.89 7.291 6.99e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 200 degrees of freedom
## (628 observations deleted due to missingness)
## Multiple R-squared: 0.8883, Adjusted R-squared: 0.8871
## F-statistic: 795 on 2 and 200 DF, p-value: < 2.2e-16
From summary of the three models, we can observe that when speed_air and speed_ground are modeled individually, the direction of coefficient is the same as their correlation direction with distance.
In model3, speed_ground direction is changed and it’s p_value indicates that the coefficient is not significant.
This is a contradiction from the tables earlier.
Check correlation between speed_ground and speed_air
cor(faa4$speed_air, faa4$speed_ground, use = "pairwise.complete.obs")
## [1] 0.9879383
There is a high correlation between speed_air and speed_ground indicating that if these two variables are modeled together there is a high chance of multi collinearity.
Since there is multicollinearity, the results of model3 are unreliable and there is a sign change for coefficient of speed_ground.
To overcome this issue we choose only speed_ground for the following reasons
Variable selection based on ranking in table0
Step17: Incremental models based on ranking in table0
model1 <- lm(distance~speed_ground, data = faa4)
model2 <- lm(distance~speed_ground + speed_air, data = faa4)
model3 <- lm(distance~speed_ground + speed_air + aircraft1, data = faa4)
model4 <- lm(distance~speed_ground + speed_air + aircraft1 + height, data = faa4)
model5 <- lm(distance~speed_ground + speed_air+ aircraft1 + height + pitch, data = faa4)
model6 <- lm(distance~speed_ground + speed_air+ aircraft1 + height + pitch + duration, data = faa4)
model7 <- lm(distance~speed_ground + speed_air + aircraft1 + height + pitch + duration + no_pasg, data = faa4)
sum_1 <- c()
sum_1 <- append(sum_1, summary(model1)$r.squared)
sum_1 <- append(sum_1, summary(model2)$r.squared)
sum_1 <- append(sum_1, summary(model3)$r.squared)
sum_1 <- append(sum_1, summary(model4)$r.squared)
sum_1 <- append(sum_1, summary(model5)$r.squared)
sum_1 <- append(sum_1, summary(model6)$r.squared)
sum_1 <- append(sum_1, summary(model7)$r.squared)
plot(sum_1, pch=20, cex=1, type = 'b')
Step18: Repeat step17 with adjusted R squared
sum_2 <- c()
sum_2 <- append(sum_2, summary(model1)$adj.r.squared)
sum_2 <- append(sum_2, summary(model2)$adj.r.squared)
sum_2 <- append(sum_2, summary(model3)$adj.r.squared)
sum_2 <- append(sum_2, summary(model4)$adj.r.squared)
sum_2 <- append(sum_2, summary(model5)$adj.r.squared)
sum_2 <- append(sum_2, summary(model6)$adj.r.squared)
sum_2 <- append(sum_2, summary(model7)$adj.r.squared)
plot(sum_2, pch=20, cex=1, type = 'b')
Step19: Repeat step17 with AIC values
aic_1 <- c()
aic_1 <- append(aic_1, AIC(model1))
aic_1 <- append(aic_1, AIC(model2))
aic_1 <- append(aic_1, AIC(model3))
aic_1 <- append(aic_1, AIC(model4))
aic_1 <- append(aic_1, AIC(model5))
aic_1 <- append(aic_1, AIC(model6))
aic_1 <- append(aic_1, AIC(model7))
plot(aic_1, pch=20, cex=1, type = 'b')
Step20: Based on steps 18-19, what variables can be selected to build a model for landing distance
Step21: StepAIC function to perform forward variable selection
faa5 <- faa4[,-1]
faa5 <- na.omit(faa5)
mlr <- lm(distance~1, data = faa5)
stepAIC(mlr, direction = "forward", scope = list(upper = ~speed_ground + speed_air + aircraft1 + height + pitch + duration + no_pasg, lower = mlr))
## Start: AIC=2622.4
## distance ~ 1
##
## Df Sum of Sq RSS AIC
## + speed_air 1 118926290 14749519 2194.6
## + speed_ground 1 115311069 18364740 2237.3
## + aircraft1 1 3978580 129697229 2618.5
## <none> 133675809 2622.4
## + height 1 445916 133229893 2623.7
## + duration 1 367280 133308529 2623.9
## + pitch 1 154735 133521074 2624.2
## + no_pasg 1 141913 133533895 2624.2
##
## Step: AIC=2194.58
## distance ~ speed_air
##
## Df Sum of Sq RSS AIC
## + aircraft1 1 8088045 6661474 2041.6
## + height 1 2623377 12126142 2158.4
## + pitch 1 847903 13901616 2185.0
## <none> 14749519 2194.6
## + no_pasg 1 142098 14607421 2194.7
## + speed_ground 1 68916 14680603 2195.7
## + duration 1 14495 14735024 2196.4
##
## Step: AIC=2041.58
## distance ~ speed_air + aircraft1
##
## Df Sum of Sq RSS AIC
## + height 1 3217699 3443775 1914.9
## <none> 6661474 2041.6
## + duration 1 61424 6600051 2041.8
## + no_pasg 1 47410 6614064 2042.2
## + speed_ground 1 39437 6622037 2042.4
## + pitch 1 14544 6646931 2043.2
##
## Step: AIC=1914.92
## distance ~ speed_air + aircraft1 + height
##
## Df Sum of Sq RSS AIC
## + no_pasg 1 39886 3403889 1914.7
## <none> 3443775 1914.9
## + duration 1 12694 3431082 1916.2
## + pitch 1 8137 3435638 1916.5
## + speed_ground 1 6494 3437282 1916.5
##
## Step: AIC=1914.65
## distance ~ speed_air + aircraft1 + height + no_pasg
##
## Df Sum of Sq RSS AIC
## <none> 3403889 1914.7
## + duration 1 9733.8 3394155 1916.1
## + pitch 1 8832.0 3395057 1916.1
## + speed_ground 1 5823.5 3398066 1916.3
##
## Call:
## lm(formula = distance ~ speed_air + aircraft1 + height + no_pasg,
## data = faa5)
##
## Coefficients:
## (Intercept) speed_air aircraft1 height no_pasg
## -6263.754 82.032 432.074 13.776 -2.041
Create Binary responses
Step1: Create long.landing and risky.landing binary variables
faa6 <- faa4
faa6 <- faa6[, -10]
faa6$long.landing <- ifelse(faa6$distance > 2500, 1, 0)
faa6$risky.landing <- ifelse(faa6$distance > 3000, 1, 0)
faa6 <- faa6[,-8]
str(faa6)
## tibble[,10] [831 x 10] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:831] "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num [1:831] 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num [1:831] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground : num [1:831] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:831] 109 103 NA NA NA ...
## $ height : num [1:831] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:831] 4.04 4.12 4.43 3.88 4.03 ...
## $ aircraft1 : num [1:831] 1 1 1 1 1 1 1 1 1 1 ...
## $ long.landing : num [1:831] 1 1 0 0 0 0 0 0 0 0 ...
## $ risky.landing: num [1:831] 1 0 0 0 0 0 0 0 0 0 ...
Identifying important factors using binary data long.landing
Step2: Pie chart and histogram of long landing
hist(faa6$long.landing, prob = TRUE,
main = "Histogram with density curve")
lines(density(faa6$long.landing), col = 4, lwd = 2)
long_landing <- table(faa6$long.landing)
labels <- paste0(c(0,1), " = ", round(100 * long_landing/sum(long_landing), 2), "%")
pie(long_landing, labels = labels, main = "Pie chart of long.landing")
Step3: Single factor regression analysis using logistic regression
Single factor regression with original values
table1_2_df <- data.frame(matrix(ncol=5,nrow=0))
colnames(table1_2_df) <- c("variable names", "size of coef", "odds ratio", "Direction", "pvalue")
col10 <- names(faa6)[-c(1,9,10)]
glm.test <- vector("list", length(col10))
for(i in seq_along(col10)){
glm.test[[i]] <- glm(reformulate(col10[i], "long.landing"), family = binomial, data = faa6)
coef <- tidy(glm.test[[i]])[2,2]
val <- case_when(
coef >= 0 ~ "Positive",
coef < 0 ~ "Negative"
)
table1_2_df[nrow(table1_2_df)+1,] = c(col10[i], abs(coef), exp(coef), val, as.numeric(summary(glm.test[[i]])$coefficients[2,4]))
}
table1_2_df <- table1_2_df %>% arrange(pvalue)
table1_2_df
## variable names size of coef odds ratio Direction pvalue
## 1 speed_ground 0.472345752 1.6037518 Positive 3.935339e-14
## 2 speed_air 0.512321766 1.6691621 Positive 4.334124e-11
## 3 aircraft1 0.864119860 2.3729167 Positive 8.398591e-05
## 4 pitch 0.400527824 1.4926123 Positive 4.664982e-02
## 5 height 0.008623997 1.0086613 Positive 4.218576e-01
## 6 no_pasg 0.007256406 0.9927699 Negative 6.058565e-01
## 7 duration 0.001070492 0.9989301 Negative 6.305122e-01
Single factor regression with standardized values
table2_2_df <- data.frame(matrix(ncol=5,nrow=0))
colnames(table2_2_df) <- c("variable names", "size of coef", "odds ratio", "Direction", "pvalue")
glm.test2 <- vector("list", length(col10))
for(i in seq_along(col10)){
data1 <- faa6[,col10[i]]
colnames(data1)[1] <- "x1"
faa6["data"] <- (data1$x1 - mean(data1$x1, na.rm = T))/sd(data1$x1, na.rm = T)
glm.test2[[i]] <- glm(reformulate("data", "long.landing"), family = binomial, data = faa6)
coef <- tidy(glm.test2[[i]])[2,2]
val <- case_when(
coef >= 0 ~ "Positive",
coef < 0 ~ "Negative"
)
table2_2_df[nrow(table2_2_df)+1,] = c(col10[i], abs(coef), exp(coef), val, as.numeric(summary(glm.test2[[i]])$coefficients[2,4]))
}
table2_2_df <- table2_2_df %>% arrange(pvalue)
table2_2_df
## variable names size of coef odds ratio Direction pvalue
## 1 speed_ground 8.84971670 6972.4134377 Positive 3.935339e-14
## 2 speed_air 4.98810682 146.6585091 Positive 4.334124e-11
## 3 aircraft1 0.43130192 1.5392602 Positive 8.398591e-05
## 4 pitch 0.21090555 1.2347957 Positive 4.664982e-02
## 5 height 0.08438418 1.0880468 Positive 4.218576e-01
## 6 no_pasg 0.05436003 0.9470911 Negative 6.058565e-01
## 7 duration 0.05175818 0.9495585 Negative 6.305122e-01
step4: Visualizations for significant factors
vars <- table1_2_df$`variable names`[table1_2_df$pvalue<0.05]
for(i in seq_along(vars)) {
d <- subset (faa6, select = c(vars[i],"long.landing"))
par(mfrow = c(1,2))
plot(jitter(d$long.landing, 0.1)~jitter(d[[vars[i]]]),xlab = colnames(d)[1],ylab = "long.landing",pch=".")
d$long.landing <- factor(d$long.landing, levels = c(0, 1))
plot(long.landing ~ ., data = d)
par(mfrow = c(1,1))
print(ggplot(d,aes_string(x= colnames(d)[1], fill = "long.landing", colour = "long.landing")) +
geom_histogram(position = "dodge",binwidth = 1))
}
Step5: Full model, based on the earlier results and removing multicollinearity
full_model_2 <- glm(long.landing~speed_ground + aircraft1 + pitch, family = "binomial", data = faa6)
summary(full_model_2)
##
## Call:
## glm(formula = long.landing ~ speed_ground + aircraft1 + pitch,
## family = "binomial", data = faa6)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.11589 -0.01116 -0.00026 0.00000 2.40741
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -67.92855 10.48408 -6.479 9.22e-11 ***
## speed_ground 0.61471 0.09184 6.694 2.18e-11 ***
## aircraft1 3.04348 0.73345 4.150 3.33e-05 ***
## pitch 1.06599 0.60389 1.765 0.0775 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 622.778 on 830 degrees of freedom
## Residual deviance: 81.309 on 827 degrees of freedom
## AIC: 89.309
##
## Number of Fisher Scoring iterations: 10
BIC(full_model_2)
## [1] 108.1996
Step6: Forward selection using AIC
faa7 <- na.omit(faa6)
full <- glm(long.landing~.-risky.landing-aircraft, family = "binomial", data = faa7)
nothing <- glm(long.landing~1, family = "binomial", data = faa7)
full.forward.AIC <- step(nothing,
scope = list(lower = formula(nothing),
upper = formula(full)),
direction = "forward")
summary(full.forward.AIC)
##
## Call:
## glm(formula = long.landing ~ speed_air + aircraft1 + height +
## pitch, family = "binomial", data = faa7)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.36696 -0.02116 0.00000 0.00098 1.64748
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -182.2356 49.9549 -3.648 0.000264 ***
## speed_air 1.5923 0.4285 3.716 0.000203 ***
## aircraft1 8.1444 2.3153 3.518 0.000435 ***
## height 0.3635 0.1142 3.184 0.001454 **
## pitch 1.5957 1.0958 1.456 0.145327
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 270.199 on 194 degrees of freedom
## Residual deviance: 34.278 on 190 degrees of freedom
## AIC: 44.278
##
## Number of Fisher Scoring iterations: 10
BIC(full.forward.AIC)
## [1] 60.64335
Step7: Forward selection using BIC
full.forward.BIC <- step(nothing,
scope = list(lower = formula(nothing),
upper = formula(full)),
direction = "forward",
k = log(nrow(faa7)))
summary(full.forward.BIC)
##
## Call:
## glm(formula = long.landing ~ speed_air + aircraft1 + height,
## family = "binomial", data = faa7)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.50362 -0.04226 0.00000 0.00348 2.22995
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -150.64570 36.73061 -4.101 4.11e-05 ***
## speed_air 1.35748 0.32882 4.128 3.65e-05 ***
## aircraft1 7.42305 1.97283 3.763 0.000168 ***
## height 0.32871 0.09986 3.292 0.000996 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 270.199 on 194 degrees of freedom
## Residual deviance: 36.798 on 191 degrees of freedom
## AIC: 44.798
##
## Number of Fisher Scoring iterations: 10
BIC(full.forward.BIC)
## [1] 57.89043
Step8: Presentation to FAA agent
long.landing ~ speed_air + aircraft1 + height
* There are higher number of long landings for boeing aircraft and higher speed in the air.
## variable names size of coef odds ratio Direction
## 1 speed_air 1.35748 3.8863872554824 Positive
## 2 aircraft1 7.42305 1674.13183093398 Positive
## 3 height 0.32871 1.38917493643559 Positive
Identifying important factors for risky landing
Step9: Repeat steps 1-7 for risky landing
Histogram and pie chart for risky landing
hist(faa6$risky.landing, prob = TRUE,
main = "Histogram with density curve")
lines(density(faa6$risky.landing), col = 4, lwd = 2)
risky_landing <- table(faa6$risky.landing)
labels <- paste0(c(0,1), " = ", round(100 * risky_landing/sum(risky_landing), 2), "%")
pie(risky_landing, labels = labels, main = "Pie chart of risky.landing")
* There is only 7.34% of observations that are risky landing in the data as noted from the pie chart and histogram.
Single factor regression with original values
table1_2_df <- data.frame(matrix(ncol=5,nrow=0))
colnames(table1_2_df) <- c("variable names", "size of coef", "odds ratio", "Direction", "pvalue")
col10 <- names(faa6)[-c(1,9,10)]
glm.test <- vector("list", length(col10))
for(i in seq_along(col10)){
glm.test[[i]] <- glm(reformulate(col10[i], "risky.landing"), family = binomial, data = faa6)
coef <- tidy(glm.test[[i]])[2,2]
val <- case_when(
coef >= 0 ~ "Positive",
coef < 0 ~ "Negative"
)
table1_2_df[nrow(table1_2_df)+1,] = c(col10[i], abs(coef), exp(coef), val, as.numeric(summary(glm.test[[i]])$coefficients[2,4]))
}
table1_2_df <- table1_2_df %>% arrange(pvalue)
table1_2_df
## variable names size of coef odds ratio Direction pvalue
## 1 speed_ground 0.614218747 1.8482121 Positive 6.898006e-08
## 2 speed_air 0.870401902 2.3878703 Positive 3.728032e-06
## 3 aircraft1 1.001775330 2.7231120 Positive 4.560563e-04
## 4 pitch 0.371071969 1.4492874 Positive 1.432961e-01
## 5 no_pasg 0.025379344 0.9749400 Negative 1.536237e-01
## 6 duration 0.001151836 0.9988488 Negative 6.801987e-01
## 7 height 0.002218606 0.9977839 Negative 8.705917e-01
Single factor regression with standardized values
table2_2_df <- data.frame(matrix(ncol=5,nrow=0))
colnames(table2_2_df) <- c("variable names", "size of coef", "odds ratio", "Direction", "pvalue")
glm.test2 <- vector("list", length(col10))
for(i in seq_along(col10)){
data1 <- faa6[,col10[i]]
colnames(data1)[1] <- "x1"
faa6["data"] <- (data1$x1 - mean(data1$x1, na.rm = T))/sd(data1$x1, na.rm = T)
glm.test2[[i]] <- glm(reformulate("data", "risky.landing"), family = binomial, data = faa6)
coef <- tidy(glm.test2[[i]])[2,2]
val <- case_when(
coef >= 0 ~ "Positive",
coef < 0 ~ "Negative"
)
table2_2_df[nrow(table2_2_df)+1,] = c(col10[i], abs(coef), exp(coef), val, as.numeric(summary(glm.test2[[i]])$coefficients[2,4]))
}
table2_2_df <- table2_2_df %>% arrange(pvalue)
table2_2_df
## variable names size of coef odds ratio Direction pvalue
## 1 speed_ground 11.50780308 9.948907e+04 Positive 6.898006e-08
## 2 speed_air 8.47447435 4.790904e+03 Positive 3.728032e-06
## 3 aircraft1 0.50000891 1.648736e+00 Positive 4.560563e-04
## 4 pitch 0.19539501 1.215791e+00 Positive 1.432961e-01
## 5 no_pasg 0.19012470 8.268560e-01 Negative 1.536237e-01
## 6 duration 0.05569118 9.458312e-01 Negative 6.801987e-01
## 7 height 0.02170864 9.785253e-01 Negative 8.705917e-01
Visualizations for significant factors
vars <- table1_2_df$`variable names`[table1_2_df$pvalue<0.05]
for(i in seq_along(vars)) {
d <- subset (faa6, select = c(vars[i],"risky.landing"))
par(mfrow = c(1,2))
plot(jitter(d$risky.landing, 0.1)~jitter(d[[vars[i]]]),xlab = colnames(d)[1],ylab = "risky.landing",pch=".")
d$risky.landing <- factor(d$risky.landing, levels = c(0, 1))
plot(risky.landing ~ ., data = d)
par(mfrow = c(1,1))
print(ggplot(d,aes_string(x= colnames(d)[1], fill = "risky.landing", colour = "risky.landing")) +
geom_histogram(position = "dodge",binwidth = 1))
}
Full model, based on the earlier results and removing multicollinearity
full_model_2 <- glm(risky.landing~speed_ground + aircraft1, family = "binomial", data = faa6)
summary(full_model_2)
##
## Call:
## glm(formula = risky.landing ~ speed_ground + aircraft1, family = "binomial",
## data = faa6)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.24398 -0.00011 0.00000 0.00000 1.61021
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -102.0772 24.7751 -4.120 3.79e-05 ***
## speed_ground 0.9263 0.2248 4.121 3.78e-05 ***
## aircraft1 4.0190 1.2494 3.217 0.0013 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 436.043 on 830 degrees of freedom
## Residual deviance: 40.097 on 828 degrees of freedom
## AIC: 46.097
##
## Number of Fisher Scoring iterations: 12
BIC(full_model_2)
## [1] 60.26449
Forward selection using AIC
faa7 <- na.omit(faa6)
full <- glm(risky.landing~.-long.landing-aircraft, family = "binomial", data = faa7)
nothing <- glm(risky.landing~1, family = "binomial", data = faa7)
full.forward.AIC1 <- step(nothing,
scope = list(lower = formula(nothing),
upper = formula(full)),
direction = "forward")
summary(full.forward.AIC1)
##
## Call:
## glm(formula = risky.landing ~ speed_air + aircraft1, family = "binomial",
## data = faa7)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.67140 -0.00714 -0.00063 0.00008 2.47432
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -133.7174 33.5105 -3.990 6.60e-05 ***
## speed_air 1.2206 0.3064 3.984 6.78e-05 ***
## aircraft1 4.5499 1.5124 3.008 0.00263 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 240.724 on 194 degrees of freedom
## Residual deviance: 26.281 on 192 degrees of freedom
## AIC: 32.281
##
## Number of Fisher Scoring iterations: 9
BIC(full.forward.AIC1)
## [1] 42.09974
Forward selection using BIC
full.forward.BIC1 <- step(nothing,
scope = list(lower = formula(nothing),
upper = formula(full)),
direction = "forward",
k = log(nrow(faa7)))
summary(full.forward.BIC1)
##
## Call:
## glm(formula = risky.landing ~ speed_air + aircraft1, family = "binomial",
## data = faa7)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.67140 -0.00714 -0.00063 0.00008 2.47432
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -133.7174 33.5105 -3.990 6.60e-05 ***
## speed_air 1.2206 0.3064 3.984 6.78e-05 ***
## aircraft1 4.5499 1.5124 3.008 0.00263 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 240.724 on 194 degrees of freedom
## Residual deviance: 26.281 on 192 degrees of freedom
## AIC: 32.281
##
## Number of Fisher Scoring iterations: 9
BIC(full.forward.BIC1)
## [1] 42.09974
Step10 - Presentation to FAA agent
risky.landing ~ speed_air + aircraft1
## variable names size of coef odds ratio Direction
## 1 speed_air 1.2206 3.38922065607726 Positive
## 2 aircraft1 4.5499 94.6229455472389 Positive
Step11: Summary of differences between long landing and risky landing models
Step 12: ROC curve
preds=predict(full.forward.BIC)
roc1=roc(faa7$long.landing ~ preds)
plot(roc1)
preds2 <- predict(full.forward.BIC1)
roc2 <- roc(faa7$risky.landing ~ preds2)
plot(roc2, add=TRUE, col='red')
* Risky landing has a better ROC curve than long landing.
auc(roc1)
## Area under the curve: 0.9945
auc(roc2)
## Area under the curve: 0.9962
Step13: Predicting the probability of long landing and risky landing for new data
new.ind<-data.frame(aircraft="Boeing",duration=200,no_pasg=80,speed_ground=115,speed_air=120,height=40,pitch=4,aircraft1 = 1)
## Long landing probability prediction
predict(full.forward.BIC,newdata=new.ind,type="response") ### Probability
pred_bic <- predict(full.forward.BIC,newdata=new.ind,type="response",se=T)
c(pred_bic$fit[1]-1.96*as.numeric(pred_bic$se.fit[1]),pred_bic$fit[1]+1.96*as.numeric(pred_bic$se.fit[1]))
## risky landing probability prediction
predict(full.forward.BIC1,newdata=new.ind,type="response") ### Probability
pred_bic1 <- predict(full.forward.BIC1,newdata=new.ind,type="response",se=T)
c(pred_bic1$fit[1]-1.96*as.numeric(pred_bic1$se.fit[1]),pred_bic1$fit[1]+1.96*as.numeric(pred_bic1$se.fit[1]))
The probability of long landing and risky landing for the given data is 1, as speed_air is 120 which is at the higher end always results in long landing and risky landing.
The confidence interval for both the probabilities is from 0.9999997 to 1.0000002