Code
library(readxl)
FAA1 <- read_excel("FAA1.xls")
FAA2 <- read_excel("FAA2.xls")Group Members : Utsav Raj, Rohit Menon, Satarupa Saha, Mounish Sunkara
Data set: Landing data from 950 commercial flights
Data dictionary
• aircraft: The make of the aircraft (e.g., Boeing or Airbus).
• duration (in minutes): Flight duration between taking off and landing. The duration of a normal flight should always be greater than 40min.
• no_pasg: The number of passengers in the flight.
• speed_ground (in miles per hour): The ground speed of the aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal.
• speed_air (in miles per hour): The air speed of the aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal.
• height (in meters): The height of the aircraft when it is passing over the threshold of the runway. The landing aircraft is required to be at least 6 meters high at the threshold of the runway.
• pitch (in degrees): Pitch angle of the aircraft when it is passing over the threshold of the runway.
• distance (in feet): The landing distance of the aircraft. More specifically, it refers to the distance between the threshold of the runway and the point where the aircraft can be fully stopped. The length of the airport runway is typically less than 6000 feet.
library(readxl)
FAA1 <- read_excel("FAA1.xls")
FAA2 <- read_excel("FAA2.xls")str(FAA1)tibble [800 × 8] (S3: tbl_df/tbl/data.frame)
$ aircraft : chr [1:800] "boeing" "boeing" "boeing" "boeing" ...
$ duration : num [1:800] 98.5 125.7 112 196.8 90.1 ...
$ no_pasg : num [1:800] 53 69 61 56 70 55 54 57 61 56 ...
$ speed_ground: num [1:800] 107.9 101.7 71.1 85.8 59.9 ...
$ speed_air : num [1:800] 109 103 NA NA NA ...
$ height : num [1:800] 27.4 27.8 18.6 30.7 32.4 ...
$ pitch : num [1:800] 4.04 4.12 4.43 3.88 4.03 ...
$ distance : num [1:800] 3370 2988 1145 1664 1050 ...
str(FAA2)tibble [150 × 7] (S3: tbl_df/tbl/data.frame)
$ aircraft : chr [1:150] "boeing" "boeing" "boeing" "boeing" ...
$ no_pasg : num [1:150] 53 69 61 56 70 55 54 57 61 56 ...
$ speed_ground: num [1:150] 107.9 101.7 71.1 85.8 59.9 ...
$ speed_air : num [1:150] 109 103 NA NA NA ...
$ height : num [1:150] 27.4 27.8 18.6 30.7 32.4 ...
$ pitch : num [1:150] 4.04 4.12 4.43 3.88 4.03 ...
$ distance : num [1:150] 3370 2988 1145 1664 1050 ...
Data set FAA1 contains 800 rows and 8 columns while FAA2 has 150 rows and 7 columns and duration column is missing from FAA2 file.
Final_FAA<-bind_rows( FAA1, FAA2)
Final_FAA$dup_flag<-as.numeric(duplicated(Final_FAA[, c(1, 3:8)]))
Final_FAA<-Final_FAA%>%filter(dup_flag==0)
Final_FAA<- Final_FAA[, 1:8]We found 100 duplicates after merging the files.For further analysis we have removed the duplicate records.
Structure of the combined Data Set
str(Final_FAA)tibble [850 × 8] (S3: tbl_df/tbl/data.frame)
$ aircraft : chr [1:850] "boeing" "boeing" "boeing" "boeing" ...
$ duration : num [1:850] 98.5 125.7 112 196.8 90.1 ...
$ no_pasg : num [1:850] 53 69 61 56 70 55 54 57 61 56 ...
$ speed_ground: num [1:850] 107.9 101.7 71.1 85.8 59.9 ...
$ speed_air : num [1:850] 109 103 NA NA NA ...
$ height : num [1:850] 27.4 27.8 18.6 30.7 32.4 ...
$ pitch : num [1:850] 4.04 4.12 4.43 3.88 4.03 ...
$ distance : num [1:850] 3370 2988 1145 1664 1050 ...
Summary of the combined Data Set
knitr::kable(summary(Final_FAA))| aircraft | duration | no_pasg | speed_ground | speed_air | height | pitch | distance | |
|---|---|---|---|---|---|---|---|---|
| Length:850 | Min. : 14.76 | Min. :29.0 | Min. : 27.74 | Min. : 90.00 | Min. :-3.546 | Min. :2.284 | Min. : 34.08 | |
| Class :character | 1st Qu.:119.49 | 1st Qu.:55.0 | 1st Qu.: 65.90 | 1st Qu.: 96.25 | 1st Qu.:23.314 | 1st Qu.:3.642 | 1st Qu.: 883.79 | |
| Mode :character | Median :153.95 | Median :60.0 | Median : 79.64 | Median :101.15 | Median :30.093 | Median :4.008 | Median :1258.09 | |
| NA | Mean :154.01 | Mean :60.1 | Mean : 79.45 | Mean :103.80 | Mean :30.144 | Mean :4.009 | Mean :1526.02 | |
| NA | 3rd Qu.:188.91 | 3rd Qu.:65.0 | 3rd Qu.: 92.06 | 3rd Qu.:109.40 | 3rd Qu.:36.993 | 3rd Qu.:4.377 | 3rd Qu.:1936.95 | |
| NA | Max. :305.62 | Max. :87.0 | Max. :141.22 | Max. :141.72 | Max. :59.946 | Max. :5.927 | Max. :6533.05 | |
| NA | NA’s :50 | NA | NA | NA’s :642 | NA | NA | NA |
Box Plot:
par(mfrow=c(2,4))
for (i in 2:ncol(Final_FAA)){
boxplot(Final_FAA[,i],main=colnames(Final_FAA[i]))
}2 types of aircraft : Boeing and Airbus
Airspeed is not captured for 75% of data
Duration is not captured in FAA2 file
Removed 100 duplicates after merging both the files
Data irregularity for other columns. Min Height is negative, min value of distance is 14 mins ,speed group is also abnormal for few cases
Filter_Final_FAA <- Final_FAA %>%
filter(((duration>40) | is.na(duration)) &
((speed_ground>=30 & speed_ground<=140)| is.na(speed_ground)) &
((speed_air>=30 & speed_air<=140)| is.na(speed_air))&
((height>=6)| is.na(height))&
((distance<6000)| is.na(distance)))We have filtered 19 rows based based on the criteria provided in the data dictionary.
summary(Filter_Final_FAA) aircraft duration no_pasg speed_ground
Length:831 Min. : 41.95 Min. :29.00 Min. : 33.57
Class :character 1st Qu.:119.63 1st Qu.:55.00 1st Qu.: 66.20
Mode :character Median :154.28 Median :60.00 Median : 79.79
Mean :154.78 Mean :60.06 Mean : 79.54
3rd Qu.:189.66 3rd Qu.:65.00 3rd Qu.: 91.91
Max. :305.62 Max. :87.00 Max. :132.78
NA's :50
speed_air height pitch distance
Min. : 90.00 Min. : 6.228 Min. :2.284 Min. : 41.72
1st Qu.: 96.23 1st Qu.:23.530 1st Qu.:3.640 1st Qu.: 893.28
Median :101.12 Median :30.167 Median :4.001 Median :1262.15
Mean :103.48 Mean :30.458 Mean :4.005 Mean :1522.48
3rd Qu.:109.36 3rd Qu.:37.004 3rd Qu.:4.370 3rd Qu.:1936.63
Max. :132.91 Max. :59.946 Max. :5.927 Max. :5381.96
NA's :628
hist.data.frame(Filter_Final_FAA)We have filtered 19 rows based based on the criteria provided in the data dictionary.
Airspeed is not captured for 75% of data
Duration is not captured in FAA2 file
Distribution of all variables other than air speed and distance seem to follow normal distribution. Air Speed and distance are right skewed.
Filter_Final_FAA_encode <- Filter_Final_FAA
Filter_Final_FAA_encode$aircraft <- ifelse(Filter_Final_FAA_encode$aircraft =="boeing", 0, 1)
table1 <- data.frame(cor(Filter_Final_FAA_encode[1:7], Filter_Final_FAA_encode$distance,use="complete.obs"))
colnames(table1) <- c("Size_Of_Correlation")
table1 <- tibble::rownames_to_column(table1, "Variable") %>%
mutate(Correlation_Dir = ifelse(Size_Of_Correlation < 0, "-ve","+ve"))%>%
arrange(desc(abs(Size_Of_Correlation)))
table1 Variable Size_Of_Correlation Correlation_Dir
1 speed_air 0.94321897 +ve
2 speed_ground 0.92877195 +ve
3 aircraft -0.17251928 -ve
4 height 0.05775639 +ve
5 duration 0.05241698 +ve
6 pitch 0.03402263 +ve
7 no_pasg -0.03258255 -ve
# Air Speed
plot_1 <- ggplot(Filter_Final_FAA_encode, aes(x = distance, y = speed_air)) +
geom_point(alpha = 0.9,color='blue') +
labs(y = "speed_air", title = "")
# Ground Speed
plot_2 <- ggplot(Filter_Final_FAA_encode, aes(x = distance, y = speed_ground)) +
geom_point(alpha = 0.9,color='blue') +
labs(y = "speed_ground", title = "")
# Height
plot_3 <- ggplot(Filter_Final_FAA_encode, aes(x = distance, y = height)) +
geom_point(alpha = 0.9,color='blue') +
labs(y = "height", title = "")
# Pitch
plot_4 <- ggplot(Filter_Final_FAA_encode, aes(x = distance, y = pitch)) +
geom_point(alpha = 0.9,color='blue') +
labs(y = "pitch", title = "")
# Duration
plot_5 <- ggplot(Filter_Final_FAA_encode, aes(x = distance, y = duration)) +
geom_point(alpha = 0.9,color='blue') +
labs(y = "duration", title = "")
# Number of Passengers
plot_6 <- ggplot(Filter_Final_FAA_encode, aes(x = distance, y = no_pasg)) +
geom_point(alpha = 0.9,color='blue') +
labs(y = "no_pasg", title = "")
gridExtra::grid.arrange(plot_1, plot_2, plot_3, plot_4, plot_5, plot_6, ncol = 3)We see strong correlation between distance and ground speed and air speed.
model<-lapply(c(1:7), function(i) lm(Filter_Final_FAA_encode[["distance"]]~Filter_Final_FAA[[i]]))
pvalues<-lapply(c(1:7), function(i) summary(model[[i]])$coefficients[2,4])
pvalues<- unlist(pvalues)
#pvalues
table2<-data.frame(colnames(Filter_Final_FAA_encode[,1:7]), pvalues)
names(table2)[1] <- 'Variable'
table2<- table2%>%arrange(pvalues)
table2 Variable pvalues
1 speed_ground 4.766371e-252
2 speed_air 2.500461e-97
3 aircraft 3.526194e-12
4 height 4.123860e-03
5 pitch 1.208124e-02
6 duration 1.514002e-01
7 no_pasg 6.092520e-01
FAA_Stand<-data.frame(scale(Filter_Final_FAA_encode[,1:7]), Filter_Final_FAA_encode$distance)
names(FAA_Stand)[8] <- 'distance'
model_stand<-lapply(c(1:7), function(i) lm(Filter_Final_FAA_encode[["distance"]]~FAA_Stand[[i]]))
coef<-lapply(c(1:7), function(i) summary(model_stand[[i]])$coefficients[2,1])
coef<- unlist(coef)
table3<-data.frame(colnames(Filter_Final_FAA_encode[,1:7]), coef)
names(table3)[1] <- 'Variable'
table3<- table3%>%arrange(desc(abs(coef)))%>%
mutate(coef_direction = ifelse(coef < 0, "-ve","+ve"))
table3 Variable coef coef_direction
1 speed_ground 776.44740 +ve
2 speed_air 774.34654 +ve
3 aircraft -213.45802 -ve
4 height 89.10606 +ve
5 pitch 78.00693 +ve
6 duration -46.48013 -ve
7 no_pasg -15.91595 -ve
table0<-merge(x=table1,y=table2,by="Variable")
table0<-merge(x=table0,y=table3,by="Variable")
table0 Variable Size_Of_Correlation Correlation_Dir pvalues coef
1 aircraft -0.17251928 -ve 3.526194e-12 -213.45802
2 duration 0.05241698 +ve 1.514002e-01 -46.48013
3 height 0.05775639 +ve 4.123860e-03 89.10606
4 no_pasg -0.03258255 -ve 6.092520e-01 -15.91595
5 pitch 0.03402263 +ve 1.208124e-02 78.00693
6 speed_air 0.94321897 +ve 2.500461e-97 774.34654
7 speed_ground 0.92877195 +ve 4.766371e-252 776.44740
coef_direction
1 -ve
2 -ve
3 +ve
4 -ve
5 +ve
6 +ve
7 +ve
model_1 <- lm(distance ~ speed_ground, data = FAA_Stand)
model_2 <- lm(distance ~ speed_air, data = FAA_Stand)
model_3 <- lm(distance ~ speed_ground + speed_air, data = FAA_Stand)
summary(model_1)$coefficients[,1] (Intercept) speed_ground
1522.4829 776.4474
summary(model_2)$coefficients[,1](Intercept) speed_air
2774.6729 774.3465
summary(model_3)$coefficients[,1] (Intercept) speed_ground speed_air
3117.7445 -269.2959 914.8089
GGally::ggpairs(data = FAA_Stand, columns =4:5 )Positive coefficients and significant when we are doing regress individually Negative coefficients for speed ground and p value > 0.05 for model 3 High correlation b/w air speed and ground speed. ~0.98 We should select ground speed. 75% values are missing for air speed.
var_selection <-c()
var_selection[1] <- summary(lm(distance ~ speed_ground , data = FAA_Stand))$r.squared
var_selection[2] <- summary(lm(distance ~ speed_ground + aircraft , data = FAA_Stand))$r.squared
var_selection[3] <- summary(lm(distance ~ speed_ground + aircraft + height, data = FAA_Stand))$r.squared
var_selection[4] <- summary(lm(distance ~ speed_ground + aircraft + height + duration , data = FAA_Stand))$r.squared
var_selection[5] <- summary(lm(distance ~ speed_ground + aircraft + height + duration + pitch , data = FAA_Stand))$r.squared
var_selection[6] <- summary(lm(distance ~ speed_ground + aircraft + height + duration + pitch + no_pasg, data = FAA_Stand))$r.squared
parameters <-c(1:6)
var_selection_result <- data.frame(var_selection,parameters)
var_selection_result var_selection parameters
1 0.7503784 1
2 0.8251319 2
3 0.8488989 3
4 0.8503048 4
5 0.8504184 5
6 0.8506023 6
var_selection_result %>%
ggplot(aes(x = parameters , y = var_selection)) +
geom_line() +
xlab("no. of variables") +
ylab("R-square") +
theme_classic()After 3 variables the r sqr value is not increasing. This implies that new variable addition is not able to explain the variation.
var_selection_adj <-c()
var_selection_adj[1] <- summary(lm(distance ~ speed_ground , data = FAA_Stand))$adj.r.squared
var_selection_adj[2] <- summary(lm(distance ~ speed_ground + aircraft , data = FAA_Stand))$adj.r.squared
var_selection_adj[3] <- summary(lm(distance ~ speed_ground + aircraft + height, data = FAA_Stand))$adj.r.squared
var_selection_adj[4] <- summary(lm(distance ~ speed_ground + aircraft + height + duration , data = FAA_Stand))$adj.r.squared
var_selection_adj[5] <- summary(lm(distance ~ speed_ground + aircraft + height + duration + pitch , data = FAA_Stand))$adj.r.squared
var_selection_adj[6] <- summary(lm(distance ~ speed_ground + aircraft + height + duration + pitch + no_pasg, data = FAA_Stand))$adj.r.squared
parameters <-c(1:6)
var_selection_result_adj <- data.frame(var_selection_adj,parameters)
var_selection_result_adj var_selection_adj parameters
1 0.7500773 1
2 0.8247095 2
3 0.8483508 3
4 0.8495332 4
5 0.8494534 5
6 0.8494442 6
var_selection_result_adj %>%
ggplot(aes(x = parameters , y = var_selection_adj)) +
geom_line() +
xlab("no. of variables") +
ylab("Adjusted R-square") +
theme_classic()aic_method <- c()
aic_method[1] <- AIC(lm(distance ~ speed_ground , data = FAA_Stand))
aic_method[2] <- AIC(lm(distance ~ speed_ground + aircraft , data = FAA_Stand))
aic_method[3] <- AIC(lm(distance ~ speed_ground + aircraft + height, data = FAA_Stand))
aic_method[4] <- AIC(lm(distance ~ speed_ground + aircraft + height + duration, data = FAA_Stand))
aic_method[5] <- AIC(lm(distance ~ speed_ground + aircraft + height + duration + pitch , data = FAA_Stand))
aic_method[6] <- AIC(lm(distance ~ speed_ground + aircraft + height + duration + pitch + no_pasg, data = FAA_Stand))
aic_method[1] 12508.81 12215.05 12095.65 11377.43 11378.84 11379.88
parameters <-c(1:6)
aic_method_selection <- data.frame(aic_method,parameters)
aic_method_selection aic_method parameters
1 12508.81 1
2 12215.05 2
3 12095.65 3
4 11377.43 4
5 11378.84 5
6 11379.88 6
aic_method_selection %>%
ggplot(aes(x = parameters , y = aic_method)) +
geom_line() +
xlab("no. of variables") +
ylab("AIC Value") +
theme_classic()AIC decreases till model with 4 variables. Then its value remains almost the same.As per AIC we should pick 4 variables.
Based on the results we should pick Speed Ground, Aircraft and height