library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(readxl)
library(dplyr)
library(corrr)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
Backgroud: Flight landing.
Motivation: To reduce the risk of landing overrun
Goal: To study what factors and how they would impact the landing distance of a commercial flight.
Data: Landing data (landing distance and other parameters) from 950 commercial flights (not real data set but simulated from statistical models)
Flights_800 <- read_xls("~/Desktop/Subjects/Flex 3/Statistical Modelling/WeeK 1/FAA1-1.xls")
Flights_150 <- read_xls("~/Desktop/Subjects/Flex 3/Statistical Modelling/WeeK 1/FAA2-1.xls")
str(Flights_800)
## Classes 'tbl_df', 'tbl' and 'data.frame': 800 obs. of 8 variables:
## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
Data Set Flights_150: The class of the data set is data frame with 150 observations of 7 variables. The variable aircraft is of the class ‘character variable’ where as the remmaining 6 variables are of the class ‘numeric variables’.
Difference in Data sets: We observe that there is a difference between two data sets ‘Flights_800’ and ‘Flights_150’. The data set ‘Flights_800’ has a variable ‘duration’ which is not present in ‘Flights_150’.
str(Flights_150)
## Classes 'tbl_df', 'tbl' and 'data.frame': 150 obs. of 7 variables:
## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
The data sets are merged using the function ‘rbind’ (row bind). This function is usually used when we want two append two data sets of similar kind.
Before merging the data sets, a column ‘duration’ has been created in the data set ‘Flights_150’ to make both the data sets similar.
After merging the data sets, the combined data set was checked for duplicates without using the column ‘duration’ as it was not present in the data set ‘Flights_150’. The combined data set had 100 duplicates. Those duplicates are removed from the combined data set. The usage of ‘Duplicated’ function was found using the google search. This is the link where it shows the usage of the function.
Flights_150$duration <- NA
flights_final <- rbind(Flights_800, Flights_150)
flights_columns <- flights_final[c("aircraft" , "no_pasg" , "speed_ground" ,"speed_air" , "height" , "pitch" , "distance" )]
flights_final <- flights_final[!duplicated(flights_columns),]
str(flights_final)
## Classes 'tbl_df', 'tbl' and 'data.frame': 850 obs. of 8 variables:
## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
summary(flights_final)
## aircraft duration no_pasg speed_ground
## Length:850 Min. : 14.76 Min. :29.0 Min. : 27.74
## Class :character 1st Qu.:119.49 1st Qu.:55.0 1st Qu.: 65.90
## Mode :character Median :153.95 Median :60.0 Median : 79.64
## Mean :154.01 Mean :60.1 Mean : 79.45
## 3rd Qu.:188.91 3rd Qu.:65.0 3rd Qu.: 92.06
## Max. :305.62 Max. :87.0 Max. :141.22
## NA's :50
## speed_air height pitch distance
## Min. : 90.00 Min. :-3.546 Min. :2.284 Min. : 34.08
## 1st Qu.: 96.25 1st Qu.:23.314 1st Qu.:3.642 1st Qu.: 883.79
## Median :101.15 Median :30.093 Median :4.008 Median :1258.09
## Mean :103.80 Mean :30.144 Mean :4.009 Mean :1526.02
## 3rd Qu.:109.40 3rd Qu.:36.993 3rd Qu.:4.377 3rd Qu.:1936.95
## Max. :141.72 Max. :59.946 Max. :5.927 Max. :6533.05
## NA's :642
The landing aircraft is required to be at least 6 meters high at the threshold of the runway. It is observed that min Height is negative. This must a data recording error as it is not possibe to have negative Height.
If the value of Ground Speed is less than 30MPH or greater than 140MPH, then the landing is considered as abnormal. It is observed that the minimum and maximum values of Ground Speed are 27.74 and 141.22 respectively. It shows that there are abnormal landings in our data set.
If the value of Air Speed is less than 30MPH or greater than 140MPH, then the landing is considered as abnormal. It is observed that th maximum value of Air Speed is 141.72 which is abnormal. Also, there are 642 missing values in the data.
We observe that the minimum value of duration is 14.76 mins which should not be the case as the duration of a normal flight should always be greater than 40 mins.
flights_final <- filter(flights_final, ifelse(is.na(height), TRUE, height >= 6))
flights_final <- filter(flights_final, ifelse(is.na(speed_ground), TRUE, (speed_ground >= 30 & speed_ground <= 140)))
flights_final <- filter(flights_final, ifelse(is.na(speed_air), TRUE, (speed_air >= 30 & speed_air <= 140)))
flights_final <- filter(flights_final, ifelse(is.na(duration), TRUE, duration >= 40 ))
dim(flights_final)
## [1] 832 8
str(flights_final)
## Classes 'tbl_df', 'tbl' and 'data.frame': 832 obs. of 8 variables:
## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
summary(flights_final)
## aircraft duration no_pasg speed_ground
## Length:832 Min. : 41.95 Min. :29.00 Min. : 33.57
## Class :character 1st Qu.:119.65 1st Qu.:55.00 1st Qu.: 66.20
## Mode :character Median :154.26 Median :60.00 Median : 79.83
## Mean :154.73 Mean :60.06 Mean : 79.61
## 3rd Qu.:189.64 3rd Qu.:65.00 3rd Qu.: 91.99
## Max. :305.62 Max. :87.00 Max. :136.66
## NA's :50
## speed_air height pitch distance
## Min. : 90.00 Min. : 6.228 Min. :2.284 Min. : 41.72
## 1st Qu.: 96.25 1st Qu.:23.530 1st Qu.:3.640 1st Qu.: 893.43
## Median :101.15 Median :30.185 Median :4.002 Median :1263.54
## Mean :103.65 Mean :30.474 Mean :4.005 Mean :1528.24
## 3rd Qu.:109.40 3rd Qu.:37.018 3rd Qu.:4.370 3rd Qu.:1937.58
## Max. :136.42 Max. :59.946 Max. :5.927 Max. :6309.95
## NA's :628
flights_final %>% group_by(aircraft) %>% summarize(n())
hist(flights_final$duration)
hist(flights_final$no_pasg)
hist(flights_final$speed_ground)
hist(flights_final$speed_air)
hist(flights_final$height)
hist(flights_final$pitch)
hist(flights_final$distance)
We observe that there are missing values in the variables ‘speed_air’ and ‘duration’ in the cleaned data set.
In our cleaned data set, there are no abnormal observations in the variables duration, speed_ground, speed_air, and height.
We observe that the variables speed_ground, duration, no_pasg, height and pitch follow normal distribution.
The variables speed_air and distance do not follow normal distribution and have their curves skewed towards right. It means that the minimum values in these variables are closer to mean and median while the maximum values are far from the mean and median. As we also observe from the summary that for all the variable except distance, the means and medians are close to each other.
We found that 444 of the total observations are of the type ‘airbus’ and the remaining 388 observations are of the type ‘boeing’
The Table_1 shows the corrlation of “Landing Distance” with all the other variables present in the data set. It was created using the library “corrr”. This was found by using the google and this is its link
Before creating Table_1, a variable “aircraft_num” was introduced to accomodate the character variable “aircraft”
The table is sorted in the descending order of absolute values of correlation. We see that the factor that affect most to the “Landing Distance” are “speed_air” and “speed_ground”.
flights_final$aircraft_num <- ifelse(flights_final$aircraft == "airbus", 1, 0)
Table_1 <- flights_final[,2:9] %>% correlate() %>% focus(distance)
##
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
Table_1$direction_of_correlation <- ifelse(Table_1$distance < 0, "Negative", "Positive")
Table_1 <- rename(Table_1, size_of_correlation = distance)
Table_1 <- rename(Table_1, Variable_Names = rowname)
Table_1 <- arrange(Table_1, desc(abs(Table_1$size_of_correlation)))
Table_1
plot(flights_final$distance, flights_final$speed_air)
plot(flights_final$distance, flights_final$speed_ground)
plot(flights_final$distance, flights_final$height)
plot(flights_final$distance, flights_final$pitch)
plot(flights_final$distance, flights_final$duration)
plot(flights_final$distance, flights_final$no_pasg)
plot(flights_final$distance, flights_final$aircraft_num)
speed_air <- lm(distance ~ speed_air, data = flights_final)
speed_ground <- lm(distance ~ speed_ground, data = flights_final)
aircraft_num <- lm(distance ~ aircraft_num, data = flights_final)
height <- lm(distance ~ height, data = flights_final)
pitch <- lm(distance ~ pitch, data = flights_final)
duration <- lm(distance ~ duration, data = flights_final)
no_pasg <- lm(distance ~ no_pasg, data = flights_final)
variable_names <- c("speed_air", "speed_ground", "aircraft_num", "height", "pitch", "duration", "no_pasg")
p_values <- c(
summary(speed_air)$coefficients[2,4],
summary(speed_ground)$coefficients[2,4],
summary(aircraft_num)$coefficients[2,4],
summary(height)$coefficients[2,4],
summary(pitch)$coefficients[2,4],
summary(duration)$coefficients[2,4],
summary(no_pasg)$coefficients[2,4])
regression_coefs <- c(
summary(speed_air)$coefficients[2,1],
summary(speed_ground)$coefficients[2,1],
summary(aircraft_num)$coefficients[2,1],
summary(height)$coefficients[2,1],
summary(pitch)$coefficients[2,1],
summary(duration)$coefficients[2,1],
summary(no_pasg)$coefficients[2,1])
Table_2 <- data.frame(variable_names, p_values, regression_coefs)
Table_2$direction_of_coefficient <- ifelse(Table_2$regression_coefs < 0, "Negative", "Positive")
Table_2 <- dplyr::select(Table_2, variable_names, p_values, direction_of_coefficient)
Table_2 <- arrange(Table_2, p_values)
Table_2
#Calculating Means
speed_air_bar <- mean(flights_final$speed_air, na.rm = TRUE)
speed_ground_bar <- mean(flights_final$speed_ground)
aircraft_num_bar <- mean(flights_final$aircraft_num)
height_bar <- mean(flights_final$height)
pitch_bar <- mean(flights_final$pitch)
duration_bar <- mean(flights_final$duration, na.rm = TRUE)
no_pasg_bar <- mean(flights_final$no_pasg)
#Calculting Standard Deviations
speed_air_sd <- sd(flights_final$speed_air, na.rm = TRUE)
speed_ground_sd <- sd(flights_final$speed_ground)
aircraft_num_sd <- sd(flights_final$aircraft_num)
height_sd <- sd(flights_final$height)
pitch_sd <- sd(flights_final$pitch)
duration_sd <- sd(flights_final$duration, na.rm = TRUE)
no_pasg_sd <- sd(flights_final$no_pasg)
#Converting to standardised variables
flights_final$speed_air_std <- (flights_final$speed_air- speed_air_bar) * (1/speed_air_sd)
flights_final$speed_ground_std <- (flights_final$speed_ground - speed_ground_bar) * (1/speed_ground_sd )
flights_final$aircraft_num_std <- (flights_final$aircraft_num - aircraft_num_bar) * (1/aircraft_num_sd)
flights_final$height_std <- (flights_final$height - height_bar) * (1/height_sd)
flights_final$pitch_std <- (flights_final$pitch - pitch_bar) * (1/pitch_sd)
flights_final$duration_std <- (flights_final$duration - duration_bar) * (1/duration_sd)
flights_final$no_pasg_std <- (flights_final$no_pasg - no_pasg_bar) * (1/no_pasg_sd)
# Regressing "Landing Distance" on the standardised X variables
speed_air_std <- lm(distance ~ speed_air_std, data = flights_final)
speed_ground_std <- lm(distance ~ speed_ground_std, data = flights_final)
aircraft_num_std <- lm(distance ~ aircraft_num_std, data = flights_final)
height_std <- lm(distance ~ height_std, data = flights_final)
pitch_std <- lm(distance ~ pitch_std, data = flights_final)
duration_std <- lm(distance ~ duration_std, data = flights_final)
no_pasg_std <- lm(distance ~ no_pasg_std, data = flights_final)
variable_names_std <- c("speed_air_std", "speed_ground_std", "aircraft_num_std", "height_std", "pitch_std", "duration_std", "no_pasg_std")
regression_coefs_std <- c(
summary(speed_air_std)$coefficients[2,1],
summary(speed_ground_std)$coefficients[2,1],
summary(aircraft_num_std)$coefficients[2,1],
summary(height_std)$coefficients[2,1],
summary(pitch_std)$coefficients[2,1],
summary(duration_std)$coefficients[2,1],
summary(no_pasg_std)$coefficients[2,1])
Table_3 <- data.frame(variable_names_std, regression_coefs_std)
Table_3$direction_of_coefs <- ifelse(Table_3$regression_coefs_std < 0, "Negative", "Positive")
Table_3
Importance_Ranks <- c(2,1,3,4,5,6,7)
Table_0 <- data.frame(variable_names, Importance_Ranks)
Table_0 <- arrange(Table_0, Importance_Ranks)
Table_0
It is observed that when “Landing Distance” is regressed on “speed_air” and speed_ground" both the predictor variables, there is an erractic change in the values of the regression coefficients. We also observe a sign change (negative) for the regression coefficient of speed ground which was positive when “Landing Distance” was only regressed on it.
The correlation of speed_air and speed_ground is 0.99, which suggests that both these predictor variables are strongly associated with each other.
If we want to select any one of them, we would select the speed_ground. The reason being we have seen the result Tables 1, 2, 3 that it is the most important predictor variable.
model_3 <- lm(distance ~ speed_air + speed_ground, data = flights_final)
summary(speed_air)$coefficients[2,1]
## [1] 81.01572
summary(speed_ground)$coefficients[2,1]
## [1] 41.91088
summary(model_3)$coefficients[2:3,1]
## speed_air speed_ground
## 95.10311 -14.03113
cor(flights_final$speed_ground, flights_final$speed_air)
## [1] NA
corr.test(flights_final[,4:5], use = 'pairwise')
## Call:corr.test(x = flights_final[, 4:5], use = "pairwise")
## Correlation matrix
## speed_ground speed_air
## speed_ground 1.00 0.99
## speed_air 0.99 1.00
## Sample Size
## speed_ground speed_air
## speed_ground 832 204
## speed_air 204 204
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## speed_ground speed_air
## speed_ground 0 0
## speed_air 0 0
##
## To see confidence intervals of the correlations, print with the short=FALSE option
model_01 <- lm(distance ~ speed_ground, data = flights_final)
model_02 <- lm(distance ~ speed_air + speed_ground, data = flights_final)
model_03 <- lm(distance ~ speed_air + speed_ground + aircraft_num, data = flights_final)
model_04 <- lm(distance ~ speed_air + speed_ground + aircraft_num + height, data = flights_final)
model_05 <- lm(distance ~ speed_air + speed_ground + aircraft_num + height + pitch, data = flights_final)
model_06 <- lm(distance ~ speed_air + speed_ground + aircraft_num + height + pitch + duration, data = flights_final)
model_07 <- lm(distance ~ speed_air + speed_ground + aircraft_num + height + pitch + duration + no_pasg, data = flights_final)
r_square <- c(
summary(model_01)$r.squared,
summary(model_02)$r.squared,
summary(model_03)$r.squared,
summary(model_04)$r.squared,
summary(model_05)$r.squared,
summary(model_06)$r.squared,
summary(model_07)$r.squared)
no_of_predictors <- c(1,2,3,4,5,6,7)
r_square_table <- data.frame(r_square, no_of_predictors)
plot(r_square_table$no_of_predictors, r_square_table$r_square)
lines(r_square_table$no_of_predictors, r_square_table$r_square)
adj_r_square <- c(
summary(model_01)$adj.r.squared,
summary(model_02)$adj.r.squared,
summary(model_03)$adj.r.squared,
summary(model_04)$adj.r.squared,
summary(model_05)$adj.r.squared,
summary(model_06)$adj.r.squared,
summary(model_07)$adj.r.squared)
adj_r_square_table <- data.frame(adj_r_square, no_of_predictors)
plot(adj_r_square_table$no_of_predictors, adj_r_square_table$adj_r_square)
lines(adj_r_square_table$no_of_predictors, adj_r_square_table$adj_r_square)
AIC_values <- c(AIC(model_01),
AIC(model_02),
AIC(model_03),
AIC(model_04),
AIC(model_05),
AIC(model_06),
AIC(model_07))
AIC_table <- data.frame(AIC_values, no_of_predictors)
plot(AIC_table$no_of_predictors, AIC_table$AIC_values)
lines(AIC_table$no_of_predictors, AIC_table$AIC_values)
We will be using the stepAIC fundtion in R to check our results in the previous steps. Since, duration and speed_air has NA’s we will be excluding the, from our analysis.
We have used the bi-directional (forward and backward) stepwise regression. The results show that the AIC value is minimum when none of the variables are removed. When we compare results with the one in Step 19, the values are different as stepAIC does not work when there are NA’s in the data. We have to either remove the records with NA values or remove the variables that have NA’s. In this case we have excluded the variables with NA’s and hence the results are varying when compared with step 19.
flights_2 <- dplyr::select(flights_final, speed_ground, aircraft_num, height, pitch, no_pasg, distance)
LM <- lm(distance ~ ., data = flights_2)
fit1_LM <- stepAIC(LM, direction = 'both')
## Start: AIC=9774.95
## distance ~ speed_ground + aircraft_num + height + pitch + no_pasg
##
## Df Sum of Sq RSS AIC
## - no_pasg 1 196930 104014783 9774.5
## <none> 103817853 9775.0
## - pitch 1 306686 104124539 9775.4
## - height 1 16737232 120555085 9897.3
## - aircraft_num 1 42765976 146583829 10060.0
## - speed_ground 1 537515402 641333255 11287.9
##
## Step: AIC=9774.53
## distance ~ speed_ground + aircraft_num + height + pitch
##
## Df Sum of Sq RSS AIC
## <none> 104014783 9774.5
## + no_pasg 1 196930 103817853 9775.0
## - pitch 1 312651 104327433 9775.0
## - height 1 16601650 120616433 9895.7
## - aircraft_num 1 42867444 146882227 10059.6
## - speed_ground 1 537455111 641469893 11286.1