Motivation: To reduce the risk of landing overrun.

Goal: To study which factors affect the landing distance of a commercial flight, and how.

Packages Required

library(tidyverse)  #to visualize, transform, input, tidy and join data
library(dplyr)      #data wrangling
library(stringr)    #string related functions
library(kableExtra) #to create HTML Table
library(DT)         #to preview the data sets
library(lubridate)  #to apply the date functions
library(xlsx)       #to load excel files

The data has the following columns:

Variable      Description
aircraft      make of the aircraft (airbus or boeing)
duration      duration of the flight
no_pasg       number of passengers
speed_ground  ground speed
speed_air     air speed
height        height
pitch         pitch angle
distance      landing distance

Part-1

Exploratory Data Analysis

Step 1. Load the two data sets.

faa1 <- read.xlsx("FAA1.xls", sheetName = "FAA1")
faa2 <- read.xlsx("FAA2_2.xls", sheetName = "Sheet1")

Step 2: Check the structure of each data set using the “str” function. For each data set, what is the sample size and how many variables? Is there any difference between the two data sets?

For FAA1:

str(faa1)
## 'data.frame':    800 obs. of  8 variables:
##  $ aircraft    : Factor w/ 2 levels "airbus","boeing": 2 2 2 2 2 2 2 2 2 2 ...
##  $ duration    : num  98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...

The data has 800 observations and 8 variables.

For FAA2:

str(faa2)
## 'data.frame':    150 obs. of  7 variables:
##  $ aircraft    : Factor w/ 2 levels "airbus","boeing": 2 2 2 2 2 2 2 2 2 2 ...
##  $ no_pasg     : num  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...

The data has 150 observations and 7 variables.

FAA2 does not contain the duration column.

Step 3: Merge the two data sets. Are there any duplications?

faa <- bind_rows(faa1, faa2)

str(faa)
## 'data.frame':    950 obs. of  8 variables:
##  $ aircraft    : Factor w/ 2 levels "airbus","boeing": 2 2 2 2 2 2 2 2 2 2 ...
##  $ duration    : num  98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...
# count the rows that are duplicates of an earlier row once duration is ignored
faa %>% 
  select(-duration) %>% 
  duplicated() %>% 
  sum() 
## [1] 100

There are 100 duplicate rows in total, which I remove below.

check <- faa %>%  
 select(-duration) %>% 
  duplicated() %>% 
  which()

faa <- faa[-check,]
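
The duplicate check above can be sketched on a small example. The data frame `toy` below is hypothetical (it is not the FAA data), and the sketch uses base R only. It flags rows that repeat an earlier row once duration is ignored, then drops them with a logical index, which also behaves sensibly when there are no duplicates at all.

```r
# Hypothetical toy data: row 4 matches row 1 on everything except duration.
toy <- data.frame(
  aircraft     = c("boeing", "airbus", "boeing", "boeing"),
  duration     = c(98.5, 120.0, 112.0, NA),
  speed_ground = c(107.9, 101.7, 71.1, 107.9),
  distance     = c(3370, 2988, 1145, 3370)
)

# Compare rows on every column except duration.
key_cols <- setdiff(names(toy), "duration")
is_dup <- duplicated(toy[, key_cols])

# Logical indexing keeps all rows when nothing is duplicated, whereas
# toy[-which(is_dup), ] would return zero rows in that situation.
toy_dedup <- toy[!is_dup, ]
nrow(toy_dedup)
```

Because `duplicated()` marks the later copy, the row that survives is the one that still carries a duration value.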

Step 4. Check the structure of the combined data set. What is the sample size and how many variables? Provide summary statistics for each variable.

str(faa)
## 'data.frame':    850 obs. of  8 variables:
##  $ aircraft    : Factor w/ 2 levels "airbus","boeing": 2 2 2 2 2 2 2 2 2 2 ...
##  $ duration    : num  98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...

Data has 850 observations and 8 variables.

summary(faa)
##    aircraft      duration         no_pasg      speed_ground   
##  airbus:450   Min.   : 14.76   Min.   :29.0   Min.   : 27.74  
##  boeing:400   1st Qu.:119.49   1st Qu.:55.0   1st Qu.: 65.90  
##               Median :153.95   Median :60.0   Median : 79.64  
##               Mean   :154.01   Mean   :60.1   Mean   : 79.45  
##               3rd Qu.:188.91   3rd Qu.:65.0   3rd Qu.: 92.06  
##               Max.   :305.62   Max.   :87.0   Max.   :141.22  
##               NA's   :50                                      
##    speed_air          height           pitch          distance      
##  Min.   : 90.00   Min.   :-3.546   Min.   :2.284   Min.   :  34.08  
##  1st Qu.: 96.25   1st Qu.:23.314   1st Qu.:3.642   1st Qu.: 883.79  
##  Median :101.15   Median :30.093   Median :4.008   Median :1258.09  
##  Mean   :103.80   Mean   :30.144   Mean   :4.009   Mean   :1526.02  
##  3rd Qu.:109.40   3rd Qu.:36.993   3rd Qu.:4.377   3rd Qu.:1936.95  
##  Max.   :141.72   Max.   :59.946   Max.   :5.927   Max.   :6533.05  
##  NA's   :642

Step 5. Key findings-

I observed that a few of the variables contain implausible values, possibly because of problems with data capture or data entry. For example:

  1. height has a negative minimum value.

  2. The minimum distance is 34, which is too small for a landing.

  3. speed_air is missing for about 75% of the observations (642 of 850).

  4. The minimum flight duration is about 15 minutes, which does not seem right.

  5. The data had 100 duplicate records after merging the two data sets.

Part-2

Data Cleaning and further exploration

Step 6. Are there abnormal values in the data set? Please refer to the variable dictionary for criteria defining “normal/abnormal” values. Remove the rows that contain any “abnormal values” and report how many rows you have removed.

Based on the summary above, I treat speed_air as being within its threshold, so I do not filter on it.

faa_check <- faa %>% 
  filter((duration > 40| is.na(duration)) & (speed_ground >= 30) & (speed_ground <= 140) &
           (height >= 6) & (distance < 6000)) 
dim(faa_check)
## [1] 831   8
faa <- faa_check

A total of 19 observations are abnormal under these criteria and are removed (850 - 831 = 19).
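
To see which rule is responsible for the removals, each criterion can be evaluated separately before they are combined. The sketch below assumes the same thresholds as the filter above and uses a hypothetical five-row data frame `toy`, so the counts are illustrative only.

```r
# Hypothetical toy data: rows 2 to 5 each break exactly one rule.
toy <- data.frame(
  duration     = c(98.5, 35.0, NA, 150.2, 120.0),
  speed_ground = c(107.9, 80.0, 25.0, 90.0, 85.0),
  height       = c(27.4, 30.0, 20.0, 4.0, 33.0),
  distance     = c(3370, 1200, 900, 1500, 6500)
)

# One logical column per rule; a missing duration is treated as acceptable.
rules <- with(toy, cbind(
  duration_ok     = duration > 40 | is.na(duration),
  speed_ground_ok = speed_ground >= 30 & speed_ground <= 140,
  height_ok       = height >= 6,
  distance_ok     = distance < 6000
))

# Number of rows violating each rule.
violations <- colSums(!rules)
violations

# Keep only the rows that satisfy every rule.
keep <- rowSums(rules) == ncol(rules)
toy_clean <- toy[keep, ]
nrow(toy) - nrow(toy_clean)  # rows removed
```

Reporting the per-rule counts alongside the total makes the cleaning step easier to audit.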

Step 7. Summary

str(faa)
## 'data.frame':    831 obs. of  8 variables:
##  $ aircraft    : Factor w/ 2 levels "airbus","boeing": 2 2 2 2 2 2 2 2 2 2 ...
##  $ duration    : num  98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...
summary(faa)
##    aircraft      duration         no_pasg       speed_ground   
##  airbus:444   Min.   : 41.95   Min.   :29.00   Min.   : 33.57  
##  boeing:387   1st Qu.:119.63   1st Qu.:55.00   1st Qu.: 66.20  
##               Median :154.28   Median :60.00   Median : 79.79  
##               Mean   :154.78   Mean   :60.06   Mean   : 79.54  
##               3rd Qu.:189.66   3rd Qu.:65.00   3rd Qu.: 91.91  
##               Max.   :305.62   Max.   :87.00   Max.   :132.78  
##               NA's   :50                                       
##    speed_air          height           pitch          distance      
##  Min.   : 90.00   Min.   : 6.228   Min.   :2.284   Min.   :  41.72  
##  1st Qu.: 96.23   1st Qu.:23.530   1st Qu.:3.640   1st Qu.: 893.28  
##  Median :101.12   Median :30.167   Median :4.001   Median :1262.15  
##  Mean   :103.48   Mean   :30.458   Mean   :4.005   Mean   :1522.48  
##  3rd Qu.:109.36   3rd Qu.:37.004   3rd Qu.:4.370   3rd Qu.:1936.63  
##  Max.   :132.91   Max.   :59.946   Max.   :5.927   Max.   :5381.96  
##  NA's   :628

The data has 831 observations and 8 variables.

duration is missing for 50 observations, which needs to be addressed. I replace the missing values with the mean of the observed durations.

faa$duration_corrected <- NA
faa <-  transform(faa, duration_corrected = ifelse(is.na(faa$duration), mean(faa$duration, na.rm=TRUE), faa$duration))
summary(faa)
##    aircraft      duration         no_pasg       speed_ground   
##  airbus:444   Min.   : 41.95   Min.   :29.00   Min.   : 33.57  
##  boeing:387   1st Qu.:119.63   1st Qu.:55.00   1st Qu.: 66.20  
##               Median :154.28   Median :60.00   Median : 79.79  
##               Mean   :154.78   Mean   :60.06   Mean   : 79.54  
##               3rd Qu.:189.66   3rd Qu.:65.00   3rd Qu.: 91.91  
##               Max.   :305.62   Max.   :87.00   Max.   :132.78  
##               NA's   :50                                       
##    speed_air          height           pitch          distance      
##  Min.   : 90.00   Min.   : 6.228   Min.   :2.284   Min.   :  41.72  
##  1st Qu.: 96.23   1st Qu.:23.530   1st Qu.:3.640   1st Qu.: 893.28  
##  Median :101.12   Median :30.167   Median :4.001   Median :1262.15  
##  Mean   :103.48   Mean   :30.458   Mean   :4.005   Mean   :1522.48  
##  3rd Qu.:109.36   3rd Qu.:37.004   3rd Qu.:4.370   3rd Qu.:1936.63  
##  Max.   :132.91   Max.   :59.946   Max.   :5.927   Max.   :5381.96  
##  NA's   :628                                                        
##  duration_corrected
##  Min.   : 41.95    
##  1st Qu.:122.67    
##  Median :154.78    
##  Mean   :154.78    
##  3rd Qu.:186.37    
##  Max.   :305.62    
## 
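
The effect of mean imputation can be checked on a small vector. The values below are hypothetical. The sketch shows that the mean is unchanged while the spread shrinks, which matches the summary above, where the quartiles of duration_corrected move toward the mean (1st Qu. 119.63 to 122.67, 3rd Qu. 189.66 to 186.37).

```r
# Hypothetical durations with two missing values.
duration <- c(98.5, 125.7, NA, 196.8, NA, 90.1)

# Replace each NA with the mean of the observed values.
duration_mean <- mean(duration, na.rm = TRUE)
duration_corrected <- ifelse(is.na(duration), duration_mean, duration)

# The mean is preserved by construction.
c(before = duration_mean, after = mean(duration_corrected))

# The standard deviation shrinks, because the imputed points add
# no squared deviation but increase the sample size.
c(before = sd(duration, na.rm = TRUE), after = sd(duration_corrected))
```

This shrinkage is one reason mean imputation can understate variability in later models.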

Step 8. Plotting histogram for all the variables.

#hist(faa$duration_impute, main = "Histogram of Duration", xlab = "Duration")
hist(faa$speed_ground, main = "Histogram of Ground Speed", xlab = "Ground Speed")

hist(faa$height, main = "Histogram of Height", xlab = "Height")

hist(faa$pitch, main = "Histogram of Pitch", xlab = "Pitch")

hist(faa$no_pasg, main = "Histogram of No. of Passengers", xlab = "No. of Passengers")

hist(faa$speed_air, main = "Histogram of Air Speed", xlab = "Air Speed")

hist(faa$distance, main = "Histogram of Landing Distance", xlab = "Landing Distance")

hist(faa$duration_corrected, main = "Histogram of Duration of flight", xlab = "Flight Duration in mins")

Step 9. Key finding:

After cleaning the data, I observed that:

  1. There were 19 abnormal observations in total, which were removed.

  2. duration has 50 NA values, which were replaced with the mean of the observed durations.

  3. speed_air is right-skewed, whereas most of the other variables appear approximately normally distributed. The summary also suggests that distance is right-skewed (mean 1522 versus median 1262).

  4. The minimum recorded speed_air is 90 MPH.

Part-3

Initial analysis for identifying important factors that impact the response variable “landing distance”

Step 10. Pairwise Correlation

cor_duration <- cor(faa$distance, faa$duration_corrected)
cor_speed_ground <- cor(faa$distance,faa$speed_ground)
cor_height <- cor(faa$distance,faa$height)
cor_pitch <- cor(faa$distance,faa$pitch)
cor_no_pasg <- cor(faa$distance,faa$no_pasg)
cor_speed_air <- cor.test(faa$distance,faa$speed_air,method="pearson")$estimate
cor_aircraft <- cor(faa$distance,as.numeric(faa$aircraft ))

variable_names <- c("Duration","Ground Speed","Height","Pitch","No. of Passengers","Air Speed","Aircraft")
correlation <- c(cor_duration,cor_speed_ground,cor_height,cor_pitch,cor_no_pasg,cor_speed_air,cor_aircraft)

table_1 <- data.frame(variable_names,correlation)

table_1$direction <- ifelse(table_1$correlation > 0, "Positive","Negative")

table_1 <- table_1 %>% arrange(desc(correlation))
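
The same correlation table can be built without one line per variable. This is a minimal base R sketch on hypothetical simulated data (the data frame `toy` and its values are assumptions). It uses `use = "complete.obs"` so that a partly missing column such as speed_air is handled in the same call as the others.

```r
set.seed(1)

# Hypothetical toy data; speed_air is mostly missing, as in the FAA data.
toy <- data.frame(
  distance     = rnorm(20, 1500, 400),
  speed_ground = rnorm(20, 80, 15),
  speed_air    = c(rnorm(5, 100, 8), rep(NA, 15)),
  aircraft     = factor(rep(c("airbus", "boeing"), 10))
)

predictors <- setdiff(names(toy), "distance")

# Correlate distance with each predictor, dropping incomplete pairs.
# Factors are converted to their integer codes, as in the code above.
correlation <- unname(sapply(predictors, function(v) {
  cor(toy$distance, as.numeric(toy[[v]]), use = "complete.obs")
}))

table_sketch <- data.frame(
  variable_names = predictors,
  correlation    = correlation,
  direction      = ifelse(correlation > 0, "Positive", "Negative")
)

# Sort from the most positive to the most negative correlation.
table_sketch[order(-table_sketch$correlation), ]
```

Sorting by `abs(correlation)` instead would rank variables by strength rather than by sign.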

Step 11. Show X-Y scatter plots

faa <- faa[-2]  # drop the original duration column (column 2); duration_corrected is kept
GGally::ggpairs(
  data = faa
)

The plots seem pretty consistent with the correlation values.

Step 12. Encoding aircraft type

GGally::ggpairs(
  data = faa,  diag = list(continuous =
  "densityDiag", discrete = "barDiag", na = "naDiag")
)

Part-4

Regression using a single factor each time

Step 13. Regress Y (landing distance) on each of the X variables.

mdl_duration <- lm (faa$distance ~ faa$duration_corrected)
mdl_speedgrnd <- lm (faa$distance ~ faa$speed_ground)
mdl_height <- lm (faa$distance ~ faa$height)
mdl_pitch <- lm (faa$distance ~ faa$pitch)
mdl_nopasg <- lm (faa$distance ~ faa$no_pasg)
mdl_speedair <- lm (faa$distance ~ faa$speed_air)
mdl_aircraft <- lm (faa$distance ~ faa$aircraft)

duration <- summary(mdl_duration)$coef[2,c(1,4)]
speed_ground <- summary(mdl_speedgrnd)$coef[2,c(1,4)]
height <- summary(mdl_height)$coef[2,c(1,4)]
pitch <- summary(mdl_pitch)$coef[2,c(1,4)]
no_pasg <- summary(mdl_nopasg)$coef[2,c(1,4)]
speed_air <- summary(mdl_speedair)$coef[2,c(1,4)]
aircraft_boeing <- summary(mdl_aircraft)$coef[2,c(1,4)]
aircraft_airbus <- summary(mdl_aircraft)$coef[1,c(1,4)]

variable_names <- c("Duration","Ground Speed","Height","Pitch","No. of Passengers","Air Speed","Aircraft-Boeing", "Aircraft-Airbus")

slope <- c(duration[1], speed_ground[1], height[1], pitch[1], no_pasg[1],speed_air[1],aircraft_boeing[1],aircraft_airbus[1])

slope <- round(slope, digits = 3)

p_value <- c(duration[2], speed_ground[2], height[2], pitch[2], no_pasg[2],speed_air[2],aircraft_boeing[2],aircraft_airbus[2]) 

p_value <- round(p_value, digits = 3)

table_2 <- data.frame(variable_names, slope, p_value)

table_2$slope_direction <- ifelse(slope > 0 , "Positive", "Negative")

table_2 <- table_2 %>% 
           select(variable_names, p_value, slope_direction) %>% 
           arrange(p_value)

table_2
##      variable_names p_value slope_direction
## 1      Ground Speed   0.000        Positive
## 2         Air Speed   0.000        Positive
## 3   Aircraft-Boeing   0.000        Positive
## 4   Aircraft-Airbus   0.000        Positive
## 5            Height   0.004        Positive
## 6             Pitch   0.012        Positive
## 7          Duration   0.148        Negative
## 8 No. of Passengers   0.609        Negative

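All the factors are significant at the 5% level except for duration and number of passengers. Note that the "Aircraft-Airbus" row is taken from the intercept of the aircraft model, so it is the mean landing distance for Airbus (the reference level) rather than a slope.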
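
The single-factor regressions can also be run in a loop, which avoids repeating the extraction code for every variable. The sketch below is base R on hypothetical simulated data (the names `toy` and `table_sketch` are assumptions). It passes `data =` to `lm()` and selects the coefficient columns by name rather than by position.

```r
set.seed(42)
n <- 50

# Hypothetical toy data with a strong speed effect and a weak height effect.
toy <- data.frame(
  speed_ground = rnorm(n, 80, 15),
  height       = rnorm(n, 30, 10)
)
toy$distance <- 40 * toy$speed_ground + 10 * toy$height + rnorm(n, 0, 300)

predictors <- c("speed_ground", "height")

# Fit distance ~ x for each predictor and keep the slope and its p-value.
rows <- lapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "distance"), data = toy)
  est <- summary(fit)$coefficients[2, c("Estimate", "Pr(>|t|)")]
  data.frame(
    variable_names = v,
    slope          = unname(est[1]),
    p_value        = unname(est[2])
  )
})

table_sketch <- do.call(rbind, rows)
table_sketch$slope_direction <- ifelse(table_sketch$slope > 0, "Positive", "Negative")
table_sketch[order(table_sketch$p_value), ]
```

For a factor such as aircraft, row 2 of the coefficient table is the contrast with the reference level, so only one row per factor is a slope in this sense.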

Step 14. Standardize each X variable

faa_adj <- faa

faa_adj$duration <- scale(faa_adj$duration_corrected, center = TRUE, scale = TRUE)
faa_adj$speed_ground <- scale(faa_adj$speed_ground, center = TRUE, scale = TRUE)
faa_adj$height <- scale(faa_adj$height, center = TRUE, scale = TRUE)
faa_adj$pitch <- scale(faa_adj$pitch, center = TRUE, scale = TRUE)
faa_adj$no_pasg <- scale(faa_adj$no_pasg, center = TRUE, scale = TRUE)
faa_adj$speed_air <- scale(faa_adj$speed_air, center = TRUE, scale = TRUE)

mdl_duration <- lm (faa_adj$distance ~ faa_adj$duration_corrected)
mdl_speedgrnd <- lm (faa_adj$distance ~ faa_adj$speed_ground)
mdl_height <- lm (faa_adj$distance ~ faa_adj$height)
mdl_pitch <- lm (faa_adj$distance ~ faa_adj$pitch)
mdl_nopasg <- lm (faa_adj$distance ~ faa_adj$no_pasg)
mdl_speedair <- lm (faa_adj$distance ~ faa_adj$speed_air)
mdl_aircraft <- lm (faa_adj$distance ~ faa_adj$aircraft)

duration <- summary(mdl_duration)$coef[2,c(1,4)]
speed_ground <- summary(mdl_speedgrnd)$coef[2,c(1,4)]
height <- summary(mdl_height)$coef[2,c(1,4)]
pitch <- summary(mdl_pitch)$coef[2,c(1,4)]
no_pasg <- summary(mdl_nopasg)$coef[2,c(1,4)]
speed_air <- summary(mdl_speedair)$coef[2,c(1,4)]
aircraft_boeing <- summary(mdl_aircraft)$coef[2,c(1,4)]
aircraft_airbus <- summary(mdl_aircraft)$coef[1,c(1,4)]

variable_names <- c("Duration","Ground Speed","Height","Pitch","No. of Passengers","Air Speed","Aircraft-Boeing", "Aircraft-Airbus")

slope <- c(duration[1], speed_ground[1], height[1], pitch[1], no_pasg[1],speed_air[1],aircraft_boeing[1],aircraft_airbus[1])

slope <- round(slope, digits = 3)

p_value <- c(duration[2], speed_ground[2], height[2], pitch[2], no_pasg[2],speed_air[2],aircraft_boeing[2],aircraft_airbus[2]) 

p_value <- round(p_value, digits = 3)

table_3 <- data.frame(variable_names, slope, p_value)

table_3$slope_direction <- ifelse(slope > 0 , "Positive", "Negative")

table_3 <- table_3 %>% 
           select(variable_names, slope, slope_direction) %>% 
           arrange(desc(slope))

table_3
##      variable_names    slope slope_direction
## 1   Aircraft-Airbus 1323.317        Positive
## 2      Ground Speed  776.447        Positive
## 3         Air Speed  774.347        Positive
## 4   Aircraft-Boeing  427.666        Positive
## 5            Height   89.106        Positive
## 6             Pitch   78.007        Positive
## 7          Duration   -0.961        Negative
## 8 No. of Passengers  -15.916        Negative

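Standardizing changes the magnitude of the slopes, which makes them comparable across variables, but it does not change their signs or their significance. Note that the duration model above is still fitted on the unscaled duration_corrected column, so its slope (-0.961) is not on the standardized scale.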
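
When a single predictor is standardized, the regression slope equals the correlation multiplied by the standard deviation of the response. In the tables above this is visible for the complete-data variables: 776.447 / 0.8662 and 89.106 / 0.0994 both give about 896, the standard deviation of distance. The sketch below checks the identity on hypothetical simulated data.

```r
set.seed(7)

# Hypothetical data: y depends linearly on x plus noise.
x <- rnorm(100, 80, 15)
y <- 40 * x + rnorm(100, 0, 400)

# Standardize x to mean 0 and standard deviation 1.
x_std <- as.numeric(scale(x))

fit_raw <- lm(y ~ x)
fit_std <- lm(y ~ x_std)

slope_raw <- unname(coef(fit_raw)[2])
slope_std <- unname(coef(fit_std)[2])

# Standardized slope = correlation * sd(y) = raw slope * sd(x).
all.equal(slope_std, cor(x, y) * sd(y))
all.equal(slope_std, slope_raw * sd(x))

# The t statistic, and hence the p-value, is unaffected by rescaling x.
t_raw <- summary(fit_raw)$coefficients[2, "t value"]
t_std <- summary(fit_std)$coefficients[2, "t value"]
all.equal(t_raw, t_std)
```

This is why ranking by standardized slope and ranking by correlation agree for variables measured on the same rows.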

Step 15. Creating Table 0

table_0 <- merge(table_1, table_2) %>% 
  merge(table_3[,-3], by = "variable_names") %>% 
  arrange(desc(slope))
table_0
##      variable_names correlation direction p_value slope_direction   slope
## 1      Ground Speed  0.86624383  Positive   0.000        Positive 776.447
## 2         Air Speed  0.94209714  Positive   0.000        Positive 774.347
## 3            Height  0.09941121  Positive   0.004        Positive  89.106
## 4             Pitch  0.08702846  Positive   0.012        Positive  78.007
## 5          Duration -0.05026941  Negative   0.148        Negative  -0.961
## 6 No. of Passengers -0.01775663  Negative   0.609        Negative -15.916

Part-5

Check collinearity

Step 16. Checking Collinearity

We see that both speed_air and speed_ground are closely related to the landing distance.

model_1 <- lm(distance ~ speed_ground,data=faa)
model_2 <- lm(distance ~ speed_air,data=faa)
model_3 <- lm(distance ~ speed_ground + speed_air,data=faa)

summary(model_1)
## 
## Call:
## lm(formula = distance ~ speed_ground, data = faa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -897.09 -319.16  -72.09  210.83 1798.88 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1773.9407    67.8388  -26.15   <2e-16 ***
## speed_ground    41.4422     0.8302   49.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 448.1 on 829 degrees of freedom
## Multiple R-squared:  0.7504, Adjusted R-squared:  0.7501 
## F-statistic:  2492 on 1 and 829 DF,  p-value: < 2.2e-16
summary(model_2)
## 
## Call:
## lm(formula = distance ~ speed_air, data = faa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -776.21 -196.39    8.72  209.17  624.34 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5455.709    207.547  -26.29   <2e-16 ***
## speed_air      79.532      1.997   39.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.3 on 201 degrees of freedom
##   (628 observations deleted due to missingness)
## Multiple R-squared:  0.8875, Adjusted R-squared:  0.887 
## F-statistic:  1586 on 1 and 201 DF,  p-value: < 2.2e-16
summary(model_3)
## 
## Call:
## lm(formula = distance ~ speed_ground + speed_air, data = faa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -819.74 -202.02    3.52  211.25  636.25 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -5462.28     207.48 -26.327  < 2e-16 ***
## speed_ground   -14.37      12.68  -1.133    0.258    
## speed_air       93.96      12.89   7.291 6.99e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 200 degrees of freedom
##   (628 observations deleted due to missingness)
## Multiple R-squared:  0.8883, Adjusted R-squared:  0.8871 
## F-statistic:   795 on 2 and 200 DF,  p-value: < 2.2e-16

Because the two predictors are highly correlated, regressing on both together makes speed_ground insignificant (p = 0.258): the variation it would explain is already explained by speed_air, so it no longer contributes to the fit.

cor.test(faa$speed_ground, faa$speed_air, method = "pearson")$estimate
##       cor 
## 0.9879383

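The two speed variables have a correlation of about 0.99. I would keep speed_air in my model because its single-variable model has the higher R^2 and adjusted R^2 (0.8875 versus 0.7504 for speed_ground; compare models 1 and 2), although the two models are fitted on different numbers of rows (203 versus 831), so the comparison is only indicative. I therefore treat speed_air as the more important contributor.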
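
A common way to quantify this collinearity is the variance inflation factor (VIF), which the analysis above does not compute. It is added here only as an illustration. With two predictors, the R^2 from regressing one on the other is the squared correlation, so the VIF follows directly from the value reported above.

```r
# Correlation between speed_ground and speed_air, as reported above.
r <- 0.9879383

# With two predictors, R^2 of one regressed on the other is r^2.
r_squared <- r^2

# Variance inflation factor: 1 / (1 - R^2).
vif <- 1 / (1 - r_squared)
round(vif, 1)  # about 41.7
```

A VIF this far above the usual rule-of-thumb cutoffs of 5 or 10 is consistent with the large standard errors seen in model_3.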

Part-6

Variable selection based on our ranking in Table 0.

Step 17. R squared vs No. of variables

model0 <- lm(distance ~ 1,data=faa)
model1 <- lm(distance ~ speed_air,data=faa)
model2 <- lm(distance ~ speed_air + aircraft,data=faa)
model3 <- lm(distance ~ speed_air + aircraft + height ,data=faa)
model4 <- lm(distance ~ speed_air + aircraft + height + no_pasg  ,data=faa)
model5 <- lm(distance ~ speed_air + aircraft + height + no_pasg + duration_corrected ,data=faa)
model6 <- lm(distance ~ speed_air + aircraft + height + no_pasg + duration_corrected + pitch,data=faa)

model0.rsqr <- summary(model0)$r.squared
model1.rsqr <- summary(model1)$r.squared
model2.rsqr <- summary(model2)$r.squared
model3.rsqr <- summary(model3)$r.squared
model4.rsqr <- summary(model4)$r.squared
model5.rsqr <- summary(model5)$r.squared
model6.rsqr <- summary(model6)$r.squared

rsquare <- cbind(c(model0.rsqr, model1.rsqr, model2.rsqr, model3.rsqr, model4.rsqr, model5.rsqr, model6.rsqr), 0:6) 

 colnames(rsquare) <- c("rsquare","variables") 

 rsquare <-  as.data.frame(rsquare)
 
 rsquare %>% 
   ggplot(aes(x = variables, y = rsquare)) + 
   geom_line() + 
   xlab("no. of variables") +
   ylab("R-square") +
   theme_classic()

As more variables are added, R^2 also increases; it can never decrease when a term is added to a nested model.

Step 18. Adjusted R squared vs No. of variables

model0.rsqr <- summary(model0)$adj.r.squared
model1.rsqr <- summary(model1)$adj.r.squared
model2.rsqr <- summary(model2)$adj.r.squared
model3.rsqr <- summary(model3)$adj.r.squared
model4.rsqr <- summary(model4)$adj.r.squared
model5.rsqr <- summary(model5)$adj.r.squared
model6.rsqr <- summary(model6)$adj.r.squared

rsquare <- cbind(c(model0.rsqr, model1.rsqr, model2.rsqr, model3.rsqr, model4.rsqr, model5.rsqr, model6.rsqr), 0:6) 

 colnames(rsquare) <- c("adj_rsquare","variables") 

 rsquare <-  as.data.frame(rsquare)
 
 rsquare %>% 
   ggplot(aes(x = variables, y = adj_rsquare)) + 
   geom_line() + 
   xlab("no. of variables") +
   ylab("Adjusted R-square") +
   theme_classic()

We see that Adjusted R^2 increases initially but then it slowly starts declining after 3 variables have been added.

Step 19. Model with AIC values

model0_AIC <- AIC(model0)
model1_AIC <- AIC(model1)
model2_AIC <- AIC(model2)
model3_AIC <- AIC(model3)
model4_AIC <- AIC(model4)
model5_AIC <- AIC(model5)
model6_AIC <- AIC(model6)

AIC <- cbind(c(model0_AIC, model1_AIC, model2_AIC, model3_AIC, model4_AIC, model5_AIC, model6_AIC),0:6)

colnames(AIC) <- c("AIC","variables")

AIC <- as.data.frame(AIC)

AIC %>% 
  ggplot(aes(x = variables, y = AIC)) +
  geom_line() +
  xlab("no. of variables")+
  ylab("AIC") +
  theme_classic()

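The smaller the AIC, the better the model, and AIC decreases as the first variables are added. After three variables, however, the decrease is small and AIC starts to go up again. Note that model0 is fitted on all 831 rows while the other models use only the 203 rows with a recorded speed_air, so the AIC at zero variables is not directly comparable with the rest.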
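
The three criteria from Steps 17 to 19 can be collected in a single pass over a list of nested models. The sketch below uses the built-in mtcars data and an arbitrary variable order, both chosen purely for illustration. Because mtcars has no missing values, every model is fitted to the same rows, which is what makes the AIC values comparable.

```r
# Illustrative order in which variables enter the model (an assumption).
ordered_terms <- c("wt", "hp", "qsec", "drat")

# Fit the intercept-only model, then add one term at a time.
fits <- lapply(0:length(ordered_terms), function(k) {
  rhs <- if (k == 0) "1" else ordered_terms[seq_len(k)]
  lm(reformulate(rhs, response = "mpg"), data = mtcars)
})

# One row per model with all three selection criteria.
criteria <- data.frame(
  variables   = 0:length(ordered_terms),
  rsquare     = sapply(fits, function(f) summary(f)$r.squared),
  adj_rsquare = sapply(fits, function(f) summary(f)$adj.r.squared),
  AIC         = sapply(fits, AIC)
)
criteria
```

The resulting data frame can be passed to the same ggplot code as above, once per criterion column.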

Step 20. Suitable model

In my judgment, the most suitable model is:

final_model <- lm(distance ~ speed_air + aircraft + height, data = faa)  # same formula as model3
summary(final_model)
## 
## Call:
## lm(formula = distance ~ speed_air + aircraft + height, data = faa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -300.74  -94.78   15.47   97.09  330.41 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -6390.376    109.839  -58.18   <2e-16 ***
## speed_air         82.148      0.976   84.17   <2e-16 ***
## aircraftboeing   427.442     19.173   22.29   <2e-16 ***
## height            13.702      1.007   13.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 134.2 on 199 degrees of freedom
##   (628 observations deleted due to missingness)
## Multiple R-squared:  0.9737, Adjusted R-squared:  0.9733 
## F-statistic:  2458 on 3 and 199 DF,  p-value: < 2.2e-16

The R^2 and adjusted R^2 are close to each other and around 97%, which is quite high.

Part-7

Variable selection based on an automated algorithm.

Step 21. Using “stepAIC” to perform forward variable selection

Since stepAIC does not work with NA values, I remove the rows containing NA values before running the algorithm.

faaNoNA <- na.exclude(faa)
#nrow(faaNoNA)
model01 <- lm(distance ~ 1,data=faaNoNA)
model11 <- lm(distance ~ speed_air,data=faaNoNA)
model21 <- lm(distance ~ speed_air + aircraft,data=faaNoNA)
model31 <- lm(distance ~ speed_air + aircraft + height ,data=faaNoNA)
model41 <- lm(distance ~ speed_air + aircraft + height + no_pasg  ,data=faaNoNA)
model51 <- lm(distance ~ speed_air + aircraft + height + no_pasg + duration_corrected ,data=faaNoNA)
model61 <- lm(distance ~ speed_air + aircraft + height + no_pasg + duration_corrected + pitch,data=faaNoNA)

MASS::stepAIC(model01,direction="forward",scope=list(upper=model61,lower=model01))
## Start:  AIC=2725.93
## distance ~ 1
## 
##                      Df Sum of Sq       RSS    AIC
## + speed_air           1 121121738  15346230 2284.3
## + aircraft            1   4417704 132050265 2721.2
## <none>                            136467968 2725.9
## + height              1    633675 135834293 2727.0
## + duration_corrected  1    353842 136114126 2727.4
## + pitch               1    244706 136223262 2727.6
## + no_pasg             1    217724 136250245 2727.6
## 
## Step:  AIC=2284.33
## distance ~ speed_air
## 
##                      Df Sum of Sq      RSS    AIC
## + aircraft            1   8424832  6921399 2124.7
## + height              1   2803377 12542854 2245.4
## + pitch               1    860427 14485804 2274.6
## + no_pasg             1    159095 15187136 2284.2
## <none>                            15346230 2284.3
## + duration_corrected  1     11938 15334292 2286.2
## 
## Step:  AIC=2124.7
## distance ~ speed_air + aircraft
## 
##                      Df Sum of Sq     RSS    AIC
## + height              1   3335147 3586252 1993.2
## <none>                            6921399 2124.7
## + duration_corrected  1     61095 6860304 2124.9
## + no_pasg             1     56195 6865204 2125.0
## + pitch               1      2584 6918815 2126.6
## 
## Step:  AIC=1993.22
## distance ~ speed_air + aircraft + height
## 
##                      Df Sum of Sq     RSS    AIC
## + no_pasg             1     53331 3532921 1992.2
## <none>                            3586252 1993.2
## + duration_corrected  1     12912 3573340 1994.5
## + pitch               1       174 3586078 1995.2
## 
## Step:  AIC=1992.18
## distance ~ speed_air + aircraft + height + no_pasg
## 
##                      Df Sum of Sq     RSS    AIC
## <none>                            3532921 1992.2
## + duration_corrected  1    9532.8 3523389 1993.6
## + pitch               1     333.8 3532588 1994.2
## 
## Call:
## lm(formula = distance ~ speed_air + aircraft + height + no_pasg, 
##     data = faaNoNA)
## 
## Coefficients:
##    (Intercept)       speed_air  aircraftboeing          height  
##      -6248.753          82.131         425.592          13.696  
##        no_pasg  
##         -2.316
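
The AIC values printed by stepAIC (for example 2725.93 for the null model) are on a different scale from those returned by `AIC()`. For linear models, stepAIC relies on `extractAIC()`, which drops constants that do not depend on the model. The sketch below uses the built-in mtcars data for illustration and checks that the two differ by a fixed offset, so model rankings on the same data are unaffected.

```r
# Any linear model will do; mtcars is used for illustration only.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nobs(fit)

aic_full <- AIC(fit)            # based on the full Gaussian log-likelihood
aic_step <- extractAIC(fit)[2]  # n * log(RSS / n) + 2 * (number of coefficients)

# The difference depends only on n: the likelihood constants plus
# 2 for the variance parameter that AIC() counts and extractAIC() does not.
offset <- n * (1 + log(2 * pi)) + 2
all.equal(aic_full - aic_step, offset)
```

Values from the two functions should therefore not be mixed in one comparison.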

Forward selection by AIC also retains the number of passengers as a contributor.

So the final model obtained here is the four-variable model, i.e.

Distance = -6248.753 + 82.131 (speed_air) + 425.592 (aircraft = boeing) + 13.696 (height) - 2.316 (no_pasg)

From this, we can also conclude that there is no single criterion for selecting a model, and no single “best model”. Because AIC is the selection criterion here, no_pasg is included in the model. However, under a p-value criterion, no_pasg would turn out to be non-significant.
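
How much the chosen model depends on the criterion can be seen by changing the penalty in stepAIC. Its `k` argument defaults to 2 (AIC), and setting `k = log(n)` gives a BIC-style penalty. The sketch below uses the built-in mtcars data and an arbitrary set of candidate variables, so the selected terms are illustrative and say nothing about the FAA data.

```r
library(MASS)

# Illustrative candidate set on a built-in data set (an assumption).
null_fit <- lm(mpg ~ 1, data = mtcars)
full_fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
scope <- list(lower = formula(null_fit), upper = formula(full_fit))

# Forward selection with the default AIC penalty (k = 2).
by_aic <- stepAIC(null_fit, scope = scope, direction = "forward", trace = 0)

# Forward selection with a heavier, BIC-style penalty (k = log(n)).
by_bic <- stepAIC(null_fit, scope = scope, direction = "forward", trace = 0,
                  k = log(nrow(mtcars)))

formula(by_aic)
formula(by_bic)
```

With single-degree-of-freedom terms, the heavier penalty can only stop the forward search earlier, so the BIC-style model never contains a term that the AIC model lacks.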