BANA 7042 PROJECT

INTRODUCTION

Background: Flight landing.

Motivation: To reduce the risk of landing overrun.

Goal: To study what factors and how they would impact the landing distance of a commercial flight.

Data: Landing data (landing distance and other parameters) from 950 commercial flights (not real data set but simulated from statistical models).

Variable dictionary:

Variable_Name	Description
Aircraft	The make of an aircraft (Boeing or Airbus)
Duration (in minutes)	Flight duration between taking off and landing. The duration of a normal flight should always be greater than 40min
No_pasg	The number of passengers in a flight
Speed_ground (in miles per hour)	The ground speed of an aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal
Speed_air (in miles per hour)	The air speed of an aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal
Height (in meters)	The height of an aircraft when it is passing over the threshold of the runway. The landing aircraft is required to be at least 6 meters high at the threshold of the runway
Pitch (in degrees)	Pitch angle of an aircraft when it is passing over the threshold of the runway
Distance (in feet)	The landing distance of an aircraft. More specifically, it refers to the distance between the threshold of the runway and the point where the aircraft can be fully stopped. The length of the airport runway is typically less than 6000 feet

Loading the packages required:

library(readxl)
library(dplyr)
library(purrr)
library(tidyr)
library(ggplot2)
library(knitr)
library(kableExtra)
library(hrbrthemes)

INITIAL EXPLORATION

Loading the dataset and checking the structure of the datasets:

FAA1 <- read_xls("C:\\Users\\arunp\\Desktop\\UC\\ACADEMICS\\SEMESTER 2\\7042-STATISTICAL MODELLING\\FAA1.xls")
FAA2 <- read_xls("C:\\Users\\arunp\\Desktop\\UC\\ACADEMICS\\SEMESTER 2\\7042-STATISTICAL MODELLING\\FAA2.xls")
#structure of data
str(FAA1)

## tibble [800 x 8] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : chr [1:800] "boeing" "boeing" "boeing" "boeing" ...
##  $ duration    : num [1:800] 98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num [1:800] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:800] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:800] 109 103 NA NA NA ...
##  $ height      : num [1:800] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:800] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:800] 3370 2988 1145 1664 1050 ...

str(FAA2)

## tibble [150 x 7] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : chr [1:150] "boeing" "boeing" "boeing" "boeing" ...
##  $ no_pasg     : num [1:150] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:150] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:150] 109 103 NA NA NA ...
##  $ height      : num [1:150] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:150] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:150] 3370 2988 1145 1664 1050 ...

The dataset FAA1 has got 800 observations and 8 variables.The dataset FAA2 has got 150 observations and 7 variables. FAA2 dataset doesn’t contain the information about duration of flights.

Merging the two datasets into a single dataset:

#merging datasets
FAA_merged <- bind_rows(FAA1,FAA2)

Let us now check for duplicate observations in the data.

#check for duplicates
duplicate_obs1 <- duplicated(FAA_merged %>% select(-duration))
print(paste("There are" ,sum(duplicate_obs1),"duplicate observations in the dataset."))

## [1] "There are 100 duplicate observations in the dataset."

We will drop these duplicate observations from our dataset.

#dropping duplicates
duplicates <- duplicate_obs1 %>% which()
FAA <- FAA_merged[-duplicates,]

Now, let us check the structure of the dataset again.

str(FAA)

## tibble [850 x 8] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : chr [1:850] "boeing" "boeing" "boeing" "boeing" ...
##  $ duration    : num [1:850] 98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num [1:850] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:850] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:850] 109 103 NA NA NA ...
##  $ height      : num [1:850] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:850] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:850] 3370 2988 1145 1664 1050 ...

There are 850 observations and 8 variables in our dataset.

Let us check the summary statistics of each variable to have a better understanding.

summary(FAA)

##    aircraft            duration         no_pasg      speed_ground   
##  Length:850         Min.   : 14.76   Min.   :29.0   Min.   : 27.74  
##  Class :character   1st Qu.:119.49   1st Qu.:55.0   1st Qu.: 65.90  
##  Mode  :character   Median :153.95   Median :60.0   Median : 79.64  
##                     Mean   :154.01   Mean   :60.1   Mean   : 79.45  
##                     3rd Qu.:188.91   3rd Qu.:65.0   3rd Qu.: 92.06  
##                     Max.   :305.62   Max.   :87.0   Max.   :141.22  
##                     NA's   :50                                      
##    speed_air          height           pitch          distance      
##  Min.   : 90.00   Min.   :-3.546   Min.   :2.284   Min.   :  34.08  
##  1st Qu.: 96.25   1st Qu.:23.314   1st Qu.:3.642   1st Qu.: 883.79  
##  Median :101.15   Median :30.093   Median :4.008   Median :1258.09  
##  Mean   :103.80   Mean   :30.144   Mean   :4.009   Mean   :1526.02  
##  3rd Qu.:109.40   3rd Qu.:36.993   3rd Qu.:4.377   3rd Qu.:1936.95  
##  Max.   :141.72   Max.   :59.946   Max.   :5.927   Max.   :6533.05  
##  NA's   :642

Initial Summary of variables

Around 75% of the observations have a missing value of speed_air variable which should be of a concern.
There are around 50 missing values in the variable duration which has to be taken care of. Also, the minimum duration is shown as 14.76 min which appears to be an outlier compared to the other values.
The maximum value of distance is very high compared to the other values. Also, the median and mean of the variable distance differ substantially indicating a skewed distribution.
There is a negative value for height variable which should be a wrong observation.
There were 100 duplicate observations in the dataset which were removed.

DATA CLEANING

From the dictionary of the dataset, we can see the description for normal values of each variable. Let us check for any abnormal values in the dataset and remove them.

attach(FAA)
FAA_normal <- FAA %>%  
              filter((duration > 40| is.na(duration)) & (speed_ground >= 30) & (speed_ground <= 140) &
                       ((speed_air >= 30) & (speed_air <= 140)|is.na(speed_air)) &
           (height >= 6) & (distance < 6000))

dim(FAA_normal)

## [1] 831   8

There are 831 observations with all the variables as normal values. We will remove the 19 abnormal observations.

#removing abnormal values
FAA <- FAA_normal

Since the variable speed_air contains almost 75% of missing data, imputing the data would not be a good idea.So, we are not imputing the missing values in the variable speed_air. But the variable duration contains only less number of missing values with respect to the sample size. We will impute the null values in the variable duration with mean value as the variable is of a main concern for us.

#imputing duration
FAA <- transform(FAA, duration = ifelse(is.na(duration), mean(duration, na.rm=TRUE),duration))

Now, let us check for structure and summary of the cleaned dataset.

str(FAA)

## 'data.frame':    831 obs. of  8 variables:
##  $ aircraft    : chr  "boeing" "boeing" "boeing" "boeing" ...
##  $ duration    : num  98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...

The dataset contains 831 observations and 8 variables.

Let us check the summary of the variables.

summary(FAA)

##    aircraft            duration         no_pasg       speed_ground   
##  Length:831         Min.   : 41.95   Min.   :29.00   Min.   : 33.57  
##  Class :character   1st Qu.:122.67   1st Qu.:55.00   1st Qu.: 66.20  
##  Mode  :character   Median :154.78   Median :60.00   Median : 79.79  
##                     Mean   :154.78   Mean   :60.06   Mean   : 79.54  
##                     3rd Qu.:186.37   3rd Qu.:65.00   3rd Qu.: 91.91  
##                     Max.   :305.62   Max.   :87.00   Max.   :132.78  
##                                                                      
##    speed_air          height           pitch          distance      
##  Min.   : 90.00   Min.   : 6.228   Min.   :2.284   Min.   :  41.72  
##  1st Qu.: 96.23   1st Qu.:23.530   1st Qu.:3.640   1st Qu.: 893.28  
##  Median :101.12   Median :30.167   Median :4.001   Median :1262.15  
##  Mean   :103.48   Mean   :30.458   Mean   :4.005   Mean   :1522.48  
##  3rd Qu.:109.36   3rd Qu.:37.004   3rd Qu.:4.370   3rd Qu.:1936.63  
##  Max.   :132.91   Max.   :59.946   Max.   :5.927   Max.   :5381.96  
##  NA's   :628

Let us plot the histograms for all numerical variables to have a better understanding about their distribution.

FAA_numeric <- FAA %>% keep(is.numeric)
FAA_numeric %>%  
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram()

All histograms except speed_air and distance shows normal distribution.

Summary after Data Cleaning

There were 19 abnormal variable values which were removed from the dataset.
The distribution of speed_air appears to be skewed.
The distance distribution also shows a skewed distribution.
All other variables appear to be normally distributed.
The 50 NULL values in the variable duration were imputed with the mean value.

LINEAR REGRESSION

INITIAL ANALYSIS

Computing the pairwise correlation between the landing distance and each factor X:

cor(FAA_numeric$distance,FAA_numeric,use = "pairwise.complete.obs")

##         duration     no_pasg speed_ground speed_air     height      pitch
## [1,] -0.05026941 -0.01775663    0.8662438 0.9420971 0.09941121 0.08702846
##      distance
## [1,]        1

Table 1

Variable_Name	Size_of_Correlation	Direction_of_correlation
Speed_Air	0.9420971	Positive
Speed_Ground	0.8662438	Positive
Height	0.0994112	Positive
Pitch	0.0870285	Positive
Duration	0.0502694	Negative
No. of Passengers	0.0177566	Negative

Let us check for the scatter plots between distance and all other variables.

FAA_numeric %>%
  gather(-distance, key = "var", value = "value") %>%
  ggplot(aes(x = value, y = distance)) +
  geom_point() +
  facet_wrap(~ var, scales = "free") +
  theme_bw()

The scatter plot results are in consistent with the correlation values we observed in Table 1.

As the airplane make is a categorical varaible, we did not include the variable in the initial analysis. We will code this character variable as 1 for the make “Boeing” and 0 for the make “Airbus” and include them in the analysis.

FAA <- transform(FAA, aircraft = ifelse(FAA$aircraft=="boeing",1,0))
cor(FAA$distance,FAA,use = "pairwise.complete.obs")

##       aircraft    duration     no_pasg speed_ground speed_air     height
## [1,] 0.2381445 -0.05026941 -0.01775663    0.8662438 0.9420971 0.09941121
##           pitch distance
## [1,] 0.08702846        1

Let us also update Table 1 with the new variable:

Table 1(updated)

Variable_Name	Size_of_Correlation	Direction_of_correlation
Speed_Air	0.9420971	Positive
Speed_Ground	0.8662438	Positive
Aircraft	0.2381445	Positive
Height	0.0994112	Positive
Pitch	0.0870285	Positive
Duration	0.0502694	Negative
No. of Passengers	0.0177566	Negative

REGRESSION USING SINGLE FACTOR

Let us regress Y (landing distance) on each of the X variables and create a summary table with the findings.

model_aircraft <- lm(distance ~ aircraft)
model_duration <- lm(distance ~ duration)
model_nopsg <- lm(distance ~ no_pasg)
model_speed_grd <- lm(distance ~ speed_ground)
model_speed_air <- lm(distance ~ speed_air)
model_height <- lm(distance ~ height)
model_pitch <- lm(distance ~ pitch)
summary(model_aircraft)$coef[2,4]
summary(model_duration)$coef[2,4]
summary(model_nopsg)$coef[2,4]
summary(model_speed_grd)$coef[2,4]
summary(model_speed_air)$coef[2,4]
summary(model_height)$coef[2,4]
summary(model_pitch)$coef[2,4]

summary(model_aircraft)$coef[2,1]
summary(model_duration)$coef[2,1]
summary(model_nopsg)$coef[2,1]
summary(model_speed_grd)$coef[2,1]
summary(model_speed_air)$coef[2,1]
summary(model_height)$coef[2,1]
summary(model_pitch)$coef[2,1]

Table 2

Variable_Name	Size_of_pvalue	Direction_of_Regressioncoeff
Speed_Ground	0.0000000	Positive
Speed_Air	0.0000000	Positive
Aircraft	0.0000000	Positive
Height	0.0041239	Positive
Pitch	0.0120812	Positive
Duration	0.1514002	Negative
No. of Passengers	0.6092520	Negative

We will now standardize each X variable. In other words,we will create a new variable: X’= {X-mean(X)}/sd(X)}

FAA_std <- FAA
attach(FAA_std)
FAA_std <- transform(FAA_std, aircraft = (aircraft-mean(aircraft))/sd(aircraft))
FAA_std <- transform(FAA_std, duration = (duration-mean(duration))/sd(duration))
FAA_std <- transform(FAA_std, no_pasg = (no_pasg-mean(no_pasg))/sd(no_pasg))
FAA_std <- transform(FAA_std, speed_ground = (speed_ground-mean(speed_ground))/sd(speed_ground))
FAA_std <- transform(FAA_std, speed_air = (speed_air-mean(speed_air,na.rm = TRUE))/sd(speed_air,na.rm = TRUE))
FAA_std <- transform(FAA_std, height = (height-mean(height))/sd(height))
FAA_std <- transform(FAA_std, pitch = (pitch-mean(pitch))/sd(pitch))

We will now regress Y (landing distance) on each of these X’ variables and create a summary table with the findings.

model1_aircraft <- lm(FAA_std$distance ~ FAA_std$aircraft)
model1_duration <- lm(FAA_std$distance ~ FAA_std$duration)
model1_nopsg <- lm(FAA_std$distance ~ FAA_std$no_pasg)
model1_speed_grd <- lm(FAA_std$distance ~ FAA_std$speed_ground)
model1_speed_air <- lm(FAA_std$distance ~ FAA_std$speed_air)
model1_height <- lm(FAA_std$distance ~ FAA_std$height)
model1_pitch <- lm(FAA_std$distance ~ FAA_std$pitch)
summary(model1_aircraft)$coef[2,1]
summary(model1_duration)$coef[2,1]
summary(model1_nopsg)$coef[2,1]
summary(model1_speed_grd)$coef[2,1]
summary(model1_speed_air)$coef[2,1]
summary(model1_height)$coef[2,1]
summary(model1_pitch)$coef[2,1]

Table 3

Variable_Name	Size_of_Regressioncoeff	Direction_of_Regressioncoeff
Speed_Ground	776.44740	Positive
Speed_Air	774.34650	Positive
Aircraft	213.45800	Positive
Height	89.10606	Positive
Pitch	78.00693	Positive
Duration	45.05839	Negative
No. of Passengers	15.91595	Negative

We can observe that the results from the Tables 1,2 and 3 are almost consistent.Only the correlation table shows a higher value for speed_air than speed_ground. All other rankings are in the same order.

Let us provide a single table that ranks all the factors based on their relative importance in determining the landing distance.

Table 0

Variable_Name	Size_of_Regressioncoeff	Size_of_pvalue	Correlation_coeff
Speed_Ground	776.44740	0.0000000	0.8662438
Speed_Air	774.34650	0.0000000	0.9420971
Aircraft	213.45800	0.0000000	0.2381445
Height	89.10606	0.0041239	0.0994112
Pitch	78.00693	0.0120812	0.0870285
Duration	45.05839	0.1514002	0.0502694
No. of Passengers	15.91595	0.6092520	0.0177566

COLLINEARITY CHECK

Let us check three different models for collinearity:

#model1
summary(model_speed_grd)

## 
## Call:
## lm(formula = distance ~ speed_ground)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -944.71 -328.81  -79.08  209.60 2413.21 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1810.5017    69.2987  -26.13   <2e-16 ***
## speed_ground    41.9940     0.8482   49.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 471 on 848 degrees of freedom
## Multiple R-squared:  0.743,  Adjusted R-squared:  0.7427 
## F-statistic:  2451 on 1 and 848 DF,  p-value: < 2.2e-16

#model2
summary(model_speed_air)

## 
## Call:
## lm(formula = distance ~ speed_air)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -798.35 -193.18    3.68  216.25  817.79 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5718.404    201.994  -28.31   <2e-16 ***
## speed_air      82.175      1.937   42.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 285.9 on 206 degrees of freedom
##   (642 observations deleted due to missingness)
## Multiple R-squared:  0.8973, Adjusted R-squared:  0.8968 
## F-statistic:  1800 on 1 and 206 DF,  p-value: < 2.2e-16

#model3
attach(FAA)
model3 <- lm(distance ~ speed_ground + speed_air)
summary(model3)

## 
## Call:
## lm(formula = distance ~ speed_ground + speed_air)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -819.74 -202.02    3.52  211.25  636.25 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -5462.28     207.48 -26.327  < 2e-16 ***
## speed_ground   -14.37      12.68  -1.133    0.258    
## speed_air       93.96      12.89   7.291 6.99e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 200 degrees of freedom
##   (628 observations deleted due to missingness)
## Multiple R-squared:  0.8883, Adjusted R-squared:  0.8871 
## F-statistic:   795 on 2 and 200 DF,  p-value: < 2.2e-16

We can observe that when we regress distance with both the variables, speed_ground becomes statistically insignificant.Also, the coefficient estimate of speed_ground becomes negative. This may be due to the high correlation between the two variables: speed_ground and speed_air.

Let us confirm this by checking the correlation.

cor(speed_air,speed_ground,use = "complete.obs")

## [1] 0.9879383

The result confirms the high correlation between speed_air and speed_ground. So, we will include either one of these variables in the model.Even though , R-squared value is more for the model with speed_air, as there are a lot of NULL values in the speed_air variable, I would keep only speed_ground in the model.

VARIABLE SELECTION

We would now build different models as per their ranking in Table 0.

attach(FAA)
model1 <- lm(distance ~ speed_ground)
model2 <- lm(distance ~ speed_ground + aircraft)
model3 <- lm(distance ~ speed_ground + aircraft + height)
model4 <- lm(distance ~ speed_ground + aircraft + height + pitch)
model5 <- lm(distance ~ speed_ground + aircraft + height + pitch + no_pasg)
model6 <- lm(distance ~ speed_ground + aircraft + height + pitch +  no_pasg + duration)

Let us check how good the model is by checking R-squared of each model.

model1_rsquared <- summary(model1)$r.squared
model2_rsquared <- summary(model2)$r.squared
model3_rsquared <- summary(model3)$r.squared
model4_rsquared <- summary(model4)$r.squared
model5_rsquared <- summary(model5)$r.squared
model6_rsquared <- summary(model6)$r.squared

rsquared <- cbind(c(model1_rsquared,model2_rsquared,model3_rsquared,model4_rsquared,model5_rsquared,model6_rsquared),1:6)
rsquared <- as.data.frame(rsquared)
names(rsquared) <- c("Rsquared","Variables")
rsquared

##    Rsquared Variables
## 1 0.7503784         1
## 2 0.8251319         2
## 3 0.8488989         3
## 4 0.8493717         4
## 5 0.8497100         5
## 6 0.8497162         6

Let us check how R-squared varies with different models as we increase the number of predictors.

rsquared %>%
  ggplot(aes(x= Variables, y= Rsquared)) +
  geom_line( color="grey") +
  geom_point(shape=21, color="black", fill="#69b3a2", size=6) +
  scale_x_continuous("Variables", labels = as.character(rsquared$Variables), breaks = rsquared$Variables) +
  theme_ipsum() +
  ggtitle("R-squared with number of variables")

We can see that R-squared value increases as the number of variables increases. But after number of variables reach 3,the difference in R-squared is very less that the value almost remains constant.

When we are doing Multiple regression, Adjusted R-squared can be a better guide than R-squared as it even penalizes the model for adding more variables. Let us check the adjusted R-squared values for different models.

model1_adjrsquared <- summary(model1)$adj.r.squared
model2_adjrsquared <- summary(model2)$adj.r.squared
model3_adjrsquared <- summary(model3)$adj.r.squared
model4_adjrsquared <- summary(model4)$adj.r.squared
model5_adjrsquared <- summary(model5)$adj.r.squared
model6_adjrsquared <- summary(model6)$adj.r.squared
adjrsquared <- cbind(c(model1_adjrsquared,model2_adjrsquared,model3_adjrsquared,model4_adjrsquared,model5_adjrsquared,model6_adjrsquared),1:6)
adjrsquared <- as.data.frame(adjrsquared)
names(adjrsquared) <- c("AdjRsquared","Variables")
adjrsquared

##   AdjRsquared Variables
## 1   0.7500773         1
## 2   0.8247095         2
## 3   0.8483508         3
## 4   0.8486423         4
## 5   0.8487992         5
## 6   0.8486219         6

Let us plot the adjusted R-squared values against the number of variables.

adjrsquared %>%
  ggplot(aes(x= Variables, y= AdjRsquared)) +
  geom_line( color="grey") +
  geom_point(shape=21, color="black", fill="#69b3a2", size=6) +
  scale_x_continuous("Variables", labels = as.character(adjrsquared$Variables), breaks = adjrsquared$Variables) +
  theme_ipsum() +
  ggtitle("Adjusted R-squared with number of variables")

Adjusted R-squared also behaves almost similar to R-squared. The value increases upto 3 variables and then appears to remain constant.

AIC can also be a good indicator of a good model. Let us check for the AIC values for different models.

model1_aic <- AIC(model1)
model2_aic <- AIC(model2)
model3_aic <- AIC(model3)
model4_aic <- AIC(model4)
model5_aic <- AIC(model5)
model6_aic <- AIC(model6)
aic <- cbind(c(model1_aic,model2_aic,model3_aic,model4_aic,model5_aic,model6_aic),1:6)
aic <- as.data.frame(aic)
names(aic) <- c("AIC","Variables")
aic

##        AIC Variables
## 1 12508.81         1
## 2 12215.05         2
## 3 12095.65         3
## 4 12095.05         4
## 5 12095.18         5
## 6 12097.14         6

Let us plot the variation in AIC values against the number of variables.

aic %>%
  ggplot(aes(x= Variables, y= AIC)) +
  geom_line( color="grey") +
  geom_point(shape=21, color="black", fill="#69b3a2", size=6) +
  scale_x_continuous("Variables", labels = as.character(aic$Variables), breaks = aic$Variables) +
  theme_ipsum() +
  ggtitle("AIC with number of variables")

We know that a better model has a lower AIC value. Here, AIC value decreases and even though the least value is for the model with number of variables 4, after the number of variables reach 3, the difference between the AIC values is minimal. After 3, the graph appears to level off.

So, observing the results from adjusted R-squared and AIC values , I would select the model with 3 variables i.e. regression of distance over the variables speed_ground,aircraft and height.Let us check the performance of this model.

summary(model3)

## 
## Call:
## lm(formula = distance ~ speed_ground + aircraft + height)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -711.95 -226.73  -90.17  130.04 1471.84 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2512.2433    68.1974  -36.84   <2e-16 ***
## speed_ground    42.4024     0.6483   65.41   <2e-16 ***
## aircraft       496.0452    24.2975   20.41   <2e-16 ***
## height          14.1478     1.2405   11.40   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 349.1 on 827 degrees of freedom
## Multiple R-squared:  0.8489, Adjusted R-squared:  0.8484 
## F-statistic:  1549 on 3 and 827 DF,  p-value: < 2.2e-16

model3_aic

## [1] 12095.65

We can find that the adjusted R-squared value is 0.8484 which means the model can explain around 85% of variation in the response variable.

AUTOMATE ALGORITHM

We will now use the “stepAIC” function to find out the model it suggests through forward variable selection.

library(MASS)
FAA_step <- FAA[,-5]
m1 <- lm(distance ~ 1,FAA_step)
m2 <- lm(distance~ . ,FAA_step)
f <- stepAIC(m1, direction="forward", scope=list(lower=m1, upper=m2))

## Start:  AIC=11299.8
## distance ~ 1
## 
##                Df Sum of Sq       RSS   AIC
## + speed_ground  1 500382567 166457762 10148
## + aircraft      1  37818390 629021939 11253
## + height        1   6590108 660250221 11294
## + pitch         1   5050617 661789712 11296
## + duration      1   1685114 665155215 11300
## <none>                      666840329 11300
## + no_pasg       1    210253 666630076 11302
## 
## Step:  AIC=10148.53
## distance ~ speed_ground
## 
##            Df Sum of Sq       RSS     AIC
## + aircraft  1  49848656 116609106  9854.8
## + height    1  14916377 151541385 10072.5
## + pitch     1   9765095 156692668 10100.3
## <none>                  166457762 10148.5
## + no_pasg   1    207528 166250234 10149.5
## + duration  1     51669 166406094 10150.3
## 
## Step:  AIC=9854.77
## distance ~ speed_ground + aircraft
## 
##            Df Sum of Sq       RSS    AIC
## + height    1  15848830 100760276 9735.4
## + pitch     1    455453 116153653 9853.5
## <none>                  116609106 9854.8
## + no_pasg   1     87171 116521935 9856.1
## + duration  1      8445 116600661 9856.7
## 
## Step:  AIC=9735.37
## distance ~ speed_ground + aircraft + height
## 
##            Df Sum of Sq       RSS    AIC
## + pitch     1    315259 100445017 9734.8
## <none>                  100760276 9735.4
## + no_pasg   1    232003 100528273 9735.5
## + duration  1      3976 100756300 9737.3
## 
## Step:  AIC=9734.77
## distance ~ speed_ground + aircraft + height + pitch
## 
##            Df Sum of Sq       RSS    AIC
## <none>                  100445017 9734.8
## + no_pasg   1    225608 100219409 9734.9
## + duration  1      6696 100438321 9736.7

We can see that stepAIC suggests a regression model with the 4 variables(distance ~ speed_ground + aircraft + height + pitch).This is in consistence with what we observed from the graph for AIC values. The model with 4 variables have got the lowest AIC value.

Let us check for the summary of the regression model with these variables.

summary(lm(distance ~ speed_ground + aircraft + height + pitch))

## 
## Call:
## lm(formula = distance ~ speed_ground + aircraft + height + pitch)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -716.81 -224.12  -93.24  127.80 1500.95 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2664.3223   116.4605  -22.88   <2e-16 ***
## speed_ground    42.4283     0.6479   65.49   <2e-16 ***
## aircraft       481.2682    25.9512   18.55   <2e-16 ***
## height          14.0909     1.2398   11.37   <2e-16 ***
## pitch           39.6076    24.5991    1.61    0.108    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 348.7 on 826 degrees of freedom
## Multiple R-squared:  0.8494, Adjusted R-squared:  0.8486 
## F-statistic:  1164 on 4 and 826 DF,  p-value: < 2.2e-16

We can observe that the variable pitch is statistically insignificant in the model. So as per our conclusions from the Adjusted Rsquared, I would consider the model with 3 variables i.e. regression of distance over the variables speed_ground,aircraft and height as the final model.