Import Required Libraries

rm(list = ls())


library(tidyverse)
library(readxl)
library(MASS)
library(reshape2)

Initial Exploration of the Data


Step 1

# Import datasets
FAA1 <- read_excel("C:/Users/Dhruv Cairae/Desktop/FAA1.xls")
FAA2 <- read_excel("C:/Users/Dhruv Cairae/Desktop/FAA2.xls")

Step 2

str(FAA1)
## tibble [800 x 8] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : chr [1:800] "boeing" "boeing" "boeing" "boeing" ...
##  $ duration    : num [1:800] 98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num [1:800] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:800] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:800] 109 103 NA NA NA ...
##  $ height      : num [1:800] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:800] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:800] 3370 2988 1145 1664 1050 ...
str(FAA2)
## tibble [150 x 7] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : chr [1:150] "boeing" "boeing" "boeing" "boeing" ...
##  $ no_pasg     : num [1:150] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:150] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:150] 109 103 NA NA NA ...
##  $ height      : num [1:150] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:150] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:150] 3370 2988 1145 1664 1050 ...

FAA1 has 8 variables and a 800 observations whereas FAA2 has 7 variables and 150 observations. Variables and their class is identical across data sets, the only difference being the absence of duration variables in FAA2

Step 3

FAA <- full_join(FAA1, FAA2)
## Joining, by = c("aircraft", "no_pasg", "speed_ground", "speed_air", "height", "pitch", "distance")

No unique identifier such as flight number exists in both the data frames such as flight number. This creates the complication of an absence of a unique column to join the data frames. The process of full join eliminates rows that have the exact same values across the 7 common variables in both the data frames by default. Its not possible to determine without checking with source of data that the if all values are identical across the two data sets with duration missing, they are indeed identical,there is no time stamp or unique identifier per line item. However for further analysis it is assumed that if all 7 values are common across both data frames, they are indeed identical and analysis shall proceed with 850 observations across 8 variables.

distinct(FAA)
## # A tibble: 850 x 8
##    aircraft duration no_pasg speed_ground speed_air height pitch distance
##    <chr>       <dbl>   <dbl>        <dbl>     <dbl>  <dbl> <dbl>    <dbl>
##  1 boeing       98.5      53        108.       109.   27.4  4.04    3370.
##  2 boeing      126.       69        102.       103.   27.8  4.12    2988.
##  3 boeing      112.       61         71.1       NA    18.6  4.43    1145.
##  4 boeing      197.       56         85.8       NA    30.7  3.88    1664.
##  5 boeing       90.1      70         59.9       NA    32.4  4.03    1050.
##  6 boeing      138.       55         75.0       NA    41.2  4.20    1627.
##  7 boeing       73.0      54         54.4       NA    24.0  3.84     805.
##  8 boeing       52.9      57         57.1       NA    19.4  4.64     574.
##  9 boeing      156.       61         85.4       NA    35.4  4.23    1699.
## 10 boeing      177.       56         61.8       NA    36.7  4.18    1138.
## # ... with 840 more rows
# Remove the redundant FAA1 and FAA2 variables from memory
rm(FAA1, FAA2)

The operation distinct to eliminate duplicates does not remove any rows, therefore there are no duplicates.

Step 4

str(FAA)
## tibble [850 x 8] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : chr [1:850] "boeing" "boeing" "boeing" "boeing" ...
##  $ duration    : num [1:850] 98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num [1:850] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:850] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:850] 109 103 NA NA NA ...
##  $ height      : num [1:850] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:850] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:850] 3370 2988 1145 1664 1050 ...
summary(FAA)
##    aircraft            duration         no_pasg      speed_ground   
##  Length:850         Min.   : 14.76   Min.   :29.0   Min.   : 27.74  
##  Class :character   1st Qu.:119.49   1st Qu.:55.0   1st Qu.: 65.90  
##  Mode  :character   Median :153.95   Median :60.0   Median : 79.64  
##                     Mean   :154.01   Mean   :60.1   Mean   : 79.45  
##                     3rd Qu.:188.91   3rd Qu.:65.0   3rd Qu.: 92.06  
##                     Max.   :305.62   Max.   :87.0   Max.   :141.22  
##                     NA's   :50                                      
##    speed_air          height           pitch          distance      
##  Min.   : 90.00   Min.   :-3.546   Min.   :2.284   Min.   :  34.08  
##  1st Qu.: 96.25   1st Qu.:23.314   1st Qu.:3.642   1st Qu.: 883.79  
##  Median :101.15   Median :30.093   Median :4.008   Median :1258.09  
##  Mean   :103.80   Mean   :30.144   Mean   :4.009   Mean   :1526.02  
##  3rd Qu.:109.40   3rd Qu.:36.993   3rd Qu.:4.377   3rd Qu.:1936.95  
##  Max.   :141.72   Max.   :59.946   Max.   :5.927   Max.   :6533.05  
##  NA's   :642

850 observations across 8 variables

Step 5

Observations of negative height, clearly data entry errors, need rectification. Duration has missing observations. Observations with maximum distance and those above the 3rd quantile, should be focused on. Speed Air needs more research due to the large proportion of missing data. At this preliminary stage focus would be on data integrity, obvious outliers or anomalies that appear problematic & values of the response variable that are of significance.

Data Cleaning and Further Exploration


Step 6: Abnormal Value Removal

From the summary from Step 4, there are abnormal observations in the dataset:

  1. there are durations under 40 minutes in the dataset, as well as NAs.
  2. there are observations of speed_ground less than 30 mpg and greater than 140 mph.
  3. speed_air has values above 140 mph and 642 NAs.
  4. height has a minimum less than 6 meters.
  5. distance has value above 6000 feet, but we want to keep this because we are looking for overruns.
# Using table, we can confirm that only airbus and boeing aircraft are present.
FAA %>% dplyr::select(aircraft) %>% table()
## .
## airbus boeing 
##    450    400
# Record the number of precleaning observations
no_pre_clean <- nrow(FAA)

# Delete the abnormal observations from the dataset.
FAA %>% 
  # Remove missing values from the dataset
  filter(!is.na(duration), !is.na(speed_air)) %>% 
  # Remove remaining anomalous observations
  filter(duration > 40,
         between(speed_ground, 30, 140),
         between(speed_air, 30, 140)
  ) -> FAA

# Number of Rows Deleted: 654
no_pre_clean - nrow(FAA)    
## [1] 654

Step 7

# Check structure of dataset and no. of variables
str(FAA)
## tibble [196 x 8] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : chr [1:196] "boeing" "boeing" "boeing" "boeing" ...
##  $ duration    : num [1:196] 98.5 125.7 72.3 187.6 233.8 ...
##  $ no_pasg     : num [1:196] 53 69 54 58 69 55 70 53 62 65 ...
##  $ speed_ground: num [1:196] 107.9 101.7 93.4 94 104.8 ...
##  $ speed_air   : num [1:196] 109.3 102.9 92.9 96.2 103.9 ...
##  $ height      : num [1:196] 27.4 27.8 32.2 33.7 43.9 ...
##  $ pitch       : num [1:196] 4.04 4.12 3.82 4.64 3.25 ...
##  $ distance    : num [1:196] 3370 2988 2129 2305 3214 ...
# The structure is unchanged, but now there are only 196 observations of 8
# variables.

# Calculate summary statistics for each variable.
summary(FAA)
##    aircraft            duration        no_pasg       speed_ground   
##  Length:196         Min.   : 45.5   Min.   :41.00   Min.   : 88.69  
##  Class :character   1st Qu.:116.4   1st Qu.:56.00   1st Qu.: 95.30  
##  Mode  :character   Median :149.3   Median :60.00   Median :100.90  
##                     Mean   :150.7   Mean   :59.85   Mean   :103.60  
##                     3rd Qu.:185.0   3rd Qu.:65.00   3rd Qu.:109.65  
##                     Max.   :287.0   Max.   :80.00   Max.   :136.66  
##    speed_air          height           pitch          distance   
##  Min.   : 90.00   Min.   : 9.697   Min.   :2.702   Min.   :1741  
##  1st Qu.: 96.16   1st Qu.:23.409   1st Qu.:3.638   1st Qu.:2163  
##  Median :100.99   Median :29.874   Median :4.087   Median :2534  
##  Mean   :103.67   Mean   :30.430   Mean   :4.044   Mean   :2802  
##  3rd Qu.:109.48   3rd Qu.:36.716   3rd Qu.:4.431   3rd Qu.:3192  
##  Max.   :136.42   Max.   :58.228   Max.   :5.311   Max.   :6310
# Table of aircraft manufacturer values.
FAA %>% dplyr::select(aircraft) %>% table()
## .
## airbus boeing 
##     77    119

Our cleaning hit the data from both manufacturers hard, especially airbus.

Step 8

# Generate histograms for each numeric variable.
# Duration
FAA %>% 
  ggplot() +
  geom_histogram(aes(x = duration)) +
  labs(x = "Duration (minutes)",
       y = "Number of Observations",
       title = "Duration Histogram")

# Number of Passengers
FAA %>% 
  ggplot() +
  geom_histogram(aes(x = no_pasg)) +
  labs(x = "Number of Passengers",
       y = "Number of Observations",
       title = "Number of Passengers Histogram")

# Ground Speed
FAA %>% 
  ggplot() +
  geom_histogram(aes(x = speed_ground)) +
  labs(x = "Ground Speed (MPH)",
       y = "Number of Observations",
       title = "Ground Speed Histogram")

# Air Speed
FAA %>% 
  ggplot() +
  geom_histogram(aes(x = speed_air)) +
  labs(x = "Air Speed (MPH)",
       y = "Number of Observations",
       title = "Air Speed Histogram")

# Height
FAA %>% 
  ggplot() +
  geom_histogram(aes(x = height)) +
  labs(x = "Height (meters)",
       y = "Number of Observations",
       title = "Height Histogram")

# Pitch
FAA %>% 
  ggplot() +
  geom_histogram(aes(x = pitch)) +
  labs(x = "Pitch (°)",
       y = "Number of Observations",
       title = "Pitch Histogram")

# Distance
FAA %>% 
  ggplot() +
  geom_histogram(aes(x = distance)) +
  labs(x = "Distance (feet)",
       y = "Number of Observations",
       title = "Distance Histogram")

Step 9

  1. Distance has a right skewed distribution, with 75% of planes travelling less than 3192 ft.
  2. Air Speed and Ground Speed both have a right tailed distribution.
  3. Planes typically have a height of 30 m (~90 ft) on approach to the runway.
  4. Number of passengers has a symmetric distribution, with a mean of ~60 passengers
  5. Flight duration was on average ~150 minutes, and appears to be symmetrically distributed

Identification of Important Factors for Landing Distance


Step 10

# Converting aircraft to factor and creating a corresponding binary variable -
# aircraft_type.

names(FAA)
## [1] "aircraft"     "duration"     "no_pasg"      "speed_ground" "speed_air"   
## [6] "height"       "pitch"        "distance"
FAA %>%
  mutate(aircraft = as.factor(aircraft)) %>% 
  mutate(aircraft_type = ifelse(aircraft == "boeing", 1, 0)) %>% 
  dplyr::select(aircraft, aircraft_type, duration, no_pasg,
         speed_ground, speed_air, height, pitch, distance) -> FAA

# Computing pairwise correlation between distance and the numeric features. 
# Assigning the resultant table proper column names and adding the direction 
# column as instructed.
table_1 <- data.frame(cor(FAA[2:8], FAA$distance))
colnames(table_1) <- c("coefficient")

table_1 <- tibble::rownames_to_column(table_1, "features") %>% 
  arrange(desc(abs(coefficient))) %>% 
  mutate(direction = ifelse(coefficient < 0, "negative","positive"))

table_1
##        features coefficient direction
## 1     speed_air  0.94529684  positive
## 2  speed_ground  0.93171687  positive
## 3 aircraft_type  0.18154185  positive
## 4        height  0.08568026  positive
## 5         pitch  0.03723416  positive
## 6      duration  0.03671284  positive
## 7       no_pasg -0.01879416  negative

The correlation coefficient is very high between distance and speed_air(0.945) as well as speed_ground(0.932), indicating that flights with higher speed in both air and land have higher landing distance. Speed in air seems to have a slightly higher positive correlation with the landing distance than speed on the ground. The remaining features all have extremely low coefficients (less than 0.1). Out of all features, number of passengers appears to have a very weak negative correlation with landing distance.

Step 11

The scatterplots of landing distance against the other features are plotted using the code below. The scatteplots reveal a seemingly strong linear relationship between distance and speed of the aircraft in both air and on land, which is expected based on the correlation coefficient values calculated earlier. However, the distributions of height, pitch, number of passengers and duration do not reveal a linear relationship with distance. This corroborates the very low correlation coefficient values computed in the previous step.

# Air Speed
p1 <- ggplot(FAA, aes(x = distance, y = speed_air)) +
  geom_point(alpha = 0.9) + 
  labs(y = "speed_air", title = "")

# Ground Speed
p2 <- ggplot(FAA, aes(x = distance, y = speed_ground)) +
  geom_point(alpha = 0.9) + 
  labs(y = "speed_ground", title = "")

# Height
p3 <- ggplot(FAA, aes(x = distance, y = height)) +
  geom_point(alpha = 0.9) + 
  labs(y = "height", title = "")

# Pitch
p4 <- ggplot(FAA, aes(x = distance, y = pitch)) +
  geom_point(alpha = 0.9) + 
  labs(y = "pitch", title = "")

# Duration
p5 <- ggplot(FAA, aes(x = distance, y = duration)) +
  geom_point(alpha = 0.9) + 
  labs(y = "duration", title = "")

# Number of Passengers
p6 <- ggplot(FAA, aes(x = distance, y = no_pasg)) +
  geom_point(alpha = 0.9) + 
  labs(y = "no_pasg", title = "")


gridExtra::grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 3)

Step 12

Repeating the variable relationship plots including aircraft type to determine the effect of aircraft carrier in the relationships: The plots reveal the same patterns as observed in the previous Step 11. The scatterplots reveal a linear relationship between distance and speed of the aircraft in both air and on land irrespective of aircraft type. However, the distributions of height, pitch , number of passengers and duration do not reveal a linear relationship with distance for either type of carrier.

# Air Speed
p12 <- ggplot(FAA, aes(x = distance, y = speed_air, color = aircraft)) +
  geom_point(alpha = 0.9) +
  theme_light() +
  labs(x = "distance", y = "speed_air") +
  theme(legend.title = element_blank(), legend.position = c(0.2, 0.8))

# Ground Speed 
p22 <- ggplot(FAA, aes(x = distance, y = speed_ground, color = aircraft)) +
  geom_point(alpha = 0.9) + 
  theme_light() +
  labs(x = "distance", y = "speed_ground") +
  theme(legend.title = element_blank(), legend.position = "none")

# Height
p32 <- ggplot(FAA, aes(x = distance, y = height, color = aircraft)) +
  geom_point(alpha = 0.9) + 
  theme_light() +
  labs(x = "distance", y = "height") +
  theme(legend.title = element_blank(), legend.position = "none")

# Pitch
p42 <- ggplot(FAA, aes(x = distance, y = pitch, color = aircraft)) +
  geom_point(alpha = 0.9) + 
  theme_light() +
  labs(x = "distance", y = "pitch") +
  theme(legend.title = element_blank(), legend.position = "none")

# Duration
p52 <- ggplot(FAA, aes(x = distance, y = duration, color = aircraft)) +
  geom_point(alpha = 0.9) + 
  theme_light() +
  labs(x = "distance", y = "duration") +
  theme(legend.title = element_blank(), legend.position = "none")

# Number of Passengers
p62 <- ggplot(FAA, aes(x = distance, y = no_pasg, color = aircraft)) +
  geom_point(alpha = 0.9) + 
  theme_light() +
  labs(x = "distance", y = "no_pasg") +
  theme(legend.title = element_blank(), legend.position = "none")

gridExtra::grid.arrange(p12, p22, p32, p42, p52, p62, ncol = 3)

Regression Using a Single Factor


Step 13

attach(FAA)

lr1 <- lm(distance ~ no_pasg)
lr2 <- lm(distance ~ duration)
lr3 <- lm(distance ~ height)
lr4 <- lm(distance ~ pitch)
lr5 <- lm(distance ~ aircraft_type)
lr6 <- lm(distance ~ speed_air)
lr7 <- lm(distance ~ speed_ground)

mod <- list(lr1, lr2, lr3, lr4, lr5, lr6, lr7)
r2 <- lapply(mod, FUN = function(lin.mod) summary(lin.mod)$coefficients[8])
r3 <- lapply(mod, FUN = function(lin.mod) summary(lin.mod)$coefficients[2])
r4 <- list('no_pasg','duration','height','pitch','aircraft_type','speed_air','speed_ground')

r <- do.call(rbind, Map(data.frame, features = r4, p_value = r2, Estimate = r3 ))

r %>% 
  mutate(direction = ifelse(Estimate >= 0,'Positive','Negative')) %>% 
  dplyr::select(features,p_value,direction) %>% 
  arrange(p_value) -> table_2

table_2
##        features      p_value direction
## 1     speed_air 2.514701e-96  Positive
## 2  speed_ground 2.836451e-87  Positive
## 3 aircraft_type 1.088014e-02  Positive
## 4        height 2.324592e-01  Positive
## 5         pitch 6.043719e-01  Positive
## 6      duration 6.094459e-01  Positive
## 7       no_pasg 7.937391e-01  Negative

Step 14

FAA %>% mutate_at(c(3:8), ~(scale(.) %>% as.vector)) -> FAA_normalized

attach(FAA_normalized)
## The following objects are masked from FAA:
## 
##     aircraft, aircraft_type, distance, duration, height, no_pasg,
##     pitch, speed_air, speed_ground
lr_1 <- lm(distance ~ no_pasg)
lr_2 <- lm(distance ~duration)
lr_3 <- lm(distance ~ height)
lr_4 <- lm(distance ~pitch)
lr_5 <- lm(distance ~ aircraft_type)
lr_6 <- lm(distance ~ speed_air)
lr_7 <- lm(distance ~ speed_ground)
models_norm <- list(lr_1, lr_2, lr_3,lr_4, lr_5, lr_6, lr_7)
coeff_norm <- lapply(models_norm, FUN = function(lin.mod) summary(lin.mod)$coefficients[2])
variable_name <- list('no_pasg','duration','height','pitch','aircraft_type','speed_air','speed_ground')
tab <- do.call(rbind, Map(data.frame, features = variable_name, Estimate = coeff_norm ))

tab %>% 
  mutate(direction = ifelse(Estimate >= 0,'Positive','Negative')) %>%
  arrange(desc(Estimate)) -> table_3

table_3
##        features  Estimate direction
## 1     speed_air 818.06654  Positive
## 2  speed_ground 806.31435  Positive
## 3 aircraft_type 320.86598  Positive
## 4        height  74.14830  Positive
## 5         pitch  32.22271  Positive
## 6      duration  31.77155  Positive
## 7       no_pasg -16.26460  Negative

Step 15

names(table_1)
## [1] "features"    "coefficient" "direction"
names(table_2)
## [1] "features"  "p_value"   "direction"
names(table_3)
## [1] "features"  "Estimate"  "direction"
merge_1 <- merge(table_1, table_3, by = 'features')
merge_1
##        features coefficient direction.x  Estimate direction.y
## 1 aircraft_type  0.18154185    positive 320.86598    Positive
## 2      duration  0.03671284    positive  31.77155    Positive
## 3        height  0.08568026    positive  74.14830    Positive
## 4       no_pasg -0.01879416    negative -16.26460    Negative
## 5         pitch  0.03723416    positive  32.22271    Positive
## 6     speed_air  0.94529684    positive 818.06654    Positive
## 7  speed_ground  0.93171687    positive 806.31435    Positive
merge_2 <- merge(merge_1, table_2, by = 'features')
merge_2
##        features coefficient direction.x  Estimate direction.y      p_value
## 1 aircraft_type  0.18154185    positive 320.86598    Positive 1.088014e-02
## 2      duration  0.03671284    positive  31.77155    Positive 6.094459e-01
## 3        height  0.08568026    positive  74.14830    Positive 2.324592e-01
## 4       no_pasg -0.01879416    negative -16.26460    Negative 7.937391e-01
## 5         pitch  0.03723416    positive  32.22271    Positive 6.043719e-01
## 6     speed_air  0.94529684    positive 818.06654    Positive 2.514701e-96
## 7  speed_ground  0.93171687    positive 806.31435    Positive 2.836451e-87
##   direction
## 1  Positive
## 2  Positive
## 3  Positive
## 4  Negative
## 5  Positive
## 6  Positive
## 7  Positive
table_0 <- subset(merge_2, select = -c(direction.x,direction.y,direction))
table_0
##        features coefficient  Estimate      p_value
## 1 aircraft_type  0.18154185 320.86598 1.088014e-02
## 2      duration  0.03671284  31.77155 6.094459e-01
## 3        height  0.08568026  74.14830 2.324592e-01
## 4       no_pasg -0.01879416 -16.26460 7.937391e-01
## 5         pitch  0.03723416  32.22271 6.043719e-01
## 6     speed_air  0.94529684 818.06654 2.514701e-96
## 7  speed_ground  0.93171687 806.31435 2.836451e-87
table_0 %>% arrange(desc(coefficient))
##        features coefficient  Estimate      p_value
## 1     speed_air  0.94529684 818.06654 2.514701e-96
## 2  speed_ground  0.93171687 806.31435 2.836451e-87
## 3 aircraft_type  0.18154185 320.86598 1.088014e-02
## 4        height  0.08568026  74.14830 2.324592e-01
## 5         pitch  0.03723416  32.22271 6.043719e-01
## 6      duration  0.03671284  31.77155 6.094459e-01
## 7       no_pasg -0.01879416 -16.26460 7.937391e-01

Looking at correlation coffecients from table_1, and beta coefficients and p-value from table_2 and table_3 we find that speed_air, speed_ground, aircraft_type are relatively important features when compared to other features like height, pitch, duration and no_pasg. The table_0 lists the features in their decreasing order of importance.

Check Collinearity


Step 16

attach(FAA)
## The following objects are masked from FAA_normalized:
## 
##     aircraft, aircraft_type, distance, duration, height, no_pasg,
##     pitch, speed_air, speed_ground
## The following objects are masked from FAA (pos = 4):
## 
##     aircraft, aircraft_type, distance, duration, height, no_pasg,
##     pitch, speed_air, speed_ground
l1 <- lm(distance ~ speed_ground)
l2 <- lm(distance ~ speed_air)
l3 <- lm(distance ~ speed_ground +speed_air)

model <- list(l1,l2,l3)

coeffecient <- lapply(model,function(lin.mod) 
  if (length(lin.mod$coefficients) == 2) lin.mod$coefficients[2]
  else lin.mod$coefficients[2:3])

Yes there is a significant change in sign between speed_ground & speed_air. We need to check correlation between the two variables.

cor(speed_ground,speed_air)
## [1] 0.988969

The two variable are very highly correlated and we need to keep only one for analysis. We will keep the one which has better univariate fitness, i.e., speed_air.

Variable Ranking Based on Table 0


On the basis of table 0 we have 6 important features: speed_air, aircraft_type, height, pitch, duration, and no_pasg.

Step 17

list_imp <- c('speed_air','aircraft_type','height','pitch','duration','no_pasg')

m1 <- lm(distance ~ speed_air)
m2 <- lm(distance ~ speed_air + aircraft_type)
m3 <- lm(distance ~ speed_air + aircraft_type + height)
m4 <- lm(distance ~ speed_air + aircraft_type + height + pitch)
m5 <- lm(distance ~ speed_air + aircraft_type + height + pitch + duration)
m6 <- lm(distance ~ speed_air + aircraft_type + height + pitch + duration + no_pasg)

model <- list(m1, m2, m3, m4, m5, m6)

r_sq <- as.data.frame(lapply(model, function(x) summary(x)$r.squared))

colnames(r_sq) <- c("1", "2", "3", "4", "5", "6")

r_sq <- reshape2::melt(r_sq)
## No id variables; using all as measure variables
r_sq$value <- round(r_sq$value, 4)*100

ggplot(r_sq,aes(x = variable,y = value, group = 1)) +
  geom_path() +
  geom_point(shape = 21, color = "black", fill = "#69b3a2", size = 1) +
  labs(x = "Number of Features", y = "R-square percentage") + 
  geom_text(aes(label = value), size = 4) +
  theme(legend.title = element_blank(), legend.position = "none")

R-square increases or stays the same every time a variable is added. It plateaus after 3 features.

Step 18

r_adj_sq <- as.data.frame(lapply(model,function(x) summary(x)$adj.r.squared))

colnames(r_adj_sq) <- c("1", "2", "3", "4", "5", "6")

r_adj_sq <- reshape2::melt(r_adj_sq)
## No id variables; using all as measure variables
r_adj_sq$value <- round(r_adj_sq$value,5)*100

ggplot(r_adj_sq,aes(x = variable, y = value, group = 1)) + 
  geom_path() +
  geom_point(shape = 21, color = "black", fill = "#69b3a2", size = 1) +
  labs(x = "Number of Features", y = "Adj.R-square percentage") +
  geom_text(aes(label = value), size = 4) +
  theme(legend.title = element_blank(), legend.position = "none")

We observe that the Adjusted R-square values decrease as for each additional feature after the 3rd.

Step 19

aic_value <- as.data.frame(AIC(m1, m2, m3, m4, m5, m6))

print(aic_value)
##    df      AIC
## m1  3 2773.274
## m2  4 2622.261
## m3  5 2492.778
## m4  6 2494.348
## m5  7 2495.985
## m6  8 2496.322

From the AIC model we can see the m3 model is better as it has lowest AIC.

Step 20

From both Adj R-square and AIC we see that model m3 perform better. Features we should select for our model are speed_air, aircraft_type, and height.

Step 21

# Initialize a model using only the intercept
null_model <- lm(distance ~ 1, data = FAA)
summary(null_model)
## 
## Call:
## lm(formula = distance ~ 1, data = FAA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1061.6  -639.7  -268.1   389.6  3507.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2802.48      61.81   45.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 865.4 on 195 degrees of freedom
step <- stepAIC(object = null_model,
               scope = list(lower = null_model, upper = m6),
               direction = "forward")
## Start:  AIC=2652.17
## distance ~ 1
## 
##                 Df Sum of Sq       RSS    AIC
## + speed_air      1 130500410  15540815 2215.1
## + aircraft_type  1   4813145 141228080 2647.6
## <none>                       146041225 2652.2
## + height         1   1072104 144969121 2652.7
## + pitch          1    202469 145838756 2653.9
## + duration       1    196839 145844386 2653.9
## + no_pasg        1     51585 145989640 2654.1
## 
## Step:  AIC=2215.05
## distance ~ speed_air
## 
##                 Df Sum of Sq      RSS    AIC
## + aircraft_type  1   8421618  7119197 2064.0
## + height         1   2947316 12593499 2175.8
## + pitch          1    891909 14648906 2205.5
## <none>                       15540815 2215.1
## + no_pasg        1    115542 15425273 2215.6
## + duration       1      5070 15535745 2217.0
## 
## Step:  AIC=2064.04
## distance ~ speed_air + aircraft_type
## 
##            Df Sum of Sq     RSS    AIC
## + height    1   3479278 3639920 1934.5
## <none>                  7119197 2064.0
## + duration  1     45170 7074028 2064.8
## + no_pasg   1     34898 7084299 2065.1
## + pitch     1     14555 7104642 2065.6
## 
## Step:  AIC=1934.55
## distance ~ speed_air + aircraft_type + height
## 
##            Df Sum of Sq     RSS    AIC
## <none>                  3639920 1934.5
## + no_pasg   1     32041 3607879 1934.8
## + pitch     1      7971 3631948 1936.1
## + duration  1      7299 3632621 1936.2
step$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## distance ~ 1
## 
## Final Model:
## distance ~ speed_air + aircraft_type + height
## 
## 
##              Step Df  Deviance Resid. Df Resid. Dev      AIC
## 1                                    195  146041225 2652.172
## 2     + speed_air  1 130500410       194   15540815 2215.050
## 3 + aircraft_type  1   8421618       193    7119197 2064.037
## 4        + height  1   3479278       192    3639920 1934.554

Step AIC chooses distance ~ speed_air + aircraft_type + height as its final model. In other words, m3.