rm(list = ls())
library(tidyverse)
library(readxl)
library(MASS)
library(reshape2)
# Import datasets
FAA1 <- read_excel("C:/Users/Dhruv Cairae/Desktop/FAA1.xls")
FAA2 <- read_excel("C:/Users/Dhruv Cairae/Desktop/FAA2.xls")
str(FAA1)
## tibble [800 x 8] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:800] "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num [1:800] 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num [1:800] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:800] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:800] 109 103 NA NA NA ...
## $ height : num [1:800] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:800] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:800] 3370 2988 1145 1664 1050 ...
str(FAA2)
## tibble [150 x 7] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:150] "boeing" "boeing" "boeing" "boeing" ...
## $ no_pasg : num [1:150] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:150] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:150] 109 103 NA NA NA ...
## $ height : num [1:150] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:150] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:150] 3370 2988 1145 1664 1050 ...
FAA1 has 8 variables and a 800 observations whereas FAA2 has 7 variables and 150 observations. Variables and their class is identical across data sets, the only difference being the absence of duration variables in FAA2
FAA <- full_join(FAA1, FAA2)
## Joining, by = c("aircraft", "no_pasg", "speed_ground", "speed_air", "height", "pitch", "distance")
No unique identifier such as flight number exists in both the data frames such as flight number. This creates the complication of an absence of a unique column to join the data frames. The process of full join eliminates rows that have the exact same values across the 7 common variables in both the data frames by default. Its not possible to determine without checking with source of data that the if all values are identical across the two data sets with duration missing, they are indeed identical,there is no time stamp or unique identifier per line item. However for further analysis it is assumed that if all 7 values are common across both data frames, they are indeed identical and analysis shall proceed with 850 observations across 8 variables.
distinct(FAA)
## # A tibble: 850 x 8
## aircraft duration no_pasg speed_ground speed_air height pitch distance
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 boeing 98.5 53 108. 109. 27.4 4.04 3370.
## 2 boeing 126. 69 102. 103. 27.8 4.12 2988.
## 3 boeing 112. 61 71.1 NA 18.6 4.43 1145.
## 4 boeing 197. 56 85.8 NA 30.7 3.88 1664.
## 5 boeing 90.1 70 59.9 NA 32.4 4.03 1050.
## 6 boeing 138. 55 75.0 NA 41.2 4.20 1627.
## 7 boeing 73.0 54 54.4 NA 24.0 3.84 805.
## 8 boeing 52.9 57 57.1 NA 19.4 4.64 574.
## 9 boeing 156. 61 85.4 NA 35.4 4.23 1699.
## 10 boeing 177. 56 61.8 NA 36.7 4.18 1138.
## # ... with 840 more rows
# Remove the redundant FAA1 and FAA2 variables from memory
rm(FAA1, FAA2)
The operation distinct to eliminate duplicates does not remove any rows, therefore there are no duplicates.
str(FAA)
## tibble [850 x 8] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:850] "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num [1:850] 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num [1:850] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:850] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:850] 109 103 NA NA NA ...
## $ height : num [1:850] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:850] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:850] 3370 2988 1145 1664 1050 ...
summary(FAA)
## aircraft duration no_pasg speed_ground
## Length:850 Min. : 14.76 Min. :29.0 Min. : 27.74
## Class :character 1st Qu.:119.49 1st Qu.:55.0 1st Qu.: 65.90
## Mode :character Median :153.95 Median :60.0 Median : 79.64
## Mean :154.01 Mean :60.1 Mean : 79.45
## 3rd Qu.:188.91 3rd Qu.:65.0 3rd Qu.: 92.06
## Max. :305.62 Max. :87.0 Max. :141.22
## NA's :50
## speed_air height pitch distance
## Min. : 90.00 Min. :-3.546 Min. :2.284 Min. : 34.08
## 1st Qu.: 96.25 1st Qu.:23.314 1st Qu.:3.642 1st Qu.: 883.79
## Median :101.15 Median :30.093 Median :4.008 Median :1258.09
## Mean :103.80 Mean :30.144 Mean :4.009 Mean :1526.02
## 3rd Qu.:109.40 3rd Qu.:36.993 3rd Qu.:4.377 3rd Qu.:1936.95
## Max. :141.72 Max. :59.946 Max. :5.927 Max. :6533.05
## NA's :642
850 observations across 8 variables
Observations of negative height, clearly data entry errors, need rectification. Duration has missing observations. Observations with maximum distance and those above the 3rd quantile, should be focused on. Speed Air needs more research due to the large proportion of missing data. At this preliminary stage focus would be on data integrity, obvious outliers or anomalies that appear problematic & values of the response variable that are of significance.
From the summary from Step 4, there are abnormal observations in the dataset:
# Using table, we can confirm that only airbus and boeing aircraft are present.
FAA %>% dplyr::select(aircraft) %>% table()
## .
## airbus boeing
## 450 400
# Record the number of precleaning observations
no_pre_clean <- nrow(FAA)
# Delete the abnormal observations from the dataset.
FAA %>%
# Remove missing values from the dataset
filter(!is.na(duration), !is.na(speed_air)) %>%
# Remove remaining anomalous observations
filter(duration > 40,
between(speed_ground, 30, 140),
between(speed_air, 30, 140)
) -> FAA
# Number of Rows Deleted: 654
no_pre_clean - nrow(FAA)
## [1] 654
# Check structure of dataset and no. of variables
str(FAA)
## tibble [196 x 8] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:196] "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num [1:196] 98.5 125.7 72.3 187.6 233.8 ...
## $ no_pasg : num [1:196] 53 69 54 58 69 55 70 53 62 65 ...
## $ speed_ground: num [1:196] 107.9 101.7 93.4 94 104.8 ...
## $ speed_air : num [1:196] 109.3 102.9 92.9 96.2 103.9 ...
## $ height : num [1:196] 27.4 27.8 32.2 33.7 43.9 ...
## $ pitch : num [1:196] 4.04 4.12 3.82 4.64 3.25 ...
## $ distance : num [1:196] 3370 2988 2129 2305 3214 ...
# The structure is unchanged, but now there are only 196 observations of 8
# variables.
# Calculate summary statistics for each variable.
summary(FAA)
## aircraft duration no_pasg speed_ground
## Length:196 Min. : 45.5 Min. :41.00 Min. : 88.69
## Class :character 1st Qu.:116.4 1st Qu.:56.00 1st Qu.: 95.30
## Mode :character Median :149.3 Median :60.00 Median :100.90
## Mean :150.7 Mean :59.85 Mean :103.60
## 3rd Qu.:185.0 3rd Qu.:65.00 3rd Qu.:109.65
## Max. :287.0 Max. :80.00 Max. :136.66
## speed_air height pitch distance
## Min. : 90.00 Min. : 9.697 Min. :2.702 Min. :1741
## 1st Qu.: 96.16 1st Qu.:23.409 1st Qu.:3.638 1st Qu.:2163
## Median :100.99 Median :29.874 Median :4.087 Median :2534
## Mean :103.67 Mean :30.430 Mean :4.044 Mean :2802
## 3rd Qu.:109.48 3rd Qu.:36.716 3rd Qu.:4.431 3rd Qu.:3192
## Max. :136.42 Max. :58.228 Max. :5.311 Max. :6310
# Table of aircraft manufacturer values.
FAA %>% dplyr::select(aircraft) %>% table()
## .
## airbus boeing
## 77 119
Our cleaning hit the data from both manufacturers hard, especially airbus.
# Generate histograms for each numeric variable.
# Duration
FAA %>%
ggplot() +
geom_histogram(aes(x = duration)) +
labs(x = "Duration (minutes)",
y = "Number of Observations",
title = "Duration Histogram")
# Number of Passengers
FAA %>%
ggplot() +
geom_histogram(aes(x = no_pasg)) +
labs(x = "Number of Passengers",
y = "Number of Observations",
title = "Number of Passengers Histogram")
# Ground Speed
FAA %>%
ggplot() +
geom_histogram(aes(x = speed_ground)) +
labs(x = "Ground Speed (MPH)",
y = "Number of Observations",
title = "Ground Speed Histogram")
# Air Speed
FAA %>%
ggplot() +
geom_histogram(aes(x = speed_air)) +
labs(x = "Air Speed (MPH)",
y = "Number of Observations",
title = "Air Speed Histogram")
# Height
FAA %>%
ggplot() +
geom_histogram(aes(x = height)) +
labs(x = "Height (meters)",
y = "Number of Observations",
title = "Height Histogram")
# Pitch
FAA %>%
ggplot() +
geom_histogram(aes(x = pitch)) +
labs(x = "Pitch (°)",
y = "Number of Observations",
title = "Pitch Histogram")
# Distance
FAA %>%
ggplot() +
geom_histogram(aes(x = distance)) +
labs(x = "Distance (feet)",
y = "Number of Observations",
title = "Distance Histogram")
# Converting aircraft to factor and creating a corresponding binary variable -
# aircraft_type.
names(FAA)
## [1] "aircraft" "duration" "no_pasg" "speed_ground" "speed_air"
## [6] "height" "pitch" "distance"
FAA %>%
mutate(aircraft = as.factor(aircraft)) %>%
mutate(aircraft_type = ifelse(aircraft == "boeing", 1, 0)) %>%
dplyr::select(aircraft, aircraft_type, duration, no_pasg,
speed_ground, speed_air, height, pitch, distance) -> FAA
# Computing pairwise correlation between distance and the numeric features.
# Assigning the resultant table proper column names and adding the direction
# column as instructed.
table_1 <- data.frame(cor(FAA[2:8], FAA$distance))
colnames(table_1) <- c("coefficient")
table_1 <- tibble::rownames_to_column(table_1, "features") %>%
arrange(desc(abs(coefficient))) %>%
mutate(direction = ifelse(coefficient < 0, "negative","positive"))
table_1
## features coefficient direction
## 1 speed_air 0.94529684 positive
## 2 speed_ground 0.93171687 positive
## 3 aircraft_type 0.18154185 positive
## 4 height 0.08568026 positive
## 5 pitch 0.03723416 positive
## 6 duration 0.03671284 positive
## 7 no_pasg -0.01879416 negative
The correlation coefficient is very high between distance and speed_air(0.945) as well as speed_ground(0.932), indicating that flights with higher speed in both air and land have higher landing distance. Speed in air seems to have a slightly higher positive correlation with the landing distance than speed on the ground. The remaining features all have extremely low coefficients (less than 0.1). Out of all features, number of passengers appears to have a very weak negative correlation with landing distance.
The scatterplots of landing distance against the other features are plotted using the code below. The scatteplots reveal a seemingly strong linear relationship between distance and speed of the aircraft in both air and on land, which is expected based on the correlation coefficient values calculated earlier. However, the distributions of height, pitch, number of passengers and duration do not reveal a linear relationship with distance. This corroborates the very low correlation coefficient values computed in the previous step.
# Air Speed
p1 <- ggplot(FAA, aes(x = distance, y = speed_air)) +
geom_point(alpha = 0.9) +
labs(y = "speed_air", title = "")
# Ground Speed
p2 <- ggplot(FAA, aes(x = distance, y = speed_ground)) +
geom_point(alpha = 0.9) +
labs(y = "speed_ground", title = "")
# Height
p3 <- ggplot(FAA, aes(x = distance, y = height)) +
geom_point(alpha = 0.9) +
labs(y = "height", title = "")
# Pitch
p4 <- ggplot(FAA, aes(x = distance, y = pitch)) +
geom_point(alpha = 0.9) +
labs(y = "pitch", title = "")
# Duration
p5 <- ggplot(FAA, aes(x = distance, y = duration)) +
geom_point(alpha = 0.9) +
labs(y = "duration", title = "")
# Number of Passengers
p6 <- ggplot(FAA, aes(x = distance, y = no_pasg)) +
geom_point(alpha = 0.9) +
labs(y = "no_pasg", title = "")
gridExtra::grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 3)
Repeating the variable relationship plots including aircraft type to determine the effect of aircraft carrier in the relationships: The plots reveal the same patterns as observed in the previous Step 11. The scatterplots reveal a linear relationship between distance and speed of the aircraft in both air and on land irrespective of aircraft type. However, the distributions of height, pitch , number of passengers and duration do not reveal a linear relationship with distance for either type of carrier.
# Air Speed
p12 <- ggplot(FAA, aes(x = distance, y = speed_air, color = aircraft)) +
geom_point(alpha = 0.9) +
theme_light() +
labs(x = "distance", y = "speed_air") +
theme(legend.title = element_blank(), legend.position = c(0.2, 0.8))
# Ground Speed
p22 <- ggplot(FAA, aes(x = distance, y = speed_ground, color = aircraft)) +
geom_point(alpha = 0.9) +
theme_light() +
labs(x = "distance", y = "speed_ground") +
theme(legend.title = element_blank(), legend.position = "none")
# Height
p32 <- ggplot(FAA, aes(x = distance, y = height, color = aircraft)) +
geom_point(alpha = 0.9) +
theme_light() +
labs(x = "distance", y = "height") +
theme(legend.title = element_blank(), legend.position = "none")
# Pitch
p42 <- ggplot(FAA, aes(x = distance, y = pitch, color = aircraft)) +
geom_point(alpha = 0.9) +
theme_light() +
labs(x = "distance", y = "pitch") +
theme(legend.title = element_blank(), legend.position = "none")
# Duration
p52 <- ggplot(FAA, aes(x = distance, y = duration, color = aircraft)) +
geom_point(alpha = 0.9) +
theme_light() +
labs(x = "distance", y = "duration") +
theme(legend.title = element_blank(), legend.position = "none")
# Number of Passengers
p62 <- ggplot(FAA, aes(x = distance, y = no_pasg, color = aircraft)) +
geom_point(alpha = 0.9) +
theme_light() +
labs(x = "distance", y = "no_pasg") +
theme(legend.title = element_blank(), legend.position = "none")
gridExtra::grid.arrange(p12, p22, p32, p42, p52, p62, ncol = 3)
attach(FAA)
lr1 <- lm(distance ~ no_pasg)
lr2 <- lm(distance ~ duration)
lr3 <- lm(distance ~ height)
lr4 <- lm(distance ~ pitch)
lr5 <- lm(distance ~ aircraft_type)
lr6 <- lm(distance ~ speed_air)
lr7 <- lm(distance ~ speed_ground)
mod <- list(lr1, lr2, lr3, lr4, lr5, lr6, lr7)
r2 <- lapply(mod, FUN = function(lin.mod) summary(lin.mod)$coefficients[8])
r3 <- lapply(mod, FUN = function(lin.mod) summary(lin.mod)$coefficients[2])
r4 <- list('no_pasg','duration','height','pitch','aircraft_type','speed_air','speed_ground')
r <- do.call(rbind, Map(data.frame, features = r4, p_value = r2, Estimate = r3 ))
r %>%
mutate(direction = ifelse(Estimate >= 0,'Positive','Negative')) %>%
dplyr::select(features,p_value,direction) %>%
arrange(p_value) -> table_2
table_2
## features p_value direction
## 1 speed_air 2.514701e-96 Positive
## 2 speed_ground 2.836451e-87 Positive
## 3 aircraft_type 1.088014e-02 Positive
## 4 height 2.324592e-01 Positive
## 5 pitch 6.043719e-01 Positive
## 6 duration 6.094459e-01 Positive
## 7 no_pasg 7.937391e-01 Negative
FAA %>% mutate_at(c(3:8), ~(scale(.) %>% as.vector)) -> FAA_normalized
attach(FAA_normalized)
## The following objects are masked from FAA:
##
## aircraft, aircraft_type, distance, duration, height, no_pasg,
## pitch, speed_air, speed_ground
lr_1 <- lm(distance ~ no_pasg)
lr_2 <- lm(distance ~duration)
lr_3 <- lm(distance ~ height)
lr_4 <- lm(distance ~pitch)
lr_5 <- lm(distance ~ aircraft_type)
lr_6 <- lm(distance ~ speed_air)
lr_7 <- lm(distance ~ speed_ground)
models_norm <- list(lr_1, lr_2, lr_3,lr_4, lr_5, lr_6, lr_7)
coeff_norm <- lapply(models_norm, FUN = function(lin.mod) summary(lin.mod)$coefficients[2])
variable_name <- list('no_pasg','duration','height','pitch','aircraft_type','speed_air','speed_ground')
tab <- do.call(rbind, Map(data.frame, features = variable_name, Estimate = coeff_norm ))
tab %>%
mutate(direction = ifelse(Estimate >= 0,'Positive','Negative')) %>%
arrange(desc(Estimate)) -> table_3
table_3
## features Estimate direction
## 1 speed_air 818.06654 Positive
## 2 speed_ground 806.31435 Positive
## 3 aircraft_type 320.86598 Positive
## 4 height 74.14830 Positive
## 5 pitch 32.22271 Positive
## 6 duration 31.77155 Positive
## 7 no_pasg -16.26460 Negative
names(table_1)
## [1] "features" "coefficient" "direction"
names(table_2)
## [1] "features" "p_value" "direction"
names(table_3)
## [1] "features" "Estimate" "direction"
merge_1 <- merge(table_1, table_3, by = 'features')
merge_1
## features coefficient direction.x Estimate direction.y
## 1 aircraft_type 0.18154185 positive 320.86598 Positive
## 2 duration 0.03671284 positive 31.77155 Positive
## 3 height 0.08568026 positive 74.14830 Positive
## 4 no_pasg -0.01879416 negative -16.26460 Negative
## 5 pitch 0.03723416 positive 32.22271 Positive
## 6 speed_air 0.94529684 positive 818.06654 Positive
## 7 speed_ground 0.93171687 positive 806.31435 Positive
merge_2 <- merge(merge_1, table_2, by = 'features')
merge_2
## features coefficient direction.x Estimate direction.y p_value
## 1 aircraft_type 0.18154185 positive 320.86598 Positive 1.088014e-02
## 2 duration 0.03671284 positive 31.77155 Positive 6.094459e-01
## 3 height 0.08568026 positive 74.14830 Positive 2.324592e-01
## 4 no_pasg -0.01879416 negative -16.26460 Negative 7.937391e-01
## 5 pitch 0.03723416 positive 32.22271 Positive 6.043719e-01
## 6 speed_air 0.94529684 positive 818.06654 Positive 2.514701e-96
## 7 speed_ground 0.93171687 positive 806.31435 Positive 2.836451e-87
## direction
## 1 Positive
## 2 Positive
## 3 Positive
## 4 Negative
## 5 Positive
## 6 Positive
## 7 Positive
table_0 <- subset(merge_2, select = -c(direction.x,direction.y,direction))
table_0
## features coefficient Estimate p_value
## 1 aircraft_type 0.18154185 320.86598 1.088014e-02
## 2 duration 0.03671284 31.77155 6.094459e-01
## 3 height 0.08568026 74.14830 2.324592e-01
## 4 no_pasg -0.01879416 -16.26460 7.937391e-01
## 5 pitch 0.03723416 32.22271 6.043719e-01
## 6 speed_air 0.94529684 818.06654 2.514701e-96
## 7 speed_ground 0.93171687 806.31435 2.836451e-87
table_0 %>% arrange(desc(coefficient))
## features coefficient Estimate p_value
## 1 speed_air 0.94529684 818.06654 2.514701e-96
## 2 speed_ground 0.93171687 806.31435 2.836451e-87
## 3 aircraft_type 0.18154185 320.86598 1.088014e-02
## 4 height 0.08568026 74.14830 2.324592e-01
## 5 pitch 0.03723416 32.22271 6.043719e-01
## 6 duration 0.03671284 31.77155 6.094459e-01
## 7 no_pasg -0.01879416 -16.26460 7.937391e-01
Looking at correlation coffecients from table_1, and beta coefficients and p-value from table_2 and table_3 we find that speed_air, speed_ground, aircraft_type are relatively important features when compared to other features like height, pitch, duration and no_pasg. The table_0 lists the features in their decreasing order of importance.
attach(FAA)
## The following objects are masked from FAA_normalized:
##
## aircraft, aircraft_type, distance, duration, height, no_pasg,
## pitch, speed_air, speed_ground
## The following objects are masked from FAA (pos = 4):
##
## aircraft, aircraft_type, distance, duration, height, no_pasg,
## pitch, speed_air, speed_ground
l1 <- lm(distance ~ speed_ground)
l2 <- lm(distance ~ speed_air)
l3 <- lm(distance ~ speed_ground +speed_air)
model <- list(l1,l2,l3)
coeffecient <- lapply(model,function(lin.mod)
if (length(lin.mod$coefficients) == 2) lin.mod$coefficients[2]
else lin.mod$coefficients[2:3])
Yes there is a significant change in sign between speed_ground & speed_air. We need to check correlation between the two variables.
cor(speed_ground,speed_air)
## [1] 0.988969
The two variable are very highly correlated and we need to keep only one for analysis. We will keep the one which has better univariate fitness, i.e., speed_air.
On the basis of table 0 we have 6 important features: speed_air, aircraft_type, height, pitch, duration, and no_pasg.
list_imp <- c('speed_air','aircraft_type','height','pitch','duration','no_pasg')
m1 <- lm(distance ~ speed_air)
m2 <- lm(distance ~ speed_air + aircraft_type)
m3 <- lm(distance ~ speed_air + aircraft_type + height)
m4 <- lm(distance ~ speed_air + aircraft_type + height + pitch)
m5 <- lm(distance ~ speed_air + aircraft_type + height + pitch + duration)
m6 <- lm(distance ~ speed_air + aircraft_type + height + pitch + duration + no_pasg)
model <- list(m1, m2, m3, m4, m5, m6)
r_sq <- as.data.frame(lapply(model, function(x) summary(x)$r.squared))
colnames(r_sq) <- c("1", "2", "3", "4", "5", "6")
r_sq <- reshape2::melt(r_sq)
## No id variables; using all as measure variables
r_sq$value <- round(r_sq$value, 4)*100
ggplot(r_sq,aes(x = variable,y = value, group = 1)) +
geom_path() +
geom_point(shape = 21, color = "black", fill = "#69b3a2", size = 1) +
labs(x = "Number of Features", y = "R-square percentage") +
geom_text(aes(label = value), size = 4) +
theme(legend.title = element_blank(), legend.position = "none")
R-square increases or stays the same every time a variable is added. It plateaus after 3 features.
r_adj_sq <- as.data.frame(lapply(model,function(x) summary(x)$adj.r.squared))
colnames(r_adj_sq) <- c("1", "2", "3", "4", "5", "6")
r_adj_sq <- reshape2::melt(r_adj_sq)
## No id variables; using all as measure variables
r_adj_sq$value <- round(r_adj_sq$value,5)*100
ggplot(r_adj_sq,aes(x = variable, y = value, group = 1)) +
geom_path() +
geom_point(shape = 21, color = "black", fill = "#69b3a2", size = 1) +
labs(x = "Number of Features", y = "Adj.R-square percentage") +
geom_text(aes(label = value), size = 4) +
theme(legend.title = element_blank(), legend.position = "none")
We observe that the Adjusted R-square values decrease as for each additional feature after the 3rd.
aic_value <- as.data.frame(AIC(m1, m2, m3, m4, m5, m6))
print(aic_value)
## df AIC
## m1 3 2773.274
## m2 4 2622.261
## m3 5 2492.778
## m4 6 2494.348
## m5 7 2495.985
## m6 8 2496.322
From the AIC model we can see the m3 model is better as it has lowest AIC.
From both Adj R-square and AIC we see that model m3 perform better. Features we should select for our model are speed_air, aircraft_type, and height.
# Initialize a model using only the intercept
null_model <- lm(distance ~ 1, data = FAA)
summary(null_model)
##
## Call:
## lm(formula = distance ~ 1, data = FAA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1061.6 -639.7 -268.1 389.6 3507.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2802.48 61.81 45.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 865.4 on 195 degrees of freedom
step <- stepAIC(object = null_model,
scope = list(lower = null_model, upper = m6),
direction = "forward")
## Start: AIC=2652.17
## distance ~ 1
##
## Df Sum of Sq RSS AIC
## + speed_air 1 130500410 15540815 2215.1
## + aircraft_type 1 4813145 141228080 2647.6
## <none> 146041225 2652.2
## + height 1 1072104 144969121 2652.7
## + pitch 1 202469 145838756 2653.9
## + duration 1 196839 145844386 2653.9
## + no_pasg 1 51585 145989640 2654.1
##
## Step: AIC=2215.05
## distance ~ speed_air
##
## Df Sum of Sq RSS AIC
## + aircraft_type 1 8421618 7119197 2064.0
## + height 1 2947316 12593499 2175.8
## + pitch 1 891909 14648906 2205.5
## <none> 15540815 2215.1
## + no_pasg 1 115542 15425273 2215.6
## + duration 1 5070 15535745 2217.0
##
## Step: AIC=2064.04
## distance ~ speed_air + aircraft_type
##
## Df Sum of Sq RSS AIC
## + height 1 3479278 3639920 1934.5
## <none> 7119197 2064.0
## + duration 1 45170 7074028 2064.8
## + no_pasg 1 34898 7084299 2065.1
## + pitch 1 14555 7104642 2065.6
##
## Step: AIC=1934.55
## distance ~ speed_air + aircraft_type + height
##
## Df Sum of Sq RSS AIC
## <none> 3639920 1934.5
## + no_pasg 1 32041 3607879 1934.8
## + pitch 1 7971 3631948 1936.1
## + duration 1 7299 3632621 1936.2
step$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## distance ~ 1
##
## Final Model:
## distance ~ speed_air + aircraft_type + height
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 195 146041225 2652.172
## 2 + speed_air 1 130500410 194 15540815 2215.050
## 3 + aircraft_type 1 8421618 193 7119197 2064.037
## 4 + height 1 3479278 192 3639920 1934.554
Step AIC chooses distance ~ speed_air + aircraft_type + height as its final model. In other words, m3.