Motivation: To reduce the risk of landing overrun.
Goal: To study which factors affect the landing distance of a commercial flight, and how.
library(tidyverse) #to visualize, transform, input, tidy and join data
library(dplyr) #data wrangling
library(stringr) #string related functions
library(kableExtra) #to create HTML Table
library(DT) #to preview the data sets
library(lubridate) #to apply the date functions
library(xlsx) #to load excel files
The data has the following columns:
| Variable | Description |
|---|---|
| aircraft | make of the aircraft (airbus or boeing) |
| duration | flight duration between take-off and landing |
| no_pasg | number of passengers |
| speed_ground | ground speed |
| speed_air | air speed |
| height | height |
| pitch | pitch angle |
| distance | landing distance |
Step 1: Load the two data sets.
faa1 <- read.xlsx("FAA1.xls", sheetName = "FAA1")
faa2 <- read.xlsx("FAA2_2.xls", sheetName = "Sheet1")
Step 2: Check the structure of each data set using the “str” function. For each data set, what is the sample size and how many variables? Is there any difference between the two data sets?
For FAA1:
str(faa1)
## 'data.frame': 800 obs. of 8 variables:
## $ aircraft : Factor w/ 2 levels "airbus","boeing": 2 2 2 2 2 2 2 2 2 2 ...
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
Data has 800 observations and 8 variables.
For FAA2:
str(faa2)
## 'data.frame': 150 obs. of 7 variables:
## $ aircraft : Factor w/ 2 levels "airbus","boeing": 2 2 2 2 2 2 2 2 2 2 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
Data has 150 observations and 7 variables.
FAA2 does not contain the duration of the flights.
Step 3: Merge the two data sets. Are there any duplications?
faa <- bind_rows(faa1, faa2)
str(faa)
## 'data.frame': 950 obs. of 8 variables:
## $ aircraft : Factor w/ 2 levels "airbus","boeing": 2 2 2 2 2 2 2 2 2 2 ...
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
# count the rows that are duplicated once duration is ignored (FAA2 has no duration column)
faa %>%
select(-duration) %>%
duplicated() %>%
sum()
## [1] 100
There are 100 duplicated rows in total, which I remove:
check <- faa %>%
select(-duration) %>%
duplicated() %>%
which()
faa <- faa[-check,]
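A possible alternative, sketched on made-up data (the `toy` frame and its values are assumptions, not the FAA data): filtering with a logical mask gives the same result as `faa[-check, ]` when duplicates exist, and avoids the case where an empty `check` makes negative indexing return zero rows.

```r
# Toy data; the values are illustrative only.
toy <- data.frame(
  aircraft = c("boeing", "airbus", "boeing"),
  duration = c(98.5, 120.0, NA),
  distance = c(3370, 1200, 3370)
)

# Compare rows on every column except duration.
key_cols <- setdiff(names(toy), "duration")
is_dup <- duplicated(toy[key_cols])

# A logical mask keeps all rows when nothing is duplicated.
toy_unique <- toy[!is_dup, ]

# Negative indexing with an empty index selects no rows at all.
no_dups <- which(c(FALSE, FALSE, FALSE))
rows_left <- nrow(toy[-no_dups, ])
```

Here `rows_left` shows what would happen to a data set without duplicates under the `[-check, ]` idiom.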
Step 4. Check the structure of the combined data set. What is the sample size and how many variables? Provide summary statistics for each variable.
str(faa)
## 'data.frame': 850 obs. of 8 variables:
## $ aircraft : Factor w/ 2 levels "airbus","boeing": 2 2 2 2 2 2 2 2 2 2 ...
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
Data has 850 observations and 8 variables.
summary(faa)
## aircraft duration no_pasg speed_ground
## airbus:450 Min. : 14.76 Min. :29.0 Min. : 27.74
## boeing:400 1st Qu.:119.49 1st Qu.:55.0 1st Qu.: 65.90
## Median :153.95 Median :60.0 Median : 79.64
## Mean :154.01 Mean :60.1 Mean : 79.45
## 3rd Qu.:188.91 3rd Qu.:65.0 3rd Qu.: 92.06
## Max. :305.62 Max. :87.0 Max. :141.22
## NA's :50
## speed_air height pitch distance
## Min. : 90.00 Min. :-3.546 Min. :2.284 Min. : 34.08
## 1st Qu.: 96.25 1st Qu.:23.314 1st Qu.:3.642 1st Qu.: 883.79
## Median :101.15 Median :30.093 Median :4.008 Median :1258.09
## Mean :103.80 Mean :30.144 Mean :4.009 Mean :1526.02
## 3rd Qu.:109.40 3rd Qu.:36.993 3rd Qu.:4.377 3rd Qu.:1936.95
## Max. :141.72 Max. :59.946 Max. :5.927 Max. :6533.05
## NA's :642
Step 5. Key findings
Several variables contain implausible values, possibly because of data-capture or data-entry problems:
1. height has a negative minimum (-3.546).
2. The minimum distance is 34, which is too small for a landing.
3. speed_air is missing for about 75% of the observations (642 of 850).
4. The minimum flight duration is about 15 minutes, which does not seem right.
5. There were 100 duplicate records after merging the two data sets.
Step 6. Are there abnormal values in the data set? Please refer to the variable dictionary for criteria defining “normal/abnormal” values. Remove the rows that contain any “abnormal values” and report how many rows you have removed.
According to the summary, speed_air is already within the allowed range, so I do not filter on it.
faa_check <- faa %>%
filter((duration > 40| is.na(duration)) & (speed_ground >= 30) & (speed_ground <= 140) &
(height >= 6) & (distance < 6000))
dim(faa_check)
## [1] 831 8
faa <- faa_check
A total of 19 observations (850 - 831) are abnormal and are removed.
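To report which rule removes how many rows, the filter can be split into one logical column per rule. This is a minimal sketch on made-up data; the thresholds are copied from the filter above, while the `toy` frame and the helper names are assumptions.

```r
# Toy data; the values are illustrative only.
toy <- data.frame(
  duration     = c(100, 30, NA, 150),
  speed_ground = c(80, 90, 25, 100),
  height       = c(30, 20, 10, 3),
  distance     = c(1500, 2000, 900, 7000)
)

# One logical column per rule; a missing duration is treated as acceptable.
rules <- with(toy, cbind(
  duration_ok     = duration > 40 | is.na(duration),
  speed_ground_ok = speed_ground >= 30 & speed_ground <= 140,
  height_ok       = height >= 6,
  distance_ok     = distance < 6000
))

# Number of rows failing each rule (a row can fail more than one).
violations <- colSums(!rules)

# Keep only the rows that pass every rule.
keep <- rowSums(!rules) == 0
toy_clean <- toy[keep, ]
```

Because one row can break several rules, the per-rule counts in `violations` may add up to more than the number of rows removed.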
Step 7. Summary
str(faa)
## 'data.frame': 831 obs. of 8 variables:
## $ aircraft : Factor w/ 2 levels "airbus","boeing": 2 2 2 2 2 2 2 2 2 2 ...
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
summary(faa)
## aircraft duration no_pasg speed_ground
## airbus:444 Min. : 41.95 Min. :29.00 Min. : 33.57
## boeing:387 1st Qu.:119.63 1st Qu.:55.00 1st Qu.: 66.20
## Median :154.28 Median :60.00 Median : 79.79
## Mean :154.78 Mean :60.06 Mean : 79.54
## 3rd Qu.:189.66 3rd Qu.:65.00 3rd Qu.: 91.91
## Max. :305.62 Max. :87.00 Max. :132.78
## NA's :50
## speed_air height pitch distance
## Min. : 90.00 Min. : 6.228 Min. :2.284 Min. : 41.72
## 1st Qu.: 96.23 1st Qu.:23.530 1st Qu.:3.640 1st Qu.: 893.28
## Median :101.12 Median :30.167 Median :4.001 Median :1262.15
## Mean :103.48 Mean :30.458 Mean :4.005 Mean :1522.48
## 3rd Qu.:109.36 3rd Qu.:37.004 3rd Qu.:4.370 3rd Qu.:1936.63
## Max. :132.91 Max. :59.946 Max. :5.927 Max. :5381.96
## NA's :628
Data has 831 observations and 8 variables.
duration is missing for 50 observations (presumably the remaining FAA2 rows, since FAA2 has no duration column). I replace these missing values with the mean of the column.
faa$duration_corrected <- NA
faa <- transform(faa, duration_corrected = ifelse(is.na(faa$duration), mean(faa$duration, na.rm=TRUE), faa$duration))
summary(faa)
## aircraft duration no_pasg speed_ground
## airbus:444 Min. : 41.95 Min. :29.00 Min. : 33.57
## boeing:387 1st Qu.:119.63 1st Qu.:55.00 1st Qu.: 66.20
## Median :154.28 Median :60.00 Median : 79.79
## Mean :154.78 Mean :60.06 Mean : 79.54
## 3rd Qu.:189.66 3rd Qu.:65.00 3rd Qu.: 91.91
## Max. :305.62 Max. :87.00 Max. :132.78
## NA's :50
## speed_air height pitch distance
## Min. : 90.00 Min. : 6.228 Min. :2.284 Min. : 41.72
## 1st Qu.: 96.23 1st Qu.:23.530 1st Qu.:3.640 1st Qu.: 893.28
## Median :101.12 Median :30.167 Median :4.001 Median :1262.15
## Mean :103.48 Mean :30.458 Mean :4.005 Mean :1522.48
## 3rd Qu.:109.36 3rd Qu.:37.004 3rd Qu.:4.370 3rd Qu.:1936.63
## Max. :132.91 Max. :59.946 Max. :5.927 Max. :5381.96
## NA's :628
## duration_corrected
## Min. : 41.95
## 1st Qu.:122.67
## Median :154.78
## Mean :154.78
## 3rd Qu.:186.37
## Max. :305.62
##
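Mean imputation keeps the column mean unchanged but pulls the spread inward, which is why the quartiles of duration_corrected are slightly narrower than those of duration. A small sketch with made-up numbers (the vector below is an assumption, not FAA data):

```r
# Toy durations with two missing values.
duration <- c(100, 150, 200, NA, NA)

# Replace each NA with the mean of the observed values.
col_mean <- mean(duration, na.rm = TRUE)
duration_corrected <- ifelse(is.na(duration), col_mean, duration)

# The mean is preserved, but the standard deviation shrinks.
mean_after <- mean(duration_corrected)
sd_before  <- sd(duration, na.rm = TRUE)
sd_after   <- sd(duration_corrected)
```

If the understated variance matters for later inference, fitting the models on complete cases for duration is one alternative worth comparing.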
Step 8. Plotting histogram for all the variables.
#hist(faa$duration_impute, main = "Histogram of Duration", xlab = "Duration")
hist(faa$speed_ground, main = "Histogram of Ground Speed", xlab = "Ground Speed")
hist(faa$height, main = "Histogram of Height", xlab = "Height")
hist(faa$pitch, main = "Histogram of Pitch", xlab = "Pitch")
hist(faa$no_pasg, main = "Histogram of No. of Passengers", xlab = "No. of Passengers")
hist(faa$speed_air, main = "Histogram of Air Speed", xlab = "Air Speed")
hist(faa$distance, main = "Histogram of Landing Distance", xlab = "Landing Distance")
hist(faa$duration_corrected, main = "Histogram of Duration of flight", xlab = "Flight Duration in mins")
Step 9. Key findings
After cleaning the data, I observed that:
1. There were 19 abnormal observations in total.
2. duration has 50 NA values, which I replaced with the mean of the column.
3. speed_air is right-skewed, whereas the other variables appear approximately normally distributed.
4. The minimum speed_air is 90 MPH.
5. The response variable is landing distance.
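A rough numeric check can complement the visual reading of the histograms. The sketch below computes the moment-based sample skewness; the function name and the example vectors are assumptions for illustration. Positive values suggest a right tail and values near zero suggest symmetry.

```r
# Moment-based sample skewness: third central moment over variance^(3/2).
skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Illustrative vectors, not FAA data.
right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 20)
symmetric    <- c(1, 2, 3, 4, 5)

skew_right <- skewness(right_skewed)
skew_sym   <- skewness(symmetric)
```

Applied to the cleaned data, this could be run over each numeric column, for example with `sapply(faa[sapply(faa, is.numeric)], skewness)`.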
Step 10. Pairwise Correlation
cor_duration <- cor(faa$distance, faa$duration_corrected)
cor_speed_ground <- cor(faa$distance,faa$speed_ground)
cor_height <- cor(faa$distance,faa$height)
cor_pitch <- cor(faa$distance,faa$pitch)
cor_no_pasg <- cor(faa$distance,faa$no_pasg)
cor_speed_air <- cor.test(faa$distance,faa$speed_air,method="pearson")$estimate
cor_aircraft <- cor(faa$distance,as.numeric(faa$aircraft ))
variable_names <- c("Duration","Ground Speed","Height","Pitch","No. of Passengers","Air Speed","Aircraft")
correlation <- c(cor_duration,cor_speed_ground,cor_height,cor_pitch,cor_no_pasg,cor_speed_air,cor_aircraft)
table_1 <- data.frame(variable_names,correlation)
table_1$direction <- ifelse(table_1$correlation > 0, "Positive","Negative")
table_1 <- table_1 %>% arrange(desc(correlation))
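The per-variable correlations can also be computed in one pass. In this sketch, `use = "complete.obs"` drops the pairs with a missing value, which gives the same estimate as the `cor.test` call used for speed_air above. The `toy` frame and its values are assumptions.

```r
# Toy data with missing air speed; the values are illustrative only.
toy <- data.frame(
  distance     = c(1000, 1500, 2100, 2400, 3100, 3600),
  speed_ground = c(60, 70, 85, 88, 100, 110),
  speed_air    = c(NA, NA, 90, 92, 101, 112)
)

# Correlate the response with every other column, skipping incomplete pairs.
predictors <- setdiff(names(toy), "distance")
correlation <- sapply(predictors, function(v) {
  cor(toy$distance, toy[[v]], use = "complete.obs")
})

table_sketch <- data.frame(
  variable_names = predictors,
  correlation    = correlation,
  row.names      = NULL
)
table_sketch$direction <- ifelse(table_sketch$correlation > 0, "Positive", "Negative")
```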
Step 11. Show X-Y scatter plots
faa <- faa[-2] # drop column 2 (the original duration); duration_corrected is kept
GGally::ggpairs(
data = faa
)
The plots seem pretty consistent with the correlation values.
Step 12. Encoding aircraft type
GGally::ggpairs(
data = faa, diag = list(continuous =
"densityDiag", discrete = "barDiag", na = "naDiag")
)
Step 13. Regress Y (landing distance) on each of the X variables.
mdl_duration <- lm (faa$distance ~ faa$duration_corrected)
mdl_speedgrnd <- lm (faa$distance ~ faa$speed_ground)
mdl_height <- lm (faa$distance ~ faa$height)
mdl_pitch <- lm (faa$distance ~ faa$pitch)
mdl_nopasg <- lm (faa$distance ~ faa$no_pasg)
mdl_speedair <- lm (faa$distance ~ faa$speed_air)
mdl_aircraft <- lm (faa$distance ~ faa$aircraft)
duration <- summary(mdl_duration)$coef[2,c(1,4)]
speed_ground <- summary(mdl_speedgrnd)$coef[2,c(1,4)]
height <- summary(mdl_height)$coef[2,c(1,4)]
pitch <- summary(mdl_pitch)$coef[2,c(1,4)]
no_pasg <- summary(mdl_nopasg)$coef[2,c(1,4)]
speed_air <- summary(mdl_speedair)$coef[2,c(1,4)]
aircraft_boeing <- summary(mdl_aircraft)$coef[2,c(1,4)]
aircraft_airbus <- summary(mdl_aircraft)$coef[1,c(1,4)] # row 1 is the intercept (mean distance for the airbus baseline), not a slope
variable_names <- c("Duration","Ground Speed","Height","Pitch","No. of Passengers","Air Speed","Aircraft-Boeing", "Aircraft-Airbus")
slope <- c(duration[1], speed_ground[1], height[1], pitch[1], no_pasg[1],speed_air[1],aircraft_boeing[1],aircraft_airbus[1])
slope <- round(slope, digits = 3)
p_value <- c(duration[2], speed_ground[2], height[2], pitch[2], no_pasg[2],speed_air[2],aircraft_boeing[2],aircraft_airbus[2])
p_value <- round(p_value, digits = 3)
table_2 <- data.frame(variable_names, slope, p_value)
table_2$slope_direction <- ifelse(slope > 0 , "Positive", "Negative")
table_2 <- table_2 %>%
select(variable_names, p_value, slope_direction) %>%
arrange(p_value)
table_2
## variable_names p_value slope_direction
## 1 Ground Speed 0.000 Positive
## 2 Air Speed 0.000 Positive
## 3 Aircraft-Boeing 0.000 Positive
## 4 Aircraft-Airbus 0.000 Positive
## 5 Height 0.004 Positive
## 6 Pitch 0.012 Positive
## 7 Duration 0.148 Negative
## 8 No. of Passengers 0.609 Negative
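All the factors are significant at the 5% level except duration and number of passengers. Note that the Aircraft-Airbus row is the intercept of the aircraft model (the mean landing distance for the airbus baseline), not a slope.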
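The repeated single-variable fits can be written as a loop. This is a sketch under assumed names: `fit_one` is a hypothetical helper and `toy` is made-up data. It relies on `reformulate` to build each formula and on the `data` argument instead of `faa$` prefixes.

```r
# Toy data; the values are illustrative only.
toy <- data.frame(
  distance     = c(1000, 1500, 2100, 2400, 3100, 3600, 1800, 2900),
  speed_ground = c(60, 70, 85, 88, 100, 110, 75, 95),
  height       = c(30, 25, 35, 28, 40, 33, 22, 31)
)

# Fit distance ~ v and return the slope and its p-value (row 2 of the table).
fit_one <- function(v, data) {
  fit <- lm(reformulate(v, response = "distance"), data = data)
  coefs <- summary(fit)$coefficients
  data.frame(
    variable_names = v,
    slope          = coefs[2, "Estimate"],
    p_value        = coefs[2, "Pr(>|t|)"]
  )
}

table_sketch <- do.call(rbind, lapply(c("speed_ground", "height"), fit_one, data = toy))
table_sketch <- table_sketch[order(table_sketch$p_value), ]
```

Taking only row 2 of each coefficient table keeps intercepts out of the slope column.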
Step 14. Standardize each X variable
faa_adj <- faa
faa_adj$duration <- scale(faa_adj$duration_corrected, center = TRUE, scale = TRUE)
faa_adj$speed_ground <- scale(faa_adj$speed_ground, center = TRUE, scale = TRUE)
faa_adj$height <- scale(faa_adj$height, center = TRUE, scale = TRUE)
faa_adj$pitch <- scale(faa_adj$pitch, center = TRUE, scale = TRUE)
faa_adj$no_pasg <- scale(faa_adj$no_pasg, center = TRUE, scale = TRUE)
faa_adj$speed_air <- scale(faa_adj$speed_air, center = TRUE, scale = TRUE)
mdl_duration <- lm (faa_adj$distance ~ faa_adj$duration_corrected) # caution: duration_corrected is the unscaled column; the scaled copy is faa_adj$duration
mdl_speedgrnd <- lm (faa_adj$distance ~ faa_adj$speed_ground)
mdl_height <- lm (faa_adj$distance ~ faa_adj$height)
mdl_pitch <- lm (faa_adj$distance ~ faa_adj$pitch)
mdl_nopasg <- lm (faa_adj$distance ~ faa_adj$no_pasg)
mdl_speedair <- lm (faa_adj$distance ~ faa_adj$speed_air)
mdl_aircraft <- lm (faa_adj$distance ~ faa_adj$aircraft)
duration <- summary(mdl_duration)$coef[2,c(1,4)]
speed_ground <- summary(mdl_speedgrnd)$coef[2,c(1,4)]
height <- summary(mdl_height)$coef[2,c(1,4)]
pitch <- summary(mdl_pitch)$coef[2,c(1,4)]
no_pasg <- summary(mdl_nopasg)$coef[2,c(1,4)]
speed_air <- summary(mdl_speedair)$coef[2,c(1,4)]
aircraft_boeing <- summary(mdl_aircraft)$coef[2,c(1,4)]
aircraft_airbus <- summary(mdl_aircraft)$coef[1,c(1,4)] # row 1 is the intercept (mean distance for the airbus baseline), not a slope
variable_names <- c("Duration","Ground Speed","Height","Pitch","No. of Passengers","Air Speed","Aircraft-Boeing", "Aircraft-Airbus")
slope <- c(duration[1], speed_ground[1], height[1], pitch[1], no_pasg[1],speed_air[1],aircraft_boeing[1],aircraft_airbus[1])
slope <- round(slope, digits = 3)
p_value <- c(duration[2], speed_ground[2], height[2], pitch[2], no_pasg[2],speed_air[2],aircraft_boeing[2],aircraft_airbus[2])
p_value <- round(p_value, digits = 3)
table_3 <- data.frame(variable_names, slope, p_value)
table_3$slope_direction <- ifelse(slope > 0 , "Positive", "Negative")
table_3 <- table_3 %>%
select(variable_names, slope, slope_direction) %>%
arrange(desc(slope))
table_3
## variable_names slope slope_direction
## 1 Aircraft-Airbus 1323.317 Positive
## 2 Ground Speed 776.447 Positive
## 3 Air Speed 774.347 Positive
## 4 Aircraft-Boeing 427.666 Positive
## 5 Height 89.106 Positive
## 6 Pitch 78.007 Positive
## 7 Duration -0.961 Negative
## 8 No. of Passengers -15.916 Negative
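Standardization does not change the sign of any relationship. It only rescales the slopes to the change in distance per one standard deviation of X, which makes them comparable across variables. Two rows need care: the Duration slope (-0.961) comes from the unscaled duration_corrected column used in mdl_duration, and the aircraft rows are not standardized (Aircraft-Airbus is the intercept).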
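The effect of standardizing a predictor can be checked directly: the slope is multiplied by the standard deviation of X, while the t statistic (and so the p-value) stays the same. A minimal sketch with made-up numbers:

```r
# Illustrative vectors, not FAA data.
x <- c(60, 70, 85, 88, 100, 110, 75, 95)
y <- c(1000, 1500, 2100, 2400, 3100, 3600, 1800, 2900)

# Fit on the raw predictor and on its standardized version.
raw_fit <- lm(y ~ x)
z <- as.numeric(scale(x))
std_fit <- lm(y ~ z)

raw_slope <- coef(raw_fit)[["x"]]
std_slope <- coef(std_fit)[["z"]]

# t statistics of the two slopes.
raw_t <- summary(raw_fit)$coefficients["x", "t value"]
std_t <- summary(std_fit)$coefficients["z", "t value"]

# Standardized slope equals raw slope times sd(x).
rescaled <- raw_slope * sd(x)
```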
Step 15. Creating Table 0
table_0 <- merge(table_1, table_2) %>%
merge(table_3[,-3], by = "variable_names") %>%
arrange(desc(slope))
table_0
## variable_names correlation direction p_value slope_direction slope
## 1 Ground Speed 0.86624383 Positive 0.000 Positive 776.447
## 2 Air Speed 0.94209714 Positive 0.000 Positive 774.347
## 3 Height 0.09941121 Positive 0.004 Positive 89.106
## 4 Pitch 0.08702846 Positive 0.012 Positive 78.007
## 5 Duration -0.05026941 Negative 0.148 Negative -0.961
## 6 No. of Passengers -0.01775663 Negative 0.609 Negative -15.916
Step 16. Checking Collinearity
We see that both speed_air and speed_ground are closely related to the landing distance.
model_1 <- lm(distance ~ speed_ground,data=faa)
model_2 <- lm(distance ~ speed_air,data=faa)
model_3 <- lm(distance ~ speed_ground + speed_air,data=faa)
summary(model_1)
##
## Call:
## lm(formula = distance ~ speed_ground, data = faa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -897.09 -319.16 -72.09 210.83 1798.88
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1773.9407 67.8388 -26.15 <2e-16 ***
## speed_ground 41.4422 0.8302 49.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 448.1 on 829 degrees of freedom
## Multiple R-squared: 0.7504, Adjusted R-squared: 0.7501
## F-statistic: 2492 on 1 and 829 DF, p-value: < 2.2e-16
summary(model_2)
##
## Call:
## lm(formula = distance ~ speed_air, data = faa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -776.21 -196.39 8.72 209.17 624.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5455.709 207.547 -26.29 <2e-16 ***
## speed_air 79.532 1.997 39.83 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.3 on 201 degrees of freedom
## (628 observations deleted due to missingness)
## Multiple R-squared: 0.8875, Adjusted R-squared: 0.887
## F-statistic: 1586 on 1 and 201 DF, p-value: < 2.2e-16
summary(model_3)
##
## Call:
## lm(formula = distance ~ speed_ground + speed_air, data = faa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -819.74 -202.02 3.52 211.25 636.25
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5462.28 207.48 -26.327 < 2e-16 ***
## speed_ground -14.37 12.68 -1.133 0.258
## speed_air 93.96 12.89 7.291 6.99e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 200 degrees of freedom
## (628 observations deleted due to missingness)
## Multiple R-squared: 0.8883, Adjusted R-squared: 0.8871
## F-statistic: 795 on 2 and 200 DF, p-value: < 2.2e-16
Because the two variables are highly correlated, regressing on both together makes speed_ground insignificant (p = 0.258) and flips the sign of its coefficient. The variation it explained is already captured by speed_air, so it no longer contributes.
cor.test(faa$speed_ground, faa$speed_air, method = "pearson")$estimate
## cor
## 0.9879383
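The two speed variables have a correlation of about 0.99. I keep speed_air in the model because model 2 (speed_air) has a higher R^2 and adjusted R^2 than model 1 (speed_ground): 0.8875 vs 0.7504. The two models are fitted on different samples (203 vs 831 observations, because speed_air is missing for 628 rows), so this comparison is only indicative. With that caveat, I consider speed_air the more useful predictor.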
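One common way to quantify this collinearity is the variance inflation factor. With exactly two predictors, the VIF of each equals 1 / (1 - r^2), where r is their correlation. The sketch below plugs in the correlation printed above; the function name is an assumption.

```r
# VIF for a model with exactly two predictors whose correlation is r.
vif_two_predictors <- function(r) {
  1 / (1 - r^2)
}

# Correlation between speed_ground and speed_air reported above.
r_speed <- 0.9879383
vif_speed <- vif_two_predictors(r_speed)
```

A frequently quoted rule of thumb treats a VIF above 10 as a sign of problematic collinearity, and the value here is well beyond that.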
Step 17. R squared vs No. of variables
model0 <- lm(distance ~ 1,data=faa)
model1 <- lm(distance ~ speed_air,data=faa)
model2 <- lm(distance ~ speed_air + aircraft,data=faa)
model3 <- lm(distance ~ speed_air + aircraft + height ,data=faa)
model4 <- lm(distance ~ speed_air + aircraft + height + no_pasg ,data=faa)
model5 <- lm(distance ~ speed_air + aircraft + height + no_pasg + duration_corrected ,data=faa)
model6 <- lm(distance ~ speed_air + aircraft + height + no_pasg + duration_corrected + pitch,data=faa)
model0.rsqr <- summary(model0)$r.squared
model1.rsqr <- summary(model1)$r.squared
model2.rsqr <- summary(model2)$r.squared
model3.rsqr <- summary(model3)$r.squared
model4.rsqr <- summary(model4)$r.squared
model5.rsqr <- summary(model5)$r.squared
model6.rsqr <- summary(model6)$r.squared
rsquare <- cbind(c(model0.rsqr, model1.rsqr, model2.rsqr, model3.rsqr, model4.rsqr, model5.rsqr, model6.rsqr), 0:6)
colnames(rsquare) <- c("rsquare","variables")
rsquare <- as.data.frame(rsquare)
rsquare %>%
ggplot(aes(x = variables, y = rsquare)) +
geom_line() +
xlab("no. of variables") +
ylab("R-square") +
theme_classic()
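R^2 increases with every added variable. It can never decrease when a predictor is added to a nested model, so on its own it is not a useful criterion for choosing the model size.
Step 18. Adjusted R squared vs No. of variables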
model0.rsqr <- summary(model0)$adj.r.squared
model1.rsqr <- summary(model1)$adj.r.squared
model2.rsqr <- summary(model2)$adj.r.squared
model3.rsqr <- summary(model3)$adj.r.squared
model4.rsqr <- summary(model4)$adj.r.squared
model5.rsqr <- summary(model5)$adj.r.squared
model6.rsqr <- summary(model6)$adj.r.squared
rsquare <- cbind(c(model0.rsqr, model1.rsqr, model2.rsqr, model3.rsqr, model4.rsqr, model5.rsqr, model6.rsqr), 0:6)
colnames(rsquare) <- c("adj_rsquare","variables")
rsquare <- as.data.frame(rsquare)
rsquare %>%
ggplot(aes(x = variables, y = adj_rsquare)) +
geom_line() +
xlab("no. of variables") +
ylab("Adjusted R-square") +
theme_classic()
We see that Adjusted R^2 increases initially but then it slowly starts declining after 3 variables have been added.
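The seven nested models and their fit statistics can be produced in a loop. This is a sketch on simulated data; the variable names, the seed and the true model are assumptions made for illustration.

```r
# Simulated data: only x1 truly drives y.
set.seed(42)
n <- 60
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 3 * toy$x1 + rnorm(n)

# Add predictors one at a time, starting from the intercept-only model.
terms_in_order <- c("x1", "x2", "x3")
fits <- lapply(0:length(terms_in_order), function(k) {
  rhs <- if (k == 0) "1" else paste(terms_in_order[seq_len(k)], collapse = " + ")
  lm(as.formula(paste("y ~", rhs)), data = toy)
})

# Collect R^2, adjusted R^2 and AIC for each model size.
comparison <- data.frame(
  variables     = 0:length(terms_in_order),
  r_squared     = sapply(fits, function(f) summary(f)$r.squared),
  adj_r_squared = sapply(fits, function(f) summary(f)$adj.r.squared),
  aic           = sapply(fits, AIC)
)
```

Keeping the three criteria in one data frame also makes it easy to draw the plots from a single object.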
Step 19. Model with AIC values
model0_AIC <- AIC(model0)
model1_AIC <- AIC(model1)
model2_AIC <- AIC(model2)
model3_AIC <- AIC(model3)
model4_AIC <- AIC(model4)
model5_AIC <- AIC(model5)
model6_AIC <- AIC(model6)
aic_tbl <- cbind(c(model0_AIC, model1_AIC, model2_AIC, model3_AIC, model4_AIC, model5_AIC, model6_AIC),0:6) # named aic_tbl so the AIC() function is not masked
colnames(aic_tbl) <- c("AIC","variables")
aic_tbl <- as.data.frame(aic_tbl)
aic_tbl %>%
ggplot(aes(x = variables, y = AIC)) +
geom_line() +
xlab("no. of variables")+
ylab("AIC") +
theme_classic()
The smaller the AIC, the better the model, so AIC falls as informative variables are added. After 3 variables, however, the decrease becomes small, and AIC then starts to rise.
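AIC values are only comparable between models fitted to the same observations. Because `lm` silently drops rows with missing predictors, an intercept-only model and a model that uses speed_air can end up on different samples. A sketch on made-up data (the frame and the object names are assumptions):

```r
# Toy data with missing air speed; the values are illustrative only.
toy <- data.frame(
  distance  = c(1000, 1500, 2100, 2400, 3100, 3600),
  speed_air = c(NA, NA, 90, 92, 101, 112)
)

# The intercept-only fit uses every row; the speed_air fit drops the NA rows.
fit0 <- lm(distance ~ 1, data = toy)
fit1 <- lm(distance ~ speed_air, data = toy)
n_used <- c(nobs(fit0), nobs(fit1))

# Refitting the null model on the complete cases puts both on the same sample.
complete <- toy[!is.na(toy$speed_air), ]
fit0_cc <- lm(distance ~ 1, data = complete)
```

Checking `nobs()` for each model before comparing AIC values is a cheap safeguard.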
Step 20. Suitable model
In my view, the most suitable model is the three-variable one:
model_final <- lm(distance ~ speed_air + aircraft + height ,data=faa) # separate name so the model4 from Step 17 is not overwritten
summary(model_final)
##
## Call:
## lm(formula = distance ~ speed_air + aircraft + height, data = faa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -300.74 -94.78 15.47 97.09 330.41
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6390.376 109.839 -58.18 <2e-16 ***
## speed_air 82.148 0.976 84.17 <2e-16 ***
## aircraftboeing 427.442 19.173 22.29 <2e-16 ***
## height 13.702 1.007 13.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 134.2 on 199 degrees of freedom
## (628 observations deleted due to missingness)
## Multiple R-squared: 0.9737, Adjusted R-squared: 0.9733
## F-statistic: 2458 on 3 and 199 DF, p-value: < 2.2e-16
The R^2 and adjusted R^2 are close to each other, at about 0.97, which indicates a good fit.
Step 21. Using stepAIC to perform forward variable selection
Since stepAIC does not work with NA values, I remove the rows containing NA values before running the algorithm.
faaNoNA <- na.exclude(faa)
#nrow(faaNoNA)
model01 <- lm(distance ~ 1,data=faaNoNA)
model11 <- lm(distance ~ speed_air,data=faaNoNA)
model21 <- lm(distance ~ speed_air + aircraft,data=faaNoNA)
model31 <- lm(distance ~ speed_air + aircraft + height ,data=faaNoNA)
model41 <- lm(distance ~ speed_air + aircraft + height + no_pasg ,data=faaNoNA)
model51 <- lm(distance ~ speed_air + aircraft + height + no_pasg + duration_corrected ,data=faaNoNA)
model61 <- lm(distance ~ speed_air + aircraft + height + no_pasg + duration_corrected + pitch,data=faaNoNA)
MASS::stepAIC(model01,direction="forward",scope=list(upper=model61,lower=model01))
## Start: AIC=2725.93
## distance ~ 1
##
## Df Sum of Sq RSS AIC
## + speed_air 1 121121738 15346230 2284.3
## + aircraft 1 4417704 132050265 2721.2
## <none> 136467968 2725.9
## + height 1 633675 135834293 2727.0
## + duration_corrected 1 353842 136114126 2727.4
## + pitch 1 244706 136223262 2727.6
## + no_pasg 1 217724 136250245 2727.6
##
## Step: AIC=2284.33
## distance ~ speed_air
##
## Df Sum of Sq RSS AIC
## + aircraft 1 8424832 6921399 2124.7
## + height 1 2803377 12542854 2245.4
## + pitch 1 860427 14485804 2274.6
## + no_pasg 1 159095 15187136 2284.2
## <none> 15346230 2284.3
## + duration_corrected 1 11938 15334292 2286.2
##
## Step: AIC=2124.7
## distance ~ speed_air + aircraft
##
## Df Sum of Sq RSS AIC
## + height 1 3335147 3586252 1993.2
## <none> 6921399 2124.7
## + duration_corrected 1 61095 6860304 2124.9
## + no_pasg 1 56195 6865204 2125.0
## + pitch 1 2584 6918815 2126.6
##
## Step: AIC=1993.22
## distance ~ speed_air + aircraft + height
##
## Df Sum of Sq RSS AIC
## + no_pasg 1 53331 3532921 1992.2
## <none> 3586252 1993.2
## + duration_corrected 1 12912 3573340 1994.5
## + pitch 1 174 3586078 1995.2
##
## Step: AIC=1992.18
## distance ~ speed_air + aircraft + height + no_pasg
##
## Df Sum of Sq RSS AIC
## <none> 3532921 1992.2
## + duration_corrected 1 9532.8 3523389 1993.6
## + pitch 1 333.8 3532588 1994.2
##
## Call:
## lm(formula = distance ~ speed_air + aircraft + height + no_pasg,
## data = faaNoNA)
##
## Coefficients:
## (Intercept) speed_air aircraftboeing height
## -6248.753 82.131 425.592 13.696
## no_pasg
## -2.316
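The same forward search can be sketched with `step()` from base R's stats package, which is used here in place of `MASS::stepAIC` so that the example needs no extra package. The data are simulated and every name in the block is an assumption.

```r
# Simulated data: y depends on x1 and x2, while noise is irrelevant.
set.seed(7)
n <- 80
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- 5 * toy$x1 + 2 * toy$x2 + rnorm(n)

# Start from the intercept-only model and add terms while AIC improves.
null_fit <- lm(y ~ 1, data = toy)
selected <- step(
  null_fit,
  scope     = list(lower = ~ 1, upper = ~ x1 + x2 + noise),
  direction = "forward",
  trace     = 0
)

# Terms retained by the search.
selected_terms <- attr(terms(selected), "term.labels")
```

Giving the scope as formulas avoids fitting the largest model by hand just to define the upper bound.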
I observe that automatic variable selection also retains the number of passengers.
So the final model obtained here has the same form as model4 in Step 17, i.e.
Distance = -6248.753 + 82.131 (speed_air) + 425.592 (aircraftboeing) + 13.696 (height) - 2.316 (no_pasg)
From this, we can also conclude that there is no single criterion for selecting a model, and no single “best model”. Because we chose AIC as the selection criterion, no_pasg is included in the model. If I instead select variables by p-value, no_pasg turns out to be non-significant.
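As a worked example, the coefficients printed by stepAIC above can be wrapped in a small function. The function name and the input values are hypothetical. `boeing` is 1 for a Boeing and 0 for an Airbus, matching the `aircraftboeing` dummy in the output.

```r
# Predicted landing distance from the stepAIC coefficients printed above.
predict_distance <- function(speed_air, boeing, height, no_pasg) {
  -6248.753 +
    82.131 * speed_air +
    425.592 * boeing +
    13.696 * height -
    2.316 * no_pasg
}

# Hypothetical flight: same conditions for both aircraft makes.
pred_boeing <- predict_distance(speed_air = 100, boeing = 1, height = 30, no_pasg = 60)
pred_airbus <- predict_distance(speed_air = 100, boeing = 0, height = 30, no_pasg = 60)
make_gap <- pred_boeing - pred_airbus
```

Since the model was fitted only on rows where speed_air is recorded (minimum 90), predictions for lower air speeds would be extrapolations.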