Background
Data set: FAA Flight landing
Motivation: To reduce the risk of landing overrun
Goal: To study the factors and their impact on the landing distance for commercial flights
Variable Dictionary
Aircraft: The make of an aircraft (Boeing or Airbus).
Duration (in minutes): Flight duration between taking off and landing. The duration of a normal flight should always be greater than 40min.
No_pasg: The number of passengers on a flight.
Speed_ground (in miles per hour): The ground speed of an aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal.
Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal.
Height (in meters): The height of an aircraft when it is passing over the threshold of the runway. The landing aircraft is required to be at least 6 meters high at the threshold of the runway.
Pitch (in degrees): Pitch angle of an aircraft when it is passing over the threshold of the runway.
Distance (in feet): The landing distance of an aircraft. More specifically, it refers to the distance between the threshold of the runway and the point where the aircraft can be fully stopped. The length of the airport runway is typically less than 6000 feet.
Reading Data
library(readxl)  # provides read_xls()
faa1 = read_xls("./data/FAA1.xls")
faa2 = read_xls("./data/FAA2.xls")
str(faa1)
## tibble[,8] [800 x 8] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:800] "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num [1:800] 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num [1:800] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:800] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:800] 109 103 NA NA NA ...
## $ height : num [1:800] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:800] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:800] 3370 2988 1145 1664 1050 ...
str(faa2)
## tibble[,7] [150 x 7] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:150] "boeing" "boeing" "boeing" "boeing" ...
## $ no_pasg : num [1:150] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:150] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:150] 109 103 NA NA NA ...
## $ height : num [1:150] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:150] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:150] 3370 2988 1145 1664 1050 ...
The second data set has one fewer feature: it lacks duration. Dataset 1 has 800 observations with 8 features; Dataset 2 has 150 observations with 7 features.
Merging 2 data sets
faa2$duration = rep(NA, 150)
data = rbind(faa1, faa2)
features = colnames(data)
Checking for duplicate records in the data.
data = data %>% distinct(aircraft,no_pasg,speed_ground,speed_air,height,pitch,distance, .keep_all = T)
data$aircraft = as.factor(data$aircraft)
glimpse(data)
## Rows: 850
## Columns: 8
## $ aircraft <fct> boeing, boeing, boeing, boeing, boeing, boeing, boeing, b~
## $ duration <dbl> 98.47909, 125.73330, 112.01700, 196.82569, 90.09538, 137.~
## $ no_pasg <dbl> 53, 69, 61, 56, 70, 55, 54, 57, 61, 56, 61, 54, 54, 58, 6~
## $ speed_ground <dbl> 107.91568, 101.65559, 71.05196, 85.81333, 59.88853, 75.01~
## $ speed_air <dbl> 109.32838, 102.85141, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ height <dbl> 27.41892, 27.80472, 18.58939, 30.74460, 32.39769, 41.2149~
## $ pitch <dbl> 4.043515, 4.117432, 4.434043, 3.884236, 4.026096, 4.20385~
## $ distance <dbl> 3369.8364, 2987.8039, 1144.9224, 1664.2182, 1050.2645, 16~
The 100 duplicate records were removed. After removing duplicates, the combined data set has 850 observations.
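The merge and de-duplication steps can be sketched on toy data. This is a minimal illustration rather than the report's pipeline: the data frames `a` and `b` and their values are invented, and `dplyr::bind_rows()` is swapped in for the manual NA column plus `rbind()` because it fills columns missing from one input with NA.

```r
library(dplyr)

# Toy stand-ins for faa1 (has duration) and faa2 (no duration); values are invented.
a <- tibble(aircraft = c("boeing", "airbus"),
            duration = c(98.5, 120.0),
            distance = c(3370, 1500))
b <- tibble(aircraft = c("boeing", "airbus"),
            distance = c(3370, 900))   # first row repeats a's first landing

# bind_rows() aligns columns by name and fills the missing duration with NA.
merged <- bind_rows(a, b)

# Deduplicate on the columns both sources share. .keep_all = TRUE keeps the
# first matching row, so the copy that carries a duration survives.
deduped <- merged %>% distinct(aircraft, distance, .keep_all = TRUE)

n_removed <- nrow(merged) - nrow(deduped)
print(deduped)
print(n_removed)
```

Deduplicating on the shared columns only matters here because the duplicate from the second source has no duration, so a comparison across all columns would treat the two copies as different rows.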
summary(data)
## aircraft duration no_pasg speed_ground speed_air
## airbus:450 Min. : 14.76 Min. :29.0 Min. : 27.74 Min. : 90.00
## boeing:400 1st Qu.:119.49 1st Qu.:55.0 1st Qu.: 65.90 1st Qu.: 96.25
## Median :153.95 Median :60.0 Median : 79.64 Median :101.15
## Mean :154.01 Mean :60.1 Mean : 79.45 Mean :103.80
## 3rd Qu.:188.91 3rd Qu.:65.0 3rd Qu.: 92.06 3rd Qu.:109.40
## Max. :305.62 Max. :87.0 Max. :141.22 Max. :141.72
## NA's :50 NA's :642
## height pitch distance
## Min. :-3.546 Min. :2.284 Min. : 34.08
## 1st Qu.:23.314 1st Qu.:3.642 1st Qu.: 883.79
## Median :30.093 Median :4.008 Median :1258.09
## Mean :30.144 Mean :4.009 Mean :1526.02
## 3rd Qu.:36.993 3rd Qu.:4.377 3rd Qu.:1936.95
## Max. :59.946 Max. :5.927 Max. :6533.05
##
Observations on Dataset
- The two data sets have different numbers of features (the second lacks duration)
- The observations are roughly balanced across the two aircraft makes (450 Airbus, 400 Boeing)
- The combined data has 850 observations and 7 potential features
- Speed_air is missing in about 75% of records (642 of 850), which limits its usefulness
- Some values fall outside the permissible ranges and need to be cleaned
Understanding data
From the data description, every variable has a permissible range. If a record has any value outside this range, then it is considered as an abnormal observation. We’ll be removing all the abnormal observations from the data.
data = data %>% filter(
(duration >= 40 | is.na(duration)) &
(speed_ground >= 30 & speed_ground <= 140 | is.na(speed_ground)) &
(height >= 6 | is.na(height)) &
(distance <= 6000| is.na(distance))
)
summary(data)
## aircraft duration no_pasg speed_ground
## airbus:444 Min. : 41.95 Min. :29.00 Min. : 33.57
## boeing:387 1st Qu.:119.63 1st Qu.:55.00 1st Qu.: 66.20
## Median :154.28 Median :60.00 Median : 79.79
## Mean :154.78 Mean :60.06 Mean : 79.54
## 3rd Qu.:189.66 3rd Qu.:65.00 3rd Qu.: 91.91
## Max. :305.62 Max. :87.00 Max. :132.78
## NA's :50
## speed_air height pitch distance
## Min. : 90.00 Min. : 6.228 Min. :2.284 Min. : 41.72
## 1st Qu.: 96.23 1st Qu.:23.530 1st Qu.:3.640 1st Qu.: 893.28
## Median :101.12 Median :30.167 Median :4.001 Median :1262.15
## Mean :103.48 Mean :30.458 Mean :4.005 Mean :1522.48
## 3rd Qu.:109.36 3rd Qu.:37.004 3rd Qu.:4.370 3rd Qu.:1936.63
## Max. :132.91 Max. :59.946 Max. :5.927 Max. :5381.96
## NA's :628
19 Abnormal observations were removed from the data.
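To see which rule is responsible for the removals, each range check can be turned into a flag column and counted. The sketch below uses invented toy values (the data frame `toy` and the flag names are hypothetical), with the same thresholds as the filter above and missing values treated as not flagged.

```r
# Toy landings; values are invented so that each rule is violated once.
toy <- data.frame(
  duration     = c(120, 35, NA, 150, 200),
  speed_ground = c(80, 90, 145, 70, 100),
  height       = c(30, 25, 20, 4, 35),
  distance     = c(1500, 1200, 5000, 900, 6500)
)

# One logical column per rule: TRUE means the row violates it.
# Missing values are treated as "not flagged", matching the filter above.
flags <- with(toy, data.frame(
  short_duration = !is.na(duration) & duration < 40,
  bad_speed      = !is.na(speed_ground) & (speed_ground < 30 | speed_ground > 140),
  low_height     = !is.na(height) & height < 6,
  long_distance  = !is.na(distance) & distance > 6000
))

per_rule   <- colSums(flags)             # rows flagged by each rule
n_abnormal <- sum(rowSums(flags) > 0)    # rows flagged by at least one rule
clean      <- toy[rowSums(flags) == 0, ] # rows that pass every rule

print(per_rule)
print(n_abnormal)
```

Keeping the flags as columns makes it easy to report how many landings each rule removes before committing to the filter.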
Checking for missing data records.
colSums(is.na(data[,2:8]))
## duration no_pasg speed_ground speed_air height pitch
## 50 0 0 628 0 0
## distance
## 0
There are 628 missing values in speed_air and 50 in duration. Because about 75% of speed_air is missing, removing that column is preferable to imputing it. For duration, we impute the missing values with the median.
data = data %>% mutate(duration=ifelse(is.na(duration), median(duration, na.rm=TRUE), duration))
colSums(is.na(data[,2:7]))
## duration no_pasg speed_ground speed_air height pitch
## 0 0 0 628 0 0
str(data)
## tibble[,8] [831 x 8] (S3: tbl_df/tbl/data.frame)
## $ aircraft : Factor w/ 2 levels "airbus","boeing": 2 2 2 2 2 2 2 2 2 2 ...
## $ duration : num [1:831] 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num [1:831] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:831] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:831] 109 103 NA NA NA ...
## $ height : num [1:831] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:831] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:831] 3370 2988 1145 1664 1050 ...
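The median imputation of duration used above can be illustrated on a short vector. This is a sketch with invented values; the `was_missing` indicator is an addition not present in the report, kept so that imputed rows remain identifiable later.

```r
# Toy durations in minutes; values are invented.
duration <- c(98.5, NA, 112.0, 196.8, NA, 90.1)

was_missing <- is.na(duration)              # remember which entries were imputed
med <- median(duration, na.rm = TRUE)       # median of the observed values only
imputed <- ifelse(was_missing, med, duration)

# Filling with a constant shrinks the spread of the variable.
var_before <- var(duration, na.rm = TRUE)
var_after  <- var(imputed)

print(imputed)
print(c(before = var_before, after = var_after))
```

Median imputation leaves the centre of the distribution in place but reduces its variance, which is worth keeping in mind when interpreting the duration coefficient.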
Distributions of features and landing distance.
data[,-1] %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(aes(y = stat(density)), bins = 10) +
geom_density(col = "red", size = 1)
Observations on cleaned dataset
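- We have a total of 831 observations and 6 potential features (ignoring speed_air)
- The 50 missing values in the duration column were imputed with the median
- Landing distance is right-skewed rather than normally distributed: its mean (1522 ft) is well above its median (1262 ft), and its maximum is 5382 ft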
Correlation Analysis of the data.
SLR Model with all variables and landing distance
cols = colnames(data)
table2 = data.frame(matrix(ncol = 4, nrow = 0))
colnames(table2) = c("variable","coef", "P_Val", "Direction")
for(i in c(1:7)){
fm = paste("distance ~", cols[i])
model = lm(fm, data= data)
modelSummary = summary(model)
direction = ifelse(modelSummary$coefficients[2,1] > 0, "Positive", "Negative")
pval = modelSummary$coefficients[2,4]
coef_val = modelSummary$coefficients[2,1]
table2[nrow(table2)+1,] = c(cols[i], abs(coef_val), pval, direction)
}
print(table2)
## variable coef P_Val Direction
## 1 aircraft 427.666335349148 3.52619444130319e-12 Positive
## 2 duration 0.957383575177885 0.149327704730143 Negative
## 3 no_pasg 2.12458597966199 0.609252001778225 Negative
## 4 speed_ground 41.4421888175468 4.76637093869623e-252 Positive
## 5 speed_air 79.5320952653839 2.50046087022476e-97 Positive
## 6 height 9.10656853901242 0.00412385985399162 Positive
## 7 pitch 148.141877548958 0.0120812424909119 Positive
Standardizing data
data_sd = data %>% mutate(across(where(is.numeric), ~ (. - mean(., na.rm = T))/ sd(., na.rm = T)) )
data_sd[,-1] %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(aes(y = stat(density)), bins = 10) +
geom_density(col = "red", size = 1)
All numeric variables, including distance, are standardized to have zero mean and unit variance.
Modeling with the standardized data
cols = colnames(data)
table3 = data.frame(matrix(ncol = 4, nrow = 0))
colnames(table3) = c("variable", "coef", "pval", "Direction")
for(i in c(1:7)){
fm = paste("distance ~", cols[i])
model = lm(fm, data= data_sd)
modelSummary = summary(model)
coef_val = modelSummary$coefficients[2,1]
pval = modelSummary$coefficients[2,4]
modeldirection = ifelse(coef_val > 0.0, "Positive", "Negative")
table3[nrow(table3)+1,] = c( cols[i], abs(coef_val), pval, modeldirection)
}
print(table3)
## variable coef pval Direction
## 1 aircraft 0.477126109401757 3.52619444130326e-12 Positive
## 2 duration 0.0500633017014816 0.149327704730145 Negative
## 3 no_pasg 0.0177566313292344 0.609252001778232 Negative
## 4 speed_ground 0.866243834797758 4.76637093869894e-252 Positive
## 5 speed_air 0.863900011442723 2.5004608702254e-97 Positive
## 6 height 0.0994112051009979 0.0041238598539917 Positive
## 7 pitch 0.0870284580029236 0.012081242490912 Positive
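When both the response and a single numeric predictor are standardized, the regression slope equals their Pearson correlation, which is why the standardized speed_ground coefficient (0.866) squares to the R-squared of that model (0.7504). The sketch below checks this identity on the built-in `mtcars` data, used only as a stand-in for the FAA data.

```r
# mtcars is a stand-in data set; wt and mpg play the roles of predictor and response.
x <- mtcars$wt
y <- mtcars$mpg

# Slope on the original scale.
raw_slope <- unname(coef(lm(y ~ x))[2])

# Slope after standardizing both variables to mean 0 and sd 1.
zx <- as.numeric(scale(x))
zy <- as.numeric(scale(y))
std_slope <- unname(coef(lm(zy ~ zx))[2])

r <- cor(x, y)

# The standardized slope is the correlation, and also the raw slope rescaled.
print(c(std_slope = std_slope, r = r, rescaled = raw_slope * sd(x) / sd(y)))
```

The identity does not apply to aircraft, which is a factor and is not standardized: its coefficient is the difference between the two makes in standard deviations of distance.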
Checking for the factors that are highly affecting landing distance.
table0 = table3 %>% arrange(desc(coef))
print(table0)
## variable coef pval Direction
## 1 speed_ground 0.866243834797758 4.76637093869894e-252 Positive
## 2 speed_air 0.863900011442723 2.5004608702254e-97 Positive
## 3 aircraft 0.477126109401757 3.52619444130326e-12 Positive
## 4 height 0.0994112051009979 0.0041238598539917 Positive
## 5 pitch 0.0870284580029236 0.012081242490912 Positive
## 6 duration 0.0500633017014816 0.149327704730145 Negative
## 7 no_pasg 0.0177566313292344 0.609252001778232 Negative
The tables show similar results, so we use table3 to build table0. The table above ranks the candidate predictors of landing distance by the absolute size of their standardized coefficient.
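One caveat about the loops above: building each row with `c()` mixes text and numbers, so every column is stored as character, and sorting then compares strings. That happens to give the right order for table3 because every coefficient starts with "0.", but it would misorder values such as "9.1" and "41.4". A sketch that keeps the columns numeric is shown below on the built-in `mtcars` data; the predictor names and the response are stand-ins, not the FAA variables.

```r
# Stand-in predictors and response from mtcars.
predictors <- c("wt", "hp", "qsec")

# Fit one simple regression per predictor and keep numeric columns numeric.
rows <- lapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "mpg"), data = mtcars)
  est <- summary(fit)$coefficients[2, ]   # slope row: Estimate, Std. Error, t, p
  data.frame(
    variable  = v,
    coef      = abs(est[["Estimate"]]),
    pval      = est[["Pr(>|t|)"]],
    direction = ifelse(est[["Estimate"]] > 0, "Positive", "Negative")
  )
})

tab <- do.call(rbind, rows)
tab <- tab[order(-tab$coef), ]   # numeric sort, largest effect first
print(tab)
```

Because each row is its own data frame, `coef` and `pval` stay numeric and the sort no longer depends on how the numbers are printed.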
Checking for Collinear features
model1 = lm(distance~ speed_air, data = data)
summary(model1)
##
## Call:
## lm(formula = distance ~ speed_air, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -776.21 -196.39 8.72 209.17 624.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5455.709 207.547 -26.29 <2e-16 ***
## speed_air 79.532 1.997 39.83 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.3 on 201 degrees of freedom
## (628 observations deleted due to missingness)
## Multiple R-squared: 0.8875, Adjusted R-squared: 0.887
## F-statistic: 1586 on 1 and 201 DF, p-value: < 2.2e-16
model2 = lm(distance~ speed_ground, data = data)
summary(model2)
##
## Call:
## lm(formula = distance ~ speed_ground, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -897.09 -319.16 -72.09 210.83 1798.88
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1773.9407 67.8388 -26.15 <2e-16 ***
## speed_ground 41.4422 0.8302 49.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 448.1 on 829 degrees of freedom
## Multiple R-squared: 0.7504, Adjusted R-squared: 0.7501
## F-statistic: 2492 on 1 and 829 DF, p-value: < 2.2e-16
model3 = lm(distance~ speed_ground + speed_air, data = data)
summary(model3)
##
## Call:
## lm(formula = distance ~ speed_ground + speed_air, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -819.74 -202.02 3.52 211.25 636.25
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5462.28 207.48 -26.327 < 2e-16 ***
## speed_ground -14.37 12.68 -1.133 0.258
## speed_air 93.96 12.89 7.291 6.99e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 200 degrees of freedom
## (628 observations deleted due to missingness)
## Multiple R-squared: 0.8883, Adjusted R-squared: 0.8871
## F-statistic: 795 on 2 and 200 DF, p-value: < 2.2e-16
cor.test(data$speed_ground, data$speed_air)
##
## Pearson's product-moment correlation
##
## data: data$speed_ground and data$speed_air
## t = 90.453, df = 201, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9841163 0.9908449
## sample estimates:
## cor
## 0.9879383
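The two speed variables are highly correlated (r = 0.99). In the joint model the speed_ground coefficient turns negative and loses significance (p = 0.258), a typical sign of collinearity. We remove speed_air because it has far more missing values.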
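With only two predictors, the variance inflation factor follows directly from their correlation as 1 / (1 - r^2). The short sketch below plugs in the correlation reported by `cor.test` above; the threshold of 10 mentioned in the comment is a common rule of thumb, not something stated in this report.

```r
# Pearson correlation between speed_ground and speed_air, taken from the output above.
r <- 0.9879383

# For a model with exactly two predictors, both share the same VIF.
vif <- 1 / (1 - r^2)

# A VIF above roughly 10 is often read as problematic collinearity (rule of thumb).
print(round(vif, 1))
```

A value of about 42 means the variance of each speed coefficient is inflated roughly 42-fold when both are in the model, consistent with the large standard errors in model3.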
Variable selection
table0 = table0 %>% filter(variable != "speed_air" )
features = table0$variable
models = list()
colSums(is.na(data_sd))
## aircraft duration no_pasg speed_ground speed_air height
## 0 0 0 0 628 0
## pitch distance logDistance make
## 0 0 0 0
modelStats = data.frame(matrix(ncol = 5, nrow = 0 ))
colnames(modelStats) = c("fm", "R-Sqr", "Adj. R-Sqr", "AIC", "BIC")
for(i in 1:length(features)){
fm = paste("distance ~ ", paste(features[1:i], collapse = " + "))
model = lm(fm, data = data_sd)
aicVal =AIC(model)
bicVal = BIC(model)
sm = summary(model)
print(paste("##################", fm, "###################"))
print(sm)
rsq = sm$r.squared
arsq = sm$adj.r.squared
modelStats[i,] = c(fm, rsq, arsq, aicVal, bicVal)
}
## [1] "################## distance ~ speed_ground ###################"
##
## Call:
## lm(formula = fm, data = data_sd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.00084 -0.35607 -0.08043 0.23522 2.00692
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.931e-16 1.734e-02 0.00 1
## speed_ground 8.662e-01 1.735e-02 49.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4999 on 829 degrees of freedom
## Multiple R-squared: 0.7504, Adjusted R-squared: 0.7501
## F-statistic: 2492 on 1 and 829 DF, p-value: < 2.2e-16
##
## [1] "################## distance ~ speed_ground + aircraft ###################"
##
## Call:
## lm(formula = fm, data = data_sd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.83595 -0.28591 -0.07563 0.16779 1.72017
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.25531 0.01988 -12.85 <2e-16 ***
## speed_ground 0.87731 0.01454 60.32 <2e-16 ***
## aircraftboeing 0.54823 0.02914 18.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4187 on 828 degrees of freedom
## Multiple R-squared: 0.8251, Adjusted R-squared: 0.8247
## F-statistic: 1953 on 2 and 828 DF, p-value: < 2.2e-16
##
## [1] "################## distance ~ speed_ground + aircraft + height ###################"
##
## Call:
## lm(formula = fm, data = data_sd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7943 -0.2530 -0.1006 0.1451 1.6421
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.25773 0.01849 -13.94 <2e-16 ***
## speed_ground 0.88632 0.01355 65.41 <2e-16 ***
## aircraftboeing 0.55341 0.02711 20.41 <2e-16 ***
## height 0.15444 0.01354 11.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3894 on 827 degrees of freedom
## Multiple R-squared: 0.8489, Adjusted R-squared: 0.8484
## F-statistic: 1549 on 3 and 827 DF, p-value: < 2.2e-16
##
## [1] "################## distance ~ speed_ground + aircraft + height + pitch ###################"
##
## Call:
## lm(formula = fm, data = data_sd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7997 -0.2500 -0.1040 0.1426 1.6745
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.25005 0.01908 -13.11 <2e-16 ***
## speed_ground 0.88686 0.01354 65.49 <2e-16 ***
## aircraftboeing 0.53693 0.02895 18.55 <2e-16 ***
## height 0.15382 0.01353 11.37 <2e-16 ***
## pitch 0.02327 0.01445 1.61 0.108
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.389 on 826 degrees of freedom
## Multiple R-squared: 0.8494, Adjusted R-squared: 0.8486
## F-statistic: 1164 on 4 and 826 DF, p-value: < 2.2e-16
##
## [1] "################## distance ~ speed_ground + aircraft + height + pitch + duration ###################"
##
## Call:
## lm(formula = fm, data = data_sd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7970 -0.2499 -0.1035 0.1443 1.6757
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.250143 0.019092 -13.102 <2e-16 ***
## speed_ground 0.887016 0.013567 65.381 <2e-16 ***
## aircraftboeing 0.537130 0.028982 18.533 <2e-16 ***
## height 0.153796 0.013542 11.357 <2e-16 ***
## pitch 0.023381 0.014467 1.616 0.106
## duration 0.003177 0.013549 0.234 0.815
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3893 on 825 degrees of freedom
## Multiple R-squared: 0.8494, Adjusted R-squared: 0.8485
## F-statistic: 930.5 on 5 and 825 DF, p-value: < 2.2e-16
##
## [1] "################## distance ~ speed_ground + aircraft + height + pitch + duration + no_pasg ###################"
##
## Call:
## lm(formula = fm, data = data_sd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7636 -0.2503 -0.1016 0.1394 1.6886
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.249825 0.019084 -13.091 <2e-16 ***
## speed_ground 0.887007 0.013560 65.413 <2e-16 ***
## aircraftboeing 0.536446 0.028972 18.516 <2e-16 ***
## height 0.154665 0.013550 11.414 <2e-16 ***
## pitch 0.023123 0.014461 1.599 0.110
## duration 0.002492 0.013551 0.184 0.854
## no_pasg -0.018327 0.013534 -1.354 0.176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3891 on 824 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8486
## F-statistic: 776.5 on 6 and 824 DF, p-value: < 2.2e-16
modelStats = modelStats %>% mutate(across(c(2,3,4,5), as.double))
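# gather() stacks the two columns, so the model index 1..6 repeats once per metric
modelStats[,2:3] %>% gather() %>% ggplot(aes(x = rep(seq_len(nrow(modelStats)), times = 2), y = value, group = key, col = key)) +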
geom_line() +
geom_point() +
ggtitle("Variation in R-Sqr/ Adj.R-Sqr with addition of features") +
ylab("R-Sqr/ Adj.R-Sqr Value") +
xlab("No. of Features") +
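theme( axis.text.x = element_blank())
# gather() stacks the two columns, so the model index 1..6 repeats once per metric
modelStats[,4:5] %>% gather() %>% ggplot(aes(x = rep(seq_len(nrow(modelStats)), times = 2), y = value, group = key, col = key)) +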
geom_line() +
geom_point() +
ggtitle("Variation in AIC/ BIC with addition of features") +
ylab("AIC/BIC") +
xlab("No. of Features") +
theme( axis.text.x = element_blank())
The R-squared and adjusted R-squared curves flatten after three features (R-squared rises from 0.8489 with three features to only 0.8497 with all six), and the AIC and BIC curves show only marginal improvement beyond that point. Landing distance can therefore be predicted from speed_ground, aircraft make, and height. The estimated coefficients for this model, fitted on the unstandardized data, are:
print(summary(lm( distance ~ speed_ground + aircraft + height, data = data)))
##
## Call:
## lm(formula = distance ~ speed_ground + aircraft + height, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -711.95 -226.73 -90.17 130.04 1471.84
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2512.2433 68.1974 -36.84 <2e-16 ***
## speed_ground 42.4024 0.6483 65.41 <2e-16 ***
## aircraftboeing 496.0452 24.2975 20.41 <2e-16 ***
## height 14.1478 1.2405 11.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 349.1 on 827 degrees of freedom
## Multiple R-squared: 0.8489, Adjusted R-squared: 0.8484
## F-statistic: 1549 on 3 and 827 DF, p-value: < 2.2e-16
Model for Landing Distance
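\[ \text{Landing Distance} = -2512.24 + 42.40 \times \text{speed\_ground} + 496.05 \times \text{Aircraft Type(Boeing)} + 14.15 \times \text{height} \]
Here Aircraft Type(Boeing) is 1 for a Boeing aircraft and 0 for an Airbus. The equation predicts landing distance in feet from ground speed in miles per hour and height in meters.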
Feature selection with the stepAIC function.
d = data[,c(1:4, 6:8)]
baseModel = lm(distance ~ 1, data = d)
topModel = lm(distance~., data = d)
AIC(baseModel)
## [1] 13660.08
AIC(topModel)
## [1] 12097.14
library(MASS)  # provides stepAIC()
stepAIC(baseModel, direction = "forward", scope = list(upper = topModel, lower = baseModel))
## Start: AIC=11299.8
## distance ~ 1
##
## Df Sum of Sq RSS AIC
## + speed_ground 1 500382567 166457762 10148
## + aircraft 1 37818390 629021939 11253
## + height 1 6590108 660250221 11294
## + pitch 1 5050617 661789712 11296
## + duration 1 1671325 665169005 11300
## <none> 666840329 11300
## + no_pasg 1 210253 666630076 11302
##
## Step: AIC=10148.53
## distance ~ speed_ground
##
## Df Sum of Sq RSS AIC
## + aircraft 1 49848656 116609106 9854.8
## + height 1 14916377 151541385 10072.5
## + pitch 1 9765095 156692668 10100.3
## <none> 166457762 10148.5
## + no_pasg 1 207528 166250234 10149.5
## + duration 1 49785 166407977 10150.3
##
## Step: AIC=9854.77
## distance ~ speed_ground + aircraft
##
## Df Sum of Sq RSS AIC
## + height 1 15848830 100760276 9735.4
## + pitch 1 455453 116153653 9853.5
## <none> 116609106 9854.8
## + no_pasg 1 87171 116521935 9856.1
## + duration 1 8444 116600662 9856.7
##
## Step: AIC=9735.37
## distance ~ speed_ground + aircraft + height
##
## Df Sum of Sq RSS AIC
## + pitch 1 315259 100445017 9734.8
## <none> 100760276 9735.4
## + no_pasg 1 232003 100528273 9735.5
## + duration 1 3971 100756305 9737.3
##
## Step: AIC=9734.77
## distance ~ speed_ground + aircraft + height + pitch
##
## Df Sum of Sq RSS AIC
## <none> 100445017 9734.8
## + no_pasg 1 225608 100219409 9734.9
## + duration 1 6693 100438324 9736.7
##
## Call:
## lm(formula = distance ~ speed_ground + aircraft + height + pitch,
## data = d)
##
## Coefficients:
## (Intercept) speed_ground aircraftboeing height pitch
## -2664.32 42.43 481.27 14.09 39.61
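stepAIC selects speed_ground, aircraft, height, and pitch. Adding pitch lowers the AIC only marginally (from 9735.4 to 9734.8), and pitch was not significant in the four-variable model fitted earlier (p = 0.108), so we keep the three-variable model with speed_ground, aircraft, and height.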
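The starting AIC printed by stepAIC (11299.8) differs from AIC(baseModel) (13660.08) because stepAIC relies on `extractAIC()`, which for linear models drops the constant n(1 + log(2*pi)) and does not count the error variance as a parameter. The gap is therefore n(1 + log(2*pi)) + 2, which for n = 831 is about 2360.28, matching the difference seen above. The sketch below verifies the identity on the built-in `mtcars` data, used only as a stand-in.

```r
# Intercept-only model on a stand-in data set.
fit <- lm(mpg ~ 1, data = mtcars)
n <- nobs(fit)

aic_full <- AIC(fit)            # full log-likelihood, counts sigma as a parameter
aic_step <- extractAIC(fit)[2]  # n * log(RSS / n) + 2 * (number of coefficients)

# Constant offset between the two definitions for a Gaussian linear model.
offset <- n * (1 + log(2 * pi)) + 2

print(c(AIC = aic_full, extractAIC = aic_step, offset = offset))

# Same arithmetic for the report's sample size.
offset_faa <- 831 * (1 + log(2 * pi)) + 2
print(round(offset_faa, 2))
```

Since the offset is the same for every model fitted to the same rows, both definitions rank models identically; only the absolute values differ.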
Final Model
print(summary(lm( distance ~ speed_ground + aircraft + height, data = data)))
##
## Call:
## lm(formula = distance ~ speed_ground + aircraft + height, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -711.95 -226.73 -90.17 130.04 1471.84
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2512.2433 68.1974 -36.84 <2e-16 ***
## speed_ground 42.4024 0.6483 65.41 <2e-16 ***
## aircraftboeing 496.0452 24.2975 20.41 <2e-16 ***
## height 14.1478 1.2405 11.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 349.1 on 827 degrees of freedom
## Multiple R-squared: 0.8489, Adjusted R-squared: 0.8484
## F-statistic: 1549 on 3 and 827 DF, p-value: < 2.2e-16
\[ \text{Landing Distance} = -2512.24 + 42.40 \times \text{speed\_ground} + 496.05 \times \text{Aircraft Type(Boeing)} + 14.15 \times \text{height} \]
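The final equation can be wrapped in a small helper for quick what-if calculations. This is a sketch: the function name `predict_distance` is hypothetical, the coefficients are copied from the summary above, and the result should only be trusted inside the ranges seen in the cleaned data (ground speed of roughly 34 to 133 mph, height of roughly 6 to 60 meters).

```r
# Hypothetical helper: applies the fitted coefficients from the final model.
# speed_ground in mph, height in meters, result in feet.
predict_distance <- function(speed_ground, aircraft, height) {
  is_boeing <- as.numeric(tolower(aircraft) == "boeing")  # 1 for Boeing, 0 for Airbus
  -2512.2433 + 42.4024 * speed_ground + 496.0452 * is_boeing + 14.1478 * height
}

# Same approach speed and height, different aircraft make.
boeing <- predict_distance(100, "boeing", 30)
airbus <- predict_distance(100, "airbus", 30)

print(round(c(boeing = boeing, airbus = airbus), 1))
```

At 100 mph and 30 meters the sketch gives about 2648 feet for a Boeing and about 2152 feet for an Airbus; the gap is the aircraft coefficient. When the fitted lm object is available, calling `predict()` on it is preferable because it also provides interval estimates.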