Background: Flight landing.
Motivation: To reduce the risk of landing overrun.
Goal: To study which factors affect the landing distance of a commercial flight, and how.
Data: Landing data (landing distance and other parameters) from 950 commercial flights (not real data set but simulated from statistical models).
Variable dictionary:
| Variable_Name | Description |
|---|---|
| Aircraft | The make of an aircraft (Boeing or Airbus) |
| Duration (in minutes) | Flight duration between taking off and landing. The duration of a normal flight should always be greater than 40min |
| No_pasg | The number of passengers in a flight |
| Speed_ground (in miles per hour) | The ground speed of an aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal |
| Speed_air (in miles per hour) | The air speed of an aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal |
| Height (in meters) | The height of an aircraft when it is passing over the threshold of the runway. The landing aircraft is required to be at least 6 meters high at the threshold of the runway |
| Pitch (in degrees) | Pitch angle of an aircraft when it is passing over the threshold of the runway |
| Distance (in feet) | The landing distance of an aircraft. More specifically, it refers to the distance between the threshold of the runway and the point where the aircraft can be fully stopped. The length of the airport runway is typically less than 6000 feet |
Loading the packages required:
library(readxl)
library(dplyr)
library(purrr)
library(tidyr)
library(ggplot2)
library(knitr)
library(kableExtra)
library(hrbrthemes)
Loading the dataset and checking the structure of the datasets:
FAA1 <- read_xls("C:\\Users\\arunp\\Desktop\\UC\\ACADEMICS\\SEMESTER 2\\7042-STATISTICAL MODELLING\\FAA1.xls")
FAA2 <- read_xls("C:\\Users\\arunp\\Desktop\\UC\\ACADEMICS\\SEMESTER 2\\7042-STATISTICAL MODELLING\\FAA2.xls")
#structure of data
str(FAA1)
## tibble [800 x 8] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:800] "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num [1:800] 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num [1:800] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:800] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:800] 109 103 NA NA NA ...
## $ height : num [1:800] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:800] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:800] 3370 2988 1145 1664 1050 ...
str(FAA2)
## tibble [150 x 7] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:150] "boeing" "boeing" "boeing" "boeing" ...
## $ no_pasg : num [1:150] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:150] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:150] 109 103 NA NA NA ...
## $ height : num [1:150] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:150] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:150] 3370 2988 1145 1664 1050 ...
The dataset FAA1 has 800 observations and 8 variables. The dataset FAA2 has 150 observations and 7 variables; it does not contain the duration of the flights.
Merging the two datasets into a single dataset:
#merging datasets
FAA_merged <- bind_rows(FAA1,FAA2)
Let us now check for duplicate observations in the data.
#check for duplicates
duplicate_obs1 <- duplicated(FAA_merged %>% select(-duration))
print(paste("There are" ,sum(duplicate_obs1),"duplicate observations in the dataset."))
## [1] "There are 100 duplicate observations in the dataset."
We will drop these duplicate observations from our dataset.
#dropping duplicates
# keep only the rows that are not flagged as duplicates
FAA <- FAA_merged[!duplicate_obs1, ]
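The duplicate check can be illustrated on a toy data frame. The sketch below is only an illustration: the data frame `toy` and its values are made up. It flags rows that repeat an earlier row once the `duration` column is ignored, and drops them with a logical mask.

```r
# Toy data: rows 1 and 3 agree on every column except duration (NA in row 3),
# mimicking a record that appears in both source files.
toy <- data.frame(
  aircraft = c("boeing", "airbus", "boeing"),
  duration = c(98.5, 125.7, NA),
  no_pasg  = c(53, 69, 53),
  distance = c(3370, 2988, 3370)
)

# Flag rows that duplicate an earlier row, ignoring the duration column.
is_dup <- duplicated(toy[, names(toy) != "duration"])

# Keep the first occurrence of each record.
toy_unique <- toy[!is_dup, ]
print(toy_unique)
```

Indexing with `!is_dup` also behaves sensibly when no duplicates exist, whereas negative indexing with an empty `which()` result would return zero rows.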
Now, let us check the structure of the dataset again.
str(FAA)
## tibble [850 x 8] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:850] "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num [1:850] 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num [1:850] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:850] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:850] 109 103 NA NA NA ...
## $ height : num [1:850] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:850] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:850] 3370 2988 1145 1664 1050 ...
There are 850 observations and 8 variables in our dataset.
Let us check the summary statistics of each variable to have a better understanding.
summary(FAA)
## aircraft duration no_pasg speed_ground
## Length:850 Min. : 14.76 Min. :29.0 Min. : 27.74
## Class :character 1st Qu.:119.49 1st Qu.:55.0 1st Qu.: 65.90
## Mode :character Median :153.95 Median :60.0 Median : 79.64
## Mean :154.01 Mean :60.1 Mean : 79.45
## 3rd Qu.:188.91 3rd Qu.:65.0 3rd Qu.: 92.06
## Max. :305.62 Max. :87.0 Max. :141.22
## NA's :50
## speed_air height pitch distance
## Min. : 90.00 Min. :-3.546 Min. :2.284 Min. : 34.08
## 1st Qu.: 96.25 1st Qu.:23.314 1st Qu.:3.642 1st Qu.: 883.79
## Median :101.15 Median :30.093 Median :4.008 Median :1258.09
## Mean :103.80 Mean :30.144 Mean :4.009 Mean :1526.02
## 3rd Qu.:109.40 3rd Qu.:36.993 3rd Qu.:4.377 3rd Qu.:1936.95
## Max. :141.72 Max. :59.946 Max. :5.927 Max. :6533.05
## NA's :642
Initial Summary of variables
From the dictionary of the dataset, we can see the description for normal values of each variable. Let us check for any abnormal values in the dataset and remove them.
attach(FAA)
FAA_normal <- FAA %>%
filter((duration > 40| is.na(duration)) & (speed_ground >= 30) & (speed_ground <= 140) &
((speed_air >= 30) & (speed_air <= 140)|is.na(speed_air)) &
(height >= 6) & (distance < 6000))
dim(FAA_normal)
## [1] 831 8
There are 831 observations in which every variable takes a normal value. We will remove the 19 abnormal observations.
#removing abnormal values
FAA <- FAA_normal
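The role of the `is.na()` terms in the filter may be easier to see on a small example. The following is a sketch on made-up values (the data frame `toy` is hypothetical), written in base R rather than dplyr so that it runs on its own.

```r
# Hypothetical landings; NA marks a missing measurement.
toy <- data.frame(
  duration     = c(30, 120, NA, 150),
  speed_ground = c(80, 145, 90, 100),
  speed_air    = c(NA, 146, NA, 101)
)

# A comparison with NA yields NA, so `| is.na(x)` is needed to keep rows
# whose value is simply missing rather than abnormal.
keep <- (toy$duration > 40 | is.na(toy$duration)) &
  toy$speed_ground >= 30 & toy$speed_ground <= 140 &
  ((toy$speed_air >= 30 & toy$speed_air <= 140) | is.na(toy$speed_air))

toy_normal <- toy[keep, ]
print(toy_normal)
```

Row 1 is dropped for its short duration and row 2 for its ground speed, while rows 3 and 4 are kept even though row 3 has missing values.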
Since about 75% of the values of speed_air are missing (628 of 831), imputing them would not be a good idea, so we leave the missing values in speed_air as they are. The variable duration, on the other hand, has relatively few missing values compared with the sample size. We will impute its missing values with the mean, since duration is one of the variables we want to study.
#imputing duration
FAA <- transform(FAA, duration = ifelse(is.na(duration), mean(duration, na.rm=TRUE),duration))
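As a side note, mean imputation leaves the mean of a variable unchanged but shrinks its spread. A minimal sketch on a made-up vector (the values are hypothetical, not taken from the FAA data):

```r
duration <- c(100, NA, 140, 160, NA)

# Replace each NA with the mean of the observed values.
observed_mean <- mean(duration, na.rm = TRUE)
duration_imputed <- ifelse(is.na(duration), observed_mean, duration)

# The mean is unchanged by this imputation, while the spread shrinks.
c(mean_before = observed_mean, mean_after = mean(duration_imputed))
c(sd_before = sd(duration, na.rm = TRUE), sd_after = sd(duration_imputed))
```

This is one reason to keep mean imputation for variables with only a small share of missing values.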
Now, let us check for structure and summary of the cleaned dataset.
str(FAA)
## 'data.frame': 831 obs. of 8 variables:
## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
The dataset contains 831 observations and 8 variables.
Let us check the summary of the variables.
summary(FAA)
## aircraft duration no_pasg speed_ground
## Length:831 Min. : 41.95 Min. :29.00 Min. : 33.57
## Class :character 1st Qu.:122.67 1st Qu.:55.00 1st Qu.: 66.20
## Mode :character Median :154.78 Median :60.00 Median : 79.79
## Mean :154.78 Mean :60.06 Mean : 79.54
## 3rd Qu.:186.37 3rd Qu.:65.00 3rd Qu.: 91.91
## Max. :305.62 Max. :87.00 Max. :132.78
##
## speed_air height pitch distance
## Min. : 90.00 Min. : 6.228 Min. :2.284 Min. : 41.72
## 1st Qu.: 96.23 1st Qu.:23.530 1st Qu.:3.640 1st Qu.: 893.28
## Median :101.12 Median :30.167 Median :4.001 Median :1262.15
## Mean :103.48 Mean :30.458 Mean :4.005 Mean :1522.48
## 3rd Qu.:109.36 3rd Qu.:37.004 3rd Qu.:4.370 3rd Qu.:1936.63
## Max. :132.91 Max. :59.946 Max. :5.927 Max. :5381.96
## NA's :628
Let us plot the histograms for all numerical variables to have a better understanding about their distribution.
FAA_numeric <- FAA %>% keep(is.numeric)
FAA_numeric %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
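All of the histograms except those for speed_air and distance look approximately normal. For speed_air and distance the mean exceeds the median in the summary above, which points to right-skewed distributions.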
Summary after Data Cleaning
Computing the pairwise correlation between the landing distance and each factor X:
cor(FAA_numeric$distance,FAA_numeric,use = "pairwise.complete.obs")
## duration no_pasg speed_ground speed_air height pitch
## [1,] -0.05026941 -0.01775663 0.8662438 0.9420971 0.09941121 0.08702846
## distance
## [1,] 1
Table 1
| Variable_Name | Size_of_Correlation | Direction_of_correlation |
|---|---|---|
| Speed_Air | 0.9420971 | Positive |
| Speed_Ground | 0.8662438 | Positive |
| Height | 0.0994112 | Positive |
| Pitch | 0.0870285 | Positive |
| Duration | 0.0502694 | Negative |
| No. of Passengers | 0.0177566 | Negative |
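A table like Table 1 can also be assembled directly from the correlations instead of being typed by hand. The sketch below assumes a simulated stand-in data frame `sim` (not the FAA data); only the column names mirror the report.

```r
set.seed(1)
# Simulated stand-in for the numeric columns; not the FAA data.
n <- 200
sim <- data.frame(
  speed_ground = rnorm(n, 80, 18),
  height       = rnorm(n, 30, 10),
  no_pasg      = rnorm(n, 60, 7)
)
sim$distance <- 40 * sim$speed_ground + 14 * sim$height + rnorm(n, 0, 350)

# Correlation of distance with every other column.
predictors <- setdiff(names(sim), "distance")
r <- sapply(predictors, function(v) {
  cor(sim$distance, sim[[v]], use = "pairwise.complete.obs")
})

# One row per predictor, sorted by absolute correlation.
table1 <- data.frame(
  Variable_Name            = predictors,
  Size_of_Correlation      = abs(r),
  Direction_of_correlation = ifelse(r >= 0, "Positive", "Negative"),
  row.names = NULL
)
table1 <- table1[order(-table1$Size_of_Correlation), ]
print(table1)
```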
Let us check for the scatter plots between distance and all other variables.
FAA_numeric %>%
gather(-distance, key = "var", value = "value") %>%
ggplot(aes(x = value, y = distance)) +
geom_point() +
facet_wrap(~ var, scales = "free") +
theme_bw()
The scatter plots are consistent with the correlation values we observed in Table 1.
As the airplane make is a categorical variable, we did not include it in the initial analysis. We will code this character variable as 1 for the make “Boeing” and 0 for the make “Airbus” and include it in the analysis.
FAA <- transform(FAA, aircraft = ifelse(FAA$aircraft=="boeing",1,0))
cor(FAA$distance,FAA,use = "pairwise.complete.obs")
## aircraft duration no_pasg speed_ground speed_air height
## [1,] 0.2381445 -0.05026941 -0.01775663 0.8662438 0.9420971 0.09941121
## pitch distance
## [1,] 0.08702846 1
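One way to read a 0/1 coding: in a simple regression on such a dummy, the slope equals the difference between the two group means. A short sketch on simulated values (the vectors `make` and `distance` here are made up for illustration):

```r
set.seed(6)
# Simulated makes and distances; not the FAA data.
make <- sample(c("boeing", "airbus"), 200, replace = TRUE)
distance <- 1300 + 400 * (make == "boeing") + rnorm(200, sd = 800)

# 1 for Boeing, 0 for Airbus.
is_boeing <- ifelse(make == "boeing", 1, 0)

# In a simple regression on a 0/1 dummy, the slope is the difference
# between the two group means.
slope <- unname(coef(lm(distance ~ is_boeing))[2])
mean_gap <- mean(distance[is_boeing == 1]) - mean(distance[is_boeing == 0])
c(slope = slope, mean_gap = mean_gap)
```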
Let us also update Table 1 with the new variable:
Table 1 (updated)
| Variable_Name | Size_of_Correlation | Direction_of_correlation |
|---|---|---|
| Speed_Air | 0.9420971 | Positive |
| Speed_Ground | 0.8662438 | Positive |
| Aircraft | 0.2381445 | Positive |
| Height | 0.0994112 | Positive |
| Pitch | 0.0870285 | Positive |
| Duration | 0.0502694 | Negative |
| No. of Passengers | 0.0177566 | Negative |
Let us regress Y (landing distance) on each of the X variables and create a summary table with the findings.
model_aircraft <- lm(distance ~ aircraft)
model_duration <- lm(distance ~ duration)
model_nopsg <- lm(distance ~ no_pasg)
model_speed_grd <- lm(distance ~ speed_ground)
model_speed_air <- lm(distance ~ speed_air)
model_height <- lm(distance ~ height)
model_pitch <- lm(distance ~ pitch)
summary(model_aircraft)$coef[2,4]
summary(model_duration)$coef[2,4]
summary(model_nopsg)$coef[2,4]
summary(model_speed_grd)$coef[2,4]
summary(model_speed_air)$coef[2,4]
summary(model_height)$coef[2,4]
summary(model_pitch)$coef[2,4]
summary(model_aircraft)$coef[2,1]
summary(model_duration)$coef[2,1]
summary(model_nopsg)$coef[2,1]
summary(model_speed_grd)$coef[2,1]
summary(model_speed_air)$coef[2,1]
summary(model_height)$coef[2,1]
summary(model_pitch)$coef[2,1]
Table 2
| Variable_Name | Size_of_pvalue | Direction_of_Regressioncoeff |
|---|---|---|
| Speed_Ground | 0.0000000 | Positive |
| Speed_Air | 0.0000000 | Positive |
| Aircraft | 0.0000000 | Positive |
| Height | 0.0041239 | Positive |
| Pitch | 0.0120812 | Positive |
| Duration | 0.1514002 | Negative |
| No. of Passengers | 0.6092520 | Negative |
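The seven single-predictor fits can also be written as a loop that passes the data frame explicitly through `data =`, so the result does not depend on which data frame is currently attached. This is a sketch on simulated data; the data frame `sim` and the table name `table2` are assumptions for illustration.

```r
set.seed(2)
# Simulated stand-in data; column names mirror the report, values do not.
n <- 300
sim <- data.frame(
  speed_ground = rnorm(n, 80, 18),
  height       = rnorm(n, 30, 10),
  no_pasg      = rnorm(n, 60, 7)
)
sim$distance <- 40 * sim$speed_ground + 14 * sim$height + rnorm(n, 0, 350)

predictors <- c("speed_ground", "height", "no_pasg")

# Fit distance ~ x for each predictor, passing the data frame explicitly
# so the fit does not depend on whatever is currently attach()ed.
fits <- lapply(predictors, function(v) {
  lm(reformulate(v, response = "distance"), data = sim)
})
names(fits) <- predictors

# Row 2 of the coefficient matrix is the slope:
# column 1 holds the estimate, column 4 the p-value.
table2 <- data.frame(
  Variable_Name = predictors,
  Estimate      = sapply(fits, function(m) summary(m)$coefficients[2, 1]),
  p_value       = sapply(fits, function(m) summary(m)$coefficients[2, 4]),
  row.names = NULL
)
print(table2[order(table2$p_value), ])
```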
We will now standardize each X variable. In other words, we will create a new variable: X’ = (X - mean(X)) / sd(X)
FAA_std <- FAA
attach(FAA_std)
FAA_std <- transform(FAA_std, aircraft = (aircraft-mean(aircraft))/sd(aircraft))
FAA_std <- transform(FAA_std, duration = (duration-mean(duration))/sd(duration))
FAA_std <- transform(FAA_std, no_pasg = (no_pasg-mean(no_pasg))/sd(no_pasg))
FAA_std <- transform(FAA_std, speed_ground = (speed_ground-mean(speed_ground))/sd(speed_ground))
FAA_std <- transform(FAA_std, speed_air = (speed_air-mean(speed_air,na.rm = TRUE))/sd(speed_air,na.rm = TRUE))
FAA_std <- transform(FAA_std, height = (height-mean(height))/sd(height))
FAA_std <- transform(FAA_std, pitch = (pitch-mean(pitch))/sd(pitch))
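To see what standardization does to a regression coefficient, here is a small sketch on simulated data (the vectors `x` and `y` are made up). It checks that the manual formula agrees with base R's `scale()` and that the slope on the standardized predictor equals the raw slope multiplied by sd(X).

```r
set.seed(3)
# Simulated predictor and response; not the FAA data.
x <- rnorm(500, mean = 80, sd = 18)
y <- -1800 + 42 * x + rnorm(500, sd = 450)

# Standardize: subtract the mean, divide by the standard deviation.
x_std <- (x - mean(x)) / sd(x)

# scale() performs the same centering and scaling (it returns a matrix).
stopifnot(isTRUE(all.equal(x_std, as.numeric(scale(x)))))

slope_raw <- unname(coef(lm(y ~ x))[2])
slope_std <- unname(coef(lm(y ~ x_std))[2])

# The standardized slope equals the raw slope times sd(x), which puts
# predictors measured in different units on a comparable scale.
c(slope_std = slope_std, raw_times_sd = slope_raw * sd(x))
```

Because every standardized predictor has unit standard deviation, the sizes of these coefficients can be compared across variables.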
We will now regress Y (landing distance) on each of these X’ variables and create a summary table with the findings.
model1_aircraft <- lm(FAA_std$distance ~ FAA_std$aircraft)
model1_duration <- lm(FAA_std$distance ~ FAA_std$duration)
model1_nopsg <- lm(FAA_std$distance ~ FAA_std$no_pasg)
model1_speed_grd <- lm(FAA_std$distance ~ FAA_std$speed_ground)
model1_speed_air <- lm(FAA_std$distance ~ FAA_std$speed_air)
model1_height <- lm(FAA_std$distance ~ FAA_std$height)
model1_pitch <- lm(FAA_std$distance ~ FAA_std$pitch)
summary(model1_aircraft)$coef[2,1]
summary(model1_duration)$coef[2,1]
summary(model1_nopsg)$coef[2,1]
summary(model1_speed_grd)$coef[2,1]
summary(model1_speed_air)$coef[2,1]
summary(model1_height)$coef[2,1]
summary(model1_pitch)$coef[2,1]
Table 3
| Variable_Name | Size_of_Regressioncoeff | Direction_of_Regressioncoeff |
|---|---|---|
| Speed_Ground | 776.44740 | Positive |
| Speed_Air | 774.34650 | Positive |
| Aircraft | 213.45800 | Positive |
| Height | 89.10606 | Positive |
| Pitch | 78.00693 | Positive |
| Duration | 45.05839 | Negative |
| No. of Passengers | 15.91595 | Negative |
We can observe that the results from Tables 1, 2 and 3 are almost consistent. Only the correlation table ranks speed_air above speed_ground; all other rankings are in the same order.
Let us provide a single table that ranks all the factors based on their relative importance in determining the landing distance.
Table 0
| Variable_Name | Size_of_Regressioncoeff | Size_of_pvalue | Correlation_coeff |
|---|---|---|---|
| Speed_Ground | 776.44740 | 0.0000000 | 0.8662438 |
| Speed_Air | 774.34650 | 0.0000000 | 0.9420971 |
| Aircraft | 213.45800 | 0.0000000 | 0.2381445 |
| Height | 89.10606 | 0.0041239 | 0.0994112 |
| Pitch | 78.00693 | 0.0120812 | 0.0870285 |
| Duration | 45.05839 | 0.1514002 | 0.0502694 |
| No. of Passengers | 15.91595 | 0.6092520 | 0.0177566 |
Let us check three different models for collinearity:
#model1
summary(model_speed_grd)
##
## Call:
## lm(formula = distance ~ speed_ground)
##
## Residuals:
## Min 1Q Median 3Q Max
## -944.71 -328.81 -79.08 209.60 2413.21
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1810.5017 69.2987 -26.13 <2e-16 ***
## speed_ground 41.9940 0.8482 49.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 471 on 848 degrees of freedom
## Multiple R-squared: 0.743, Adjusted R-squared: 0.7427
## F-statistic: 2451 on 1 and 848 DF, p-value: < 2.2e-16
#model2
summary(model_speed_air)
##
## Call:
## lm(formula = distance ~ speed_air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -798.35 -193.18 3.68 216.25 817.79
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5718.404 201.994 -28.31 <2e-16 ***
## speed_air 82.175 1.937 42.43 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 285.9 on 206 degrees of freedom
## (642 observations deleted due to missingness)
## Multiple R-squared: 0.8973, Adjusted R-squared: 0.8968
## F-statistic: 1800 on 1 and 206 DF, p-value: < 2.2e-16
#model3
attach(FAA)
model3 <- lm(distance ~ speed_ground + speed_air)
summary(model3)
##
## Call:
## lm(formula = distance ~ speed_ground + speed_air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -819.74 -202.02 3.52 211.25 636.25
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5462.28 207.48 -26.327 < 2e-16 ***
## speed_ground -14.37 12.68 -1.133 0.258
## speed_air 93.96 12.89 7.291 6.99e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 200 degrees of freedom
## (628 observations deleted due to missingness)
## Multiple R-squared: 0.8883, Adjusted R-squared: 0.8871
## F-statistic: 795 on 2 and 200 DF, p-value: < 2.2e-16
We can observe that when we regress distance on both variables, speed_ground becomes statistically insignificant (p = 0.258) and its coefficient estimate turns negative. This may be due to the high correlation between speed_ground and speed_air.
Let us confirm this by checking the correlation.
cor(speed_air,speed_ground,use = "complete.obs")
## [1] 0.9879383
The result confirms the high correlation between speed_air and speed_ground, so we will include only one of these variables in the model. Even though the R-squared value is higher for the model with speed_air, speed_air has a lot of missing (NA) values, so I would keep only speed_ground in the model.
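The strength of this collinearity can also be expressed as a variance inflation factor (VIF). The sketch below uses only the correlation reported above; the rule of thumb that a VIF above 10 signals a problem is a common convention, not something established in this analysis.

```r
# Correlation between speed_air and speed_ground reported above.
r <- 0.9879383

# With two predictors, regressing one on the other gives R^2 = r^2,
# so the variance inflation factor is 1 / (1 - r^2).
vif <- 1 / (1 - r^2)
vif

# Standard errors are inflated by roughly sqrt(VIF) compared with
# uncorrelated predictors.
sqrt(vif)
```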
We now build a sequence of nested models, adding the variables one at a time and broadly following their ranking in Table 0 (no_pasg is added before duration).
attach(FAA)
model1 <- lm(distance ~ speed_ground)
model2 <- lm(distance ~ speed_ground + aircraft)
model3 <- lm(distance ~ speed_ground + aircraft + height)
model4 <- lm(distance ~ speed_ground + aircraft + height + pitch)
model5 <- lm(distance ~ speed_ground + aircraft + height + pitch + no_pasg)
model6 <- lm(distance ~ speed_ground + aircraft + height + pitch + no_pasg + duration)
Let us check how good each model is by looking at its R-squared.
model1_rsquared <- summary(model1)$r.squared
model2_rsquared <- summary(model2)$r.squared
model3_rsquared <- summary(model3)$r.squared
model4_rsquared <- summary(model4)$r.squared
model5_rsquared <- summary(model5)$r.squared
model6_rsquared <- summary(model6)$r.squared
rsquared <- cbind(c(model1_rsquared,model2_rsquared,model3_rsquared,model4_rsquared,model5_rsquared,model6_rsquared),1:6)
rsquared <- as.data.frame(rsquared)
names(rsquared) <- c("Rsquared","Variables")
rsquared
## Rsquared Variables
## 1 0.7503784 1
## 2 0.8251319 2
## 3 0.8488989 3
## 4 0.8493717 4
## 5 0.8497100 5
## 6 0.8497162 6
Let us check how R-squared varies with different models as we increase the number of predictors.
rsquared %>%
ggplot(aes(x= Variables, y= Rsquared)) +
geom_line( color="grey") +
geom_point(shape=21, color="black", fill="#69b3a2", size=6) +
scale_x_continuous("Variables", labels = as.character(rsquared$Variables), breaks = rsquared$Variables) +
theme_ipsum() +
ggtitle("R-squared with number of variables")
We can see that R-squared increases as the number of variables increases. Beyond 3 variables, however, the gain in R-squared is so small that the value remains almost constant.
In multiple regression, adjusted R-squared can be a better guide than R-squared because it penalizes the model for each additional variable. Let us check the adjusted R-squared values for the different models.
model1_adjrsquared <- summary(model1)$adj.r.squared
model2_adjrsquared <- summary(model2)$adj.r.squared
model3_adjrsquared <- summary(model3)$adj.r.squared
model4_adjrsquared <- summary(model4)$adj.r.squared
model5_adjrsquared <- summary(model5)$adj.r.squared
model6_adjrsquared <- summary(model6)$adj.r.squared
adjrsquared <- cbind(c(model1_adjrsquared,model2_adjrsquared,model3_adjrsquared,model4_adjrsquared,model5_adjrsquared,model6_adjrsquared),1:6)
adjrsquared <- as.data.frame(adjrsquared)
names(adjrsquared) <- c("AdjRsquared","Variables")
adjrsquared
## AdjRsquared Variables
## 1 0.7500773 1
## 2 0.8247095 2
## 3 0.8483508 3
## 4 0.8486423 4
## 5 0.8487992 5
## 6 0.8486219 6
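The penalty can be made explicit with the usual formula for adjusted R-squared. The sketch below plugs in the R-squared reported earlier for the three-variable model and the cleaned sample size; the result can be compared with the third row of the table above.

```r
# Values reported in this document for the three-variable model.
r2 <- 0.8488989   # R-squared
n  <- 831         # observations after cleaning
p  <- 3           # predictors: speed_ground, aircraft, height

# Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_r2
```

With n this large relative to p, the factor (n - 1) / (n - p - 1) is close to 1, which is why the two measures stay so close here.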
Let us plot the adjusted R-squared values against the number of variables.
adjrsquared %>%
ggplot(aes(x= Variables, y= AdjRsquared)) +
geom_line( color="grey") +
geom_point(shape=21, color="black", fill="#69b3a2", size=6) +
scale_x_continuous("Variables", labels = as.character(adjrsquared$Variables), breaks = adjrsquared$Variables) +
theme_ipsum() +
ggtitle("Adjusted R-squared with number of variables")
Adjusted R-squared behaves much like R-squared: the value increases up to 3 variables and then remains almost constant (it peaks at 5 variables and dips slightly at 6).
AIC can also be a good indicator of a good model. Let us check for the AIC values for different models.
model1_aic <- AIC(model1)
model2_aic <- AIC(model2)
model3_aic <- AIC(model3)
model4_aic <- AIC(model4)
model5_aic <- AIC(model5)
model6_aic <- AIC(model6)
aic <- cbind(c(model1_aic,model2_aic,model3_aic,model4_aic,model5_aic,model6_aic),1:6)
aic <- as.data.frame(aic)
names(aic) <- c("AIC","Variables")
aic
## AIC Variables
## 1 12508.81 1
## 2 12215.05 2
## 3 12095.65 3
## 4 12095.05 4
## 5 12095.18 5
## 6 12097.14 6
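The three criteria can also be collected in one pass rather than model by model. The following is a sketch on simulated data: the data frame `sim`, the vector `ordered_vars` and the table `fit_stats` are illustrative names, and the values are not the FAA data.

```r
set.seed(4)
# Simulated stand-in data; only the column names follow the report.
n <- 400
sim <- data.frame(
  speed_ground = rnorm(n, 80, 18),
  aircraft     = rbinom(n, 1, 0.5),
  height       = rnorm(n, 30, 10),
  pitch        = rnorm(n, 4, 0.5)
)
sim$distance <- -2500 + 42 * sim$speed_ground + 500 * sim$aircraft +
  14 * sim$height + rnorm(n, 0, 350)

# Add predictors one at a time in a fixed order.
ordered_vars <- c("speed_ground", "aircraft", "height", "pitch")
models <- lapply(seq_along(ordered_vars), function(k) {
  lm(reformulate(ordered_vars[1:k], response = "distance"), data = sim)
})

# Collect the three criteria in one data frame.
fit_stats <- data.frame(
  Variables   = seq_along(models),
  Rsquared    = sapply(models, function(m) summary(m)$r.squared),
  AdjRsquared = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC         = sapply(models, AIC)
)
print(fit_stats)
```

For nested least-squares models R-squared can never decrease as predictors are added, while adjusted R-squared and AIC can move in either direction.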
Let us plot the variation in AIC values against the number of variables.
aic %>%
ggplot(aes(x= Variables, y= AIC)) +
geom_line( color="grey") +
geom_point(shape=21, color="black", fill="#69b3a2", size=6) +
scale_x_continuous("Variables", labels = as.character(aic$Variables), breaks = aic$Variables) +
theme_ipsum() +
ggtitle("AIC with number of variables")
A better model has a lower AIC value. Here, the AIC decreases as variables are added, and although the lowest value belongs to the model with 4 variables, the differences are minimal once the number of variables reaches 3. Beyond 3 variables, the graph levels off.
So, based on the adjusted R-squared and AIC values, I would select the model with 3 variables, i.e. the regression of distance on speed_ground, aircraft and height. Let us check the performance of this model.
summary(model3)
##
## Call:
## lm(formula = distance ~ speed_ground + aircraft + height)
##
## Residuals:
## Min 1Q Median 3Q Max
## -711.95 -226.73 -90.17 130.04 1471.84
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2512.2433 68.1974 -36.84 <2e-16 ***
## speed_ground 42.4024 0.6483 65.41 <2e-16 ***
## aircraft 496.0452 24.2975 20.41 <2e-16 ***
## height 14.1478 1.2405 11.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 349.1 on 827 degrees of freedom
## Multiple R-squared: 0.8489, Adjusted R-squared: 0.8484
## F-statistic: 1549 on 3 and 827 DF, p-value: < 2.2e-16
model3_aic
## [1] 12095.65
The adjusted R-squared value is 0.8484, which means the model explains around 85% of the variation in the response variable.
We will now use the “stepAIC” function to find out the model it suggests through forward variable selection.
library(MASS)
# drop speed_air (column 5), which is mostly missing
FAA_step <- FAA[, names(FAA) != "speed_air"]
m1 <- lm(distance ~ 1,FAA_step)
m2 <- lm(distance~ . ,FAA_step)
f <- stepAIC(m1, direction="forward", scope=list(lower=m1, upper=m2))
## Start: AIC=11299.8
## distance ~ 1
##
## Df Sum of Sq RSS AIC
## + speed_ground 1 500382567 166457762 10148
## + aircraft 1 37818390 629021939 11253
## + height 1 6590108 660250221 11294
## + pitch 1 5050617 661789712 11296
## + duration 1 1685114 665155215 11300
## <none> 666840329 11300
## + no_pasg 1 210253 666630076 11302
##
## Step: AIC=10148.53
## distance ~ speed_ground
##
## Df Sum of Sq RSS AIC
## + aircraft 1 49848656 116609106 9854.8
## + height 1 14916377 151541385 10072.5
## + pitch 1 9765095 156692668 10100.3
## <none> 166457762 10148.5
## + no_pasg 1 207528 166250234 10149.5
## + duration 1 51669 166406094 10150.3
##
## Step: AIC=9854.77
## distance ~ speed_ground + aircraft
##
## Df Sum of Sq RSS AIC
## + height 1 15848830 100760276 9735.4
## + pitch 1 455453 116153653 9853.5
## <none> 116609106 9854.8
## + no_pasg 1 87171 116521935 9856.1
## + duration 1 8445 116600661 9856.7
##
## Step: AIC=9735.37
## distance ~ speed_ground + aircraft + height
##
## Df Sum of Sq RSS AIC
## + pitch 1 315259 100445017 9734.8
## <none> 100760276 9735.4
## + no_pasg 1 232003 100528273 9735.5
## + duration 1 3976 100756300 9737.3
##
## Step: AIC=9734.77
## distance ~ speed_ground + aircraft + height + pitch
##
## Df Sum of Sq RSS AIC
## <none> 100445017 9734.8
## + no_pasg 1 225608 100219409 9734.9
## + duration 1 6696 100438321 9736.7
We can see that stepAIC suggests a regression model with 4 variables (distance ~ speed_ground + aircraft + height + pitch). This is consistent with what we observed from the graph of AIC values: the model with 4 variables has the lowest AIC value.
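The AIC values printed by stepAIC (for example 9735.37) are on a different scale from those returned by AIC() (12095.65 for the same three-variable model). For linear models the two differ by a constant that depends only on the sample size, n * (log(2 * pi) + 1) + 2. With n = 831 this is about 2360.3, which matches the gap between the two figures. A sketch on simulated data (the variables `x1`, `x2` and `y` are made up):

```r
set.seed(5)
# Simulated data; not the FAA data.
n <- 150
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)
fit <- lm(y ~ x1 + x2)

aic_full <- AIC(fit)             # based on the full Gaussian log-likelihood
aic_step <- extractAIC(fit)[2]   # n * log(RSS / n) + 2 * edf, used by step functions

# The gap depends only on n, so both scales rank models identically
# as long as the models are fit to the same rows.
offset <- n * (log(2 * pi) + 1) + 2
c(difference = aic_full - aic_step, offset = offset)
```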
Let us check for the summary of the regression model with these variables.
summary(lm(distance ~ speed_ground + aircraft + height + pitch))
##
## Call:
## lm(formula = distance ~ speed_ground + aircraft + height + pitch)
##
## Residuals:
## Min 1Q Median 3Q Max
## -716.81 -224.12 -93.24 127.80 1500.95
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2664.3223 116.4605 -22.88 <2e-16 ***
## speed_ground 42.4283 0.6479 65.49 <2e-16 ***
## aircraft 481.2682 25.9512 18.55 <2e-16 ***
## height 14.0909 1.2398 11.37 <2e-16 ***
## pitch 39.6076 24.5991 1.61 0.108
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 348.7 on 826 degrees of freedom
## Multiple R-squared: 0.8494, Adjusted R-squared: 0.8486
## F-statistic: 1164 on 4 and 826 DF, p-value: < 2.2e-16
We can observe that the variable pitch is statistically insignificant in this model (p = 0.108). So, in line with our conclusions from the adjusted R-squared, I would take the model with 3 variables, i.e. the regression of distance on speed_ground, aircraft and height, as the final model.