Landing Distance Assessment

1.0 Case Overview

What is landing distance and why is it important?

Definition: Landing distance is the horizontal distance traversed by the aeroplane from a point on the approach path at a selected height above the landing surface to the point on the landing surface at which the aeroplane comes to a complete stop.

(Source: ICAO Annex 8 Part IIIA Paragraph 2.2.3.3. and Part IIIB Sub-part B Paragraph B2.7 e)

Factors Affecting Actual Landing Distance

Variable dictionary:

  • Aircraft: The make of an aircraft (Boeing or Airbus).
  • Duration (in minutes): Flight duration between take-off and landing. The duration of a normal flight should always be greater than 40 minutes.
  • No_pasg: The number of passengers in a flight.
  • Speed_ground (in miles per hour): The ground speed of an aircraft when passing over the threshold of the runway. A landing is considered abnormal if this value is below 30 mph or above 140 mph.
  • Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of the runway. A landing is considered abnormal if this value is below 30 mph or above 140 mph.
  • Height (in meters): The height of an aircraft when it is passing over the threshold of the runway. The landing aircraft is required to be at least 6 meters high at the threshold of the runway.
  • Pitch (in degrees): Pitch angle of an aircraft when it is passing over the threshold of the runway.
  • Distance (in feet): The landing distance of an aircraft. More specifically, it refers to the distance between the threshold of the runway and the point where the aircraft can be fully stopped. The length of the airport runway is typically less than 6000 feet.

2.0 Initial Set Up

2.1 Package Loading

Required Packages

  • tidyverse (dplyr, ggplot2, ...) - data reading, manipulation and visualisation
  • readxl - reading Excel files (provides read_excel, used in Step 01)
  • plotly - interactive visualisation
  • kableExtra - styling data tables within Markdown
  • gridExtra - graphical arrangement
  • forecast - time series and forecasting
  • ggplot2 - graphical representation
  • stringr - string manipulation
  • corrplot - correlograms
  • knitr - dynamic report generation

2.2 Data Loading

Read the two files ‘FAA-1.xls’ (800 flights) and ‘FAA-2.xls’ into your R system. Please search “Read Excel files from R” in Google in case you do not know how to do that.

STEP 01

library(readxl)  # provides read_excel()
# Read the two workbooks, one at a time
FAA1 <- read_excel("FAA1.xls")
FAA2 <- read_excel("FAA2.xls")
dim(FAA1)
## [1] 800   8
dim(FAA2)
## [1] 150   7

STEP 02

Check the structure of each data set using the “str” function. For each data set, what is the sample size and how many variables? Is there any difference between the two data sets?

str(FAA1)
## Classes 'tbl_df', 'tbl' and 'data.frame':    800 obs. of  8 variables:
##  $ aircraft    : chr  "boeing" "boeing" "boeing" "boeing" ...
##  $ duration    : num  98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...
str(FAA2)
## Classes 'tbl_df', 'tbl' and 'data.frame':    150 obs. of  7 variables:
##  $ aircraft    : chr  "boeing" "boeing" "boeing" "boeing" ...
##  $ no_pasg     : num  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...
  • FAA1: 800 rows and 8 columns (sample size 800).
  • FAA2: 150 rows and 7 columns (sample size 150).
  • FAA2 does not have a duration column.
  • Apart from that, FAA1 and FAA2 have the same column names and data types.
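
The column comparison above can also be done programmatically. The following is a minimal sketch on toy stand-ins for FAA1 and FAA2; the data values and the names faa1_toy, faa2_toy, missing_in_2 and type_mismatch are illustrative assumptions, not part of the original analysis.

```r
# Toy stand-ins for FAA1 and FAA2 (hypothetical values, same column layout)
faa1_toy <- data.frame(aircraft = "boeing", duration = 98.5, no_pasg = 53,
                       speed_ground = 107.9, speed_air = 109, height = 27.4,
                       pitch = 4.04, distance = 3370)
faa2_toy <- faa1_toy[, names(faa1_toy) != "duration"]

# Columns present in the first table but absent from the second
missing_in_2 <- setdiff(names(faa1_toy), names(faa2_toy))
print(missing_in_2)

# Shared columns whose classes differ (empty if the types agree)
shared <- intersect(names(faa1_toy), names(faa2_toy))
class1 <- vapply(faa1_toy[shared], function(x) class(x)[1], character(1))
class2 <- vapply(faa2_toy[shared], function(x) class(x)[1], character(1))
type_mismatch <- shared[class1 != class2]
print(type_mismatch)
```

Applied to the real tables, the same two checks would be expected to report duration as the only missing column and no type mismatches, in line with the str output above.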

2.3 Data Merge

To combine the two data sets, both must have the same structure, that is, the same columns. We therefore add a duration column (filled with NA) to FAA2 and then stack the two tables with the rbind function, giving 800 + 150 = 950 rows (sample size 950).

Merge the 2 data sets FAA1 & 2 to create a composite dataset

STEP 03

FAA2$duration<-NA
head(FAA2)
## # A tibble: 6 x 8
##   aircraft no_pasg speed_ground speed_air height pitch distance duration
##   <chr>      <dbl>        <dbl>     <dbl>  <dbl> <dbl>    <dbl> <lgl>   
## 1 boeing        53        108.       109.   27.4  4.04    3370. NA      
## 2 boeing        69        102.       103.   27.8  4.12    2988. NA      
## 3 boeing        61         71.1       NA    18.6  4.43    1145. NA      
## 4 boeing        56         85.8       NA    30.7  3.88    1664. NA      
## 5 boeing        70         59.9       NA    32.4  4.03    1050. NA      
## 6 boeing        55         75.0       NA    41.2  4.20    1627. NA
FAA_merge<-rbind(FAA1,FAA2)
str(FAA_merge)
## Classes 'tbl_df', 'tbl' and 'data.frame':    950 obs. of  8 variables:
##  $ aircraft    : chr  "boeing" "boeing" "boeing" "boeing" ...
##  $ duration    : num  98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...
dim(FAA_merge)
## [1] 950   8
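
As an alternative to adding the missing column by hand, dplyr's bind_rows() matches columns by name and fills absent ones with NA. This is a hedged sketch on toy data (the tables a and b and the source label column are assumptions for illustration); it is not the method used in this report.

```r
library(dplyr)

# Toy tables: `b` lacks the duration column, like FAA2
a <- data.frame(aircraft = c("boeing", "airbus"),
                duration = c(98.5, 125.7),
                distance = c(3370, 2988))
b <- data.frame(aircraft = "boeing", distance = 1145)

# bind_rows() aligns columns by name; duration becomes NA for rows from `b`.
# Naming the inputs and passing .id adds a column recording each row's origin.
merged <- bind_rows(FAA1 = a, FAA2 = b, .id = "source")
print(merged)
```

One design difference: here the filled duration column stays numeric, whereas FAA2$duration<-NA creates a logical column that is only coerced to numeric when the tables are stacked.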

STEP 04

Check the structure of the combined data set. What is the sample size and how many variables? Provide summary statistics for each variable.

#FAA_merge %>% distinct()
row_dup<-duplicated(FAA_merge[,-2])
class(row_dup)
## [1] "logical"
FAA_U<-FAA_merge[!row_dup,]
dim(FAA_U)
## [1] 850   8

Duplicate check: because FAA2 has NA in the duration column, duplicates are identified on all columns except duration (column 2). Duplicate rows are removed to avoid redundancy, which drops 950 - 850 = 100 rows.
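
The duplicate check can be illustrated on a small example. This is a sketch with made-up rows (toy, key_cols, is_dup and deduped are hypothetical names); it shows that duplicated() flags the later copy, so when the table with duration is stacked first, the retained row keeps its duration value.

```r
# Row 2 repeats row 1 except that its duration is missing
toy <- data.frame(
  aircraft     = c("boeing", "boeing", "airbus"),
  duration     = c(98.5, NA, 120.1),
  speed_ground = c(107.9, 107.9, 80.2),
  distance     = c(3370, 3370, 1500)
)

# Compare rows on every column except duration
key_cols <- setdiff(names(toy), "duration")
is_dup   <- duplicated(toy[, key_cols])  # TRUE marks the second and later copies

# Keep the first occurrence of each flight
deduped <- toy[!is_dup, ]
print(deduped)
```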

dim(FAA_U)
## [1] 850   8
str(FAA_U)
## Classes 'tbl_df', 'tbl' and 'data.frame':    850 obs. of  8 variables:
##  $ aircraft    : chr  "boeing" "boeing" "boeing" "boeing" ...
##  $ duration    : num  98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...
summary(FAA_U)
##    aircraft            duration         no_pasg      speed_ground   
##  Length:850         Min.   : 14.76   Min.   :29.0   Min.   : 27.74  
##  Class :character   1st Qu.:119.49   1st Qu.:55.0   1st Qu.: 65.90  
##  Mode  :character   Median :153.95   Median :60.0   Median : 79.64  
##                     Mean   :154.01   Mean   :60.1   Mean   : 79.45  
##                     3rd Qu.:188.91   3rd Qu.:65.0   3rd Qu.: 92.06  
##                     Max.   :305.62   Max.   :87.0   Max.   :141.22  
##                     NA's   :50                                      
##    speed_air          height           pitch          distance      
##  Min.   : 90.00   Min.   :-3.546   Min.   :2.284   Min.   :  34.08  
##  1st Qu.: 96.25   1st Qu.:23.314   1st Qu.:3.642   1st Qu.: 883.79  
##  Median :101.15   Median :30.093   Median :4.008   Median :1258.09  
##  Mean   :103.80   Mean   :30.144   Mean   :4.009   Mean   :1526.02  
##  3rd Qu.:109.40   3rd Qu.:36.993   3rd Qu.:4.377   3rd Qu.:1936.95  
##  Max.   :141.72   Max.   :59.946   Max.   :5.927   Max.   :6533.05  
##  NA's   :642

Summary statistics: the de-duplicated data set has 850 unique rows and 8 columns. The summary statistics for each column are shown above.

STEP 05

KEY INSIGHTS:

  • The duration column is missing from the FAA2 file.
  • The speed_air column has 642 NA values.
  • The duration, distance and height columns show a wide range between their minimum and maximum values.
  • The height column contains a negative value, which is not physically plausible because height must be positive. This may be a data recording or data reading issue.

3.0 Data Cleaning

STEP 06

Are there abnormal values in the data set? Please refer to the variable dictionary for criteria defining “normal/abnormal” values. Remove the rows that contain any “abnormal values” and report how many rows you have removed.

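# Rows whose duration is NA come back from this subsetting as all-NA rows
FAA_U<-FAA_U[FAA_U$duration>=40,]
# Use the vectorised `&`: the scalar `&&` tests only the first element in older R
# and is an error for longer vectors from R 4.3.0. The output below comes from the
# original run with `&&`, so the speed_ground bounds were not applied row by row.
FAA_U<-FAA_U[FAA_U$speed_ground>=30 & FAA_U$speed_ground<=140,]
FAA_U<-FAA_U[FAA_U$height>=6,]
FAA_U<-FAA_U[FAA_U$distance<=6000,]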
nrow(FAA_U)
## [1] 833
sum(is.na(FAA_U$duration))
## [1] 50
sum(is.na(FAA_U$speed_ground))
## [1] 50
sum(is.na(FAA_U$height))
## [1] 50
sum(is.na(FAA_U$distance))
## [1] 50
# Remove rows in which every column is NA
FAA_U_RmvNA<-FAA_U %>% filter(!(is.na(aircraft)&is.na(duration)&is.na(no_pasg)&is.na(speed_ground)&is.na(speed_air)&is.na(height)&is.na(pitch)&is.na(distance)))
#nrow(FAA_U_RmvNA)
#Looking for NA Values in columns
paste("Number of NA in AIRCRAFT:" , sum(is.na(FAA_U_RmvNA$aircraft)))
## [1] "Number of NA in AIRCRAFT: 0"
paste("Number of NA in DURATION:" , sum(is.na(FAA_U_RmvNA$duration)))
## [1] "Number of NA in DURATION: 0"
paste("Number of NA in NO_PASG:" , sum(is.na(FAA_U_RmvNA$no_pasg)))
## [1] "Number of NA in NO_PASG: 0"
paste("Number of NA in SPEED GROUND:" , sum(is.na(FAA_U_RmvNA$speed_ground)))
## [1] "Number of NA in SPEED GROUND: 0"
paste("Number of NA in SPEED AIR:" , sum(is.na(FAA_U_RmvNA$speed_air)))
## [1] "Number of NA in SPEED AIR: 588"
paste("Number of NA in HEIGHT:" , sum(is.na(FAA_U_RmvNA$height)))
## [1] "Number of NA in HEIGHT: 0"
paste("Number of NA in PITCH:" , sum(is.na(FAA_U_RmvNA$pitch)))
## [1] "Number of NA in PITCH: 0"
paste("Number of NA in DISTANCE:" , sum(is.na(FAA_U_RmvNA$distance)))
## [1] "Number of NA in DISTANCE: 0"
nrow(FAA_U_RmvNA)
## [1] 783

As a first step in data cleaning, we filter out the rows that do not qualify for analysis. There are two categories of such rows:

  • Abnormal values: rows that violate the ranges given in the variable dictionary. The code above removes them.
  • NA rows: rows in which every cell is NA. These arise because duration is NA for the 50 flights that appear only in FAA2: subsetting on duration>=40 returns an all-NA row wherever the condition is NA. Removing these rows therefore drops those 50 flights from the analysis.

  • Conclusion: we removed 850 - 783 = 67 rows in total.
  • Abnormal values: 17 rows
  • All-NA rows: 50 rows
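
The interaction between NA values and logical subsetting described above can be reproduced on a few rows. The sketch below uses made-up data, and the helper ok() is a hypothetical name introduced here. It also shows one possible alternative, not used in this report, that treats a missing value as "not abnormal" so that flights without a duration are kept.

```r
toy <- data.frame(
  duration     = c(98.5, NA, 30.0, 120.0),
  speed_ground = c(107.9, 85.0, 90.0, 27.7),
  height       = c(27.4, 30.7, 18.6, 32.4),
  distance     = c(3370, 1664, 1145, 1050)
)

# An NA in the condition produces a row made entirely of NA
na_rows <- toy[toy$duration >= 40, ]
print(na_rows)

# ok(): hypothetical helper; a condition that is NA counts as passing
ok <- function(cond) is.na(cond) | cond

# Vectorised `&` combines the row-wise checks from the variable dictionary
keep <- ok(toy$duration >= 40) &
  ok(toy$speed_ground >= 30 & toy$speed_ground <= 140) &
  ok(toy$height >= 6) &
  ok(toy$distance <= 6000)

cleaned <- toy[keep, ]
print(cleaned)
```

In this toy example the flight with a missing duration survives, while the flights with a 30-minute duration and a ground speed below 30 mph are removed. Whether missing durations should be kept is an analysis decision; the report above drops them.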

3.1 Summary Statistics

STEP 07

#Sample Size:783
summary(FAA_U_RmvNA)  
##    aircraft            duration         no_pasg       speed_ground   
##  Length:783         Min.   : 41.95   Min.   :29.00   Min.   : 27.74  
##  Class :character   1st Qu.:119.67   1st Qu.:55.00   1st Qu.: 66.01  
##  Mode  :character   Median :154.28   Median :60.00   Median : 79.75  
##                     Mean   :154.83   Mean   :60.07   Mean   : 79.51  
##                     3rd Qu.:189.75   3rd Qu.:65.00   3rd Qu.: 92.13  
##                     Max.   :305.62   Max.   :87.00   Max.   :132.78  
##                                                                      
##    speed_air          height           pitch          distance      
##  Min.   : 90.00   Min.   : 6.228   Min.   :2.284   Min.   :  41.72  
##  1st Qu.: 96.15   1st Qu.:23.562   1st Qu.:3.654   1st Qu.: 919.67  
##  Median :100.89   Median :30.203   Median :4.017   Median :1273.66  
##  Mean   :103.50   Mean   :30.438   Mean   :4.015   Mean   :1540.33  
##  3rd Qu.:109.42   3rd Qu.:36.984   3rd Qu.:4.385   3rd Qu.:1960.41  
##  Max.   :132.91   Max.   :59.946   Max.   :5.927   Max.   :5381.96  
##  NA's   :588
  • Key Insights:
  • speed_air: 588 of the 783 values (about 75%) are still NA, which can be a problem during analysis.
  • distance: there is a wide range between the minimum (41.72) and maximum (5381.96) values.

3.2 Univariate Analysis

STEP 08

Since you have a small set of variables, you may want to show histograms for all of them.

for(i in 2:ncol(FAA_U_RmvNA)){
  w<-as.numeric(unlist(round(FAA_U_RmvNA[,i],2)))
  par(mfrow=c(2,1))
  hist(w,main = paste("Histogram of" , colnames(FAA_U_RmvNA)[i]))
  qqnorm(w,main = paste("QQ PLOT of" , colnames(FAA_U_RmvNA)[i]))
  qqline(w, col='red')
}

STEP 09

  • Key Insights:
  • We removed 17 rows with abnormal values from the data set.
  • We removed 50 rows in which every cell was NA.
  • About 75% of the values in the speed_air column are still missing.
  • Univariate analysis suggests that duration, no_pasg, speed_ground, height and pitch are approximately normally distributed, whereas distance and speed_air are right-skewed. The QQ plots confirm this.
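
A numeric skewness estimate can complement the visual reading of the histograms and QQ plots. The sketch below uses simulated data, and sample_skewness() is a hypothetical helper written for this illustration (base R has no skewness function). Values near 0 suggest symmetry and clearly positive values suggest a right skew.

```r
# Moment-based sample skewness; NA values are dropped first
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(1)
symmetric    <- rnorm(1000)  # simulated, roughly symmetric
right_skewed <- rexp(1000)   # simulated, long right tail

print(sample_skewness(symmetric))
print(sample_skewness(right_skewed))
```

The same function could be applied to each numeric column of FAA_U_RmvNA, for example with sapply().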

4.0 Data Analysis

4.1 Correlation Analysis

Initial analysis for identifying the important factors that impact the response variable “landing distance”.

STEP 10

Compute the pairwise correlation between the landing distance and each factor X. Provide a table that ranks the factors based on the size (absolute value) of the correlation. This table contains three columns: the names of variables, the size of the correlation, the direction of the correlation (positive or negative). We call it Table 1, which will be used for comparison with our analysis later.

#Correlation table

FAA_U_RmvNA[,1]<-ifelse(FAA_U_RmvNA[,1]=='boeing',0,1)
#FAA_U_RmvNA$aircraft<-as.factor(FAA_U_RmvNA$aircraft)

str(FAA_U_RmvNA)
## Classes 'tbl_df', 'tbl' and 'data.frame':    783 obs. of  8 variables:
##  $ aircraft    : num [1:783, 1] 0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr "aircraft"
##  $ duration    : num  98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...
cor_vec <- vector("numeric")
abs_vec <- vector("character")
for(i in 1:8){
  if(i==1){
    cor_vec[i]<-cor(FAA_U_RmvNA[,i],FAA_U_RmvNA[,8])
  }
  if(i==2|i==3|i==4|i==6|i==7|i==8){
  cor_vec[i]<-cor(FAA_U_RmvNA[,i],FAA_U_RmvNA[,8])
  }
  if(i==5){
        f<-na.omit(FAA_U_RmvNA)
        cor_vec[i] <-round(cor(f[,5],f[,8]),3)
  }
  abs_vec[i]<-ifelse(sign(cor_vec[i])==-1,"Negative","Positive")
  
}
df<-data.frame(cbind(noquote(colnames(FAA_U_RmvNA)),noquote(round(cor_vec,3)),noquote(abs_vec)))
colnames(df)[1] <- "Attribute"
colnames(df)[2] <- "Correlation"
colnames(df)[3] <- "Sign"
df
##      Attribute Correlation     Sign
## 1     aircraft      -0.229 Negative
## 2     duration      -0.052 Negative
## 3      no_pasg      -0.016 Negative
## 4 speed_ground       0.862 Positive
## 5    speed_air       0.943 Positive
## 6       height       0.104 Positive
## 7        pitch       0.068 Positive
## 8     distance           1 Positive
FAA_U_RmvNA$aircraft<-as.factor(FAA_U_RmvNA$aircraft)

STEP 11

Show X-Y scatter plots. Do you think the correlation strength observed in these plots is consistent with the values computed in Step 10?

par(mfrow=c(2,2))
# In plot(y~x), x is drawn on the horizontal axis and y (distance) on the vertical axis
plot(FAA_U_RmvNA$distance~FAA_U_RmvNA$aircraft,col="blue",xlab="aircraft",ylab="distance")
plot(FAA_U_RmvNA$distance~FAA_U_RmvNA$duration,col="blue",xlab="duration",ylab="distance")
plot(FAA_U_RmvNA$distance~FAA_U_RmvNA$no_pasg,col="blue",xlab="no_pasg",ylab="distance")
plot(FAA_U_RmvNA$distance~FAA_U_RmvNA$speed_ground,col="blue",xlab="speed_ground",ylab="distance")

par(mfrow=c(2,2))

plot(FAA_U_RmvNA$distance~FAA_U_RmvNA$speed_air,col="blue",xlab="speed_air",ylab="distance")
plot(FAA_U_RmvNA$distance~FAA_U_RmvNA$height,col="blue",xlab="height",ylab="distance")
plot(FAA_U_RmvNA$distance~FAA_U_RmvNA$pitch,col="blue",xlab="pitch",ylab="distance")

  • Key Insight: speed_ground and speed_air show a strong positive relationship with landing distance, consistent with the correlations in Step 10. The other variables show no clear pattern.

STEP 12

Have you included the airplane make as a possible factor in Steps 10-11? You can code this character variable as 0/1.

Yes. The aircraft make is included in the analysis by coding its two categories as 0 and 1: boeing = 0, airbus = 1.
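
The correlation table from Step 10 could also be built without an explicit loop. The following is a minimal sketch on toy data (the values and the names toy, predictors and table1 are assumptions); it relies on the use = "pairwise.complete.obs" option of cor(), which drops the rows where either variable is NA, and on the 0/1 coding of aircraft described above.

```r
# Toy data with aircraft already coded 0/1 and some missing speed_air values
toy <- data.frame(
  aircraft     = c(0, 0, 1, 1, 0, 1),
  speed_ground = c(108, 102, 71, 86, 60, 75),
  speed_air    = c(109, 103, NA, NA, NA, 96),
  distance     = c(3370, 2988, 1145, 1664, 1050, 1627)
)

predictors <- setdiff(names(toy), "distance")

# Correlation of each predictor with distance, using only complete pairs
r <- sapply(predictors, function(v) {
  cor(toy[[v]], toy$distance, use = "pairwise.complete.obs")
})

table1 <- data.frame(
  Attribute   = predictors,
  Correlation = round(r, 3),
  Sign        = ifelse(r < 0, "Negative", "Positive"),
  row.names   = NULL
)

# Rank by the absolute size of the correlation, largest first
table1 <- table1[order(-abs(table1$Correlation)), ]
print(table1)
```

Sorting on the absolute value matches the ranking asked for in Step 10, and leaving distance out of the predictor list avoids the trivial correlation of 1 with itself.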

4.2 Regression Analysis

STEP 13

Regress Y (landing distance) on each of the X variables. Provide a table that ranks the factors based on its significance. The smaller the p-value, the more significant the factor. This table contains three columns: the names of variables, the size of the p-value, the direction of the regression coefficient (positive or negative). We call it Table 2.

Each model below uses a single factor at a time.

fit1<-lm(distance~aircraft,FAA_U_RmvNA)
fit2<-lm(distance~duration,FAA_U_RmvNA)
fit3<-lm(distance~no_pasg,FAA_U_RmvNA)
fit4<-lm(distance~speed_ground,FAA_U_RmvNA)
fit5<-lm(distance~speed_air,FAA_U_RmvNA)
fit6<-lm(distance~height,FAA_U_RmvNA)
fit7<-lm(distance~pitch,FAA_U_RmvNA)

summary(fit1)
## 
## Call:
## lm(formula = distance ~ aircraft, data = FAA_U_RmvNA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1293.4  -636.3  -233.0   390.5  3633.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1748.15      44.63  39.170  < 2e-16 ***
## aircraft1    -413.00      62.92  -6.564 9.52e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 880.2 on 781 degrees of freedom
## Multiple R-squared:  0.05229,    Adjusted R-squared:  0.05108 
## F-statistic: 43.09 on 1 and 781 DF,  p-value: 9.522e-11
summary(fit2)#duration
## 
## Call:
## lm(formula = distance ~ duration, data = FAA_U_RmvNA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1463.7  -614.1  -273.7   408.9  3848.5 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1690.9392   108.3535  15.606   <2e-16 ***
## duration      -0.9727     0.6681  -1.456    0.146    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 903 on 781 degrees of freedom
## Multiple R-squared:  0.002707,   Adjusted R-squared:  0.00143 
## F-statistic:  2.12 on 1 and 781 DF,  p-value: 0.1458
summary(fit3)
## 
## Call:
## lm(formula = distance ~ no_pasg, data = FAA_U_RmvNA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1465.5  -629.0  -263.6   411.8  3865.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1657.903    259.783   6.382    3e-10 ***
## no_pasg       -1.957      4.291  -0.456    0.648    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 904.1 on 781 degrees of freedom
## Multiple R-squared:  0.0002663,  Adjusted R-squared:  -0.001014 
## F-statistic: 0.208 on 1 and 781 DF,  p-value: 0.6484
summary(fit4)#speed ground
## 
## Call:
## lm(formula = distance ~ speed_ground, data = FAA_U_RmvNA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -918.86 -322.87  -75.58  209.68 1900.61 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1711.1215    70.3255  -24.33   <2e-16 ***
## speed_ground    40.8941     0.8602   47.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 458.2 on 781 degrees of freedom
## Multiple R-squared:  0.7432, Adjusted R-squared:  0.7429 
## F-statistic:  2260 on 1 and 781 DF,  p-value: < 2.2e-16
summary(fit5)#speed air
## 
## Call:
## lm(formula = distance ~ speed_air, data = FAA_U_RmvNA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -783.22 -189.61    2.73  215.76  623.27 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5417.607    208.860  -25.94   <2e-16 ***
## speed_air      79.244      2.009   39.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.4 on 193 degrees of freedom
##   (588 observations deleted due to missingness)
## Multiple R-squared:  0.8897, Adjusted R-squared:  0.8891 
## F-statistic:  1556 on 1 and 193 DF,  p-value: < 2.2e-16
summary(fit6)#height
## 
## Call:
## lm(formula = distance ~ height, data = FAA_U_RmvNA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1347.5  -612.0  -248.3   410.4  3921.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1245.566    105.578  11.798  < 2e-16 ***
## height         9.684      3.304   2.931  0.00348 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 899.3 on 781 degrees of freedom
## Multiple R-squared:  0.01088,    Adjusted R-squared:  0.009614 
## F-statistic: 8.591 on 1 and 781 DF,  p-value: 0.003477
summary(fit7)
## 
## Call:
## lm(formula = distance ~ pitch, data = FAA_U_RmvNA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1355.6  -646.9  -252.4   403.1  3826.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1068.2      250.2   4.269  2.2e-05 ***
## pitch          117.6       61.8   1.903   0.0574 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 902.1 on 781 degrees of freedom
## Multiple R-squared:  0.004615,   Adjusted R-squared:  0.003341 
## F-statistic: 3.621 on 1 and 781 DF,  p-value: 0.05742
inte <- vector("numeric")
pvalue <- vector("numeric")
coeff<-vector("numeric")
icoeff<-vector("numeric")

#Intercept:
inte[1]<-summary(fit1)$coefficients[1,4]
inte[2]<-summary(fit2)$coefficients[1,4]
inte[3]<-summary(fit3)$coefficients[1,4] 
inte[4]<-summary(fit4)$coefficients[1,4] 
inte[5]<-summary(fit5)$coefficients[1,4] 
inte[6]<-summary(fit6)$coefficients[1,4] 
inte[7]<-summary(fit7)$coefficients[1,4] 

pvalue[1]<-summary(fit1)$coefficients[2,4]
pvalue[2]<-summary(fit2)$coefficients[2,4]
pvalue[3]<-summary(fit3)$coefficients[2,4] 
pvalue[4]<-summary(fit4)$coefficients[2,4] 
pvalue[5]<-summary(fit5)$coefficients[2,4] 
pvalue[6]<-summary(fit6)$coefficients[2,4] 
pvalue[7]<-summary(fit7)$coefficients[2,4] 

coeff[1]<-summary(fit1)$coefficients[2,1]
coeff[2]<-summary(fit2)$coefficients[2,1]
coeff[3]<-summary(fit3)$coefficients[2,1] 
coeff[4]<-summary(fit4)$coefficients[2,1] 
coeff[5]<-summary(fit5)$coefficients[2,1] 
coeff[6]<-summary(fit6)$coefficients[2,1] 
coeff[7]<-summary(fit7)$coefficients[2,1] 

icoeff[1]<-summary(fit1)$coefficients[1,1]
icoeff[2]<-summary(fit2)$coefficients[1,1]
icoeff[3]<-summary(fit3)$coefficients[1,1] 
icoeff[4]<-summary(fit4)$coefficients[1,1] 
icoeff[5]<-summary(fit5)$coefficients[1,1] 
icoeff[6]<-summary(fit6)$coefficients[1,1] 
icoeff[7]<-summary(fit7)$coefficients[1,1] 

length(colnames(FAA_U_RmvNA)[1:7])
## [1] 7
length(inte)
## [1] 7
length(pvalue)
## [1] 7
length(coeff)
## [1] 7
length(icoeff)
## [1] 7
sign_rc<-vector("character")
sign_rc<-ifelse(sign(coeff)==-1,"Negative","Positive")


# Compare the raw p-values with the 0.0001 threshold (round() without digits would round them to 0 or 1)
df2<-data.frame(colnames(FAA_U_RmvNA)[1:7],ifelse(inte<.0001,'<.0001',round(inte,5)),ifelse(pvalue<0.0001,'<.0001',round(pvalue,5)),icoeff,round(coeff,2),sign_rc)
colnames(df2)[1] <- "Attribute"
colnames(df2)[2] <- "Beta0Pvl"
colnames(df2)[3] <- "Beta1Pvl"
colnames(df2)[4] <- "Beta0"
colnames(df2)[5] <- "Beta1"
colnames(df2)[6] <- "Sign_rc"

arrange(df2,Beta1Pvl)
##      Attribute Beta0Pvl Beta1Pvl     Beta0   Beta1  Sign_rc
## 1     aircraft   <.0001   <.0001  1748.152 -413.00 Negative
## 2 speed_ground   <.0001   <.0001 -1711.121   40.89 Positive
## 3    speed_air   <.0001   <.0001 -5417.607   79.24 Positive
## 4       height   <.0001  0.00348  1245.566    9.68 Positive
## 5        pitch   <.0001  0.05742  1068.187  117.59 Positive
## 6     duration   <.0001  0.14579  1690.939   -0.97 Negative
## 7      no_pasg   <.0001  0.64844  1657.903   -1.96 Negative

Insight: based on the p-values, aircraft, speed_ground, speed_air and height are significant predictors of landing distance, and pitch is borderline (p ≈ 0.057). This adds to the correlation analysis, which only highlighted speed_ground and speed_air as being associated with landing distance.
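
The seven single-predictor fits and the table built from them can be expressed more compactly with a loop over variable names. This is a sketch on simulated data (the data-generating coefficients and the names toy, rows and table2 are assumptions, not results from the FAA data). It keeps the p-values numeric, so the table is sorted numerically rather than as text.

```r
set.seed(42)
n <- 200
toy <- data.frame(
  speed_ground = rnorm(n, 80, 19),
  height       = rnorm(n, 30, 10),
  no_pasg      = round(rnorm(n, 60, 7))
)
# Simulated response: depends on speed_ground and height, not on no_pasg
toy$distance <- -1700 + 40 * toy$speed_ground + 10 * toy$height + rnorm(n, 0, 400)

predictors <- setdiff(names(toy), "distance")

# Fit distance ~ v for each predictor v and collect slope and p-value
rows <- lapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "distance"), data = toy)
  est <- summary(fit)$coefficients
  data.frame(Attribute = v,
             Beta1     = est[2, "Estimate"],
             Beta1Pvl  = est[2, "Pr(>|t|)"])
})

table2 <- do.call(rbind, rows)
table2$Sign <- ifelse(table2$Beta1 < 0, "Negative", "Positive")
table2 <- table2[order(table2$Beta1Pvl), ]  # most significant first
print(table2)
```

Formatting such as "<.0001" can then be applied only when the table is printed, which keeps the ranking independent of how the p-values are displayed.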

4.3 Scaled variable Analysis

STEP 14

Standardize each X variable. In other words, create a new variable X’= {X-mean(X)}/sd(X). The mean of X’ is 0 and its standard deviation is 1. Regress Y (landing distance) on each of the X’ variables. Provide a table that ranks the factors based on the size of the regression coefficient. The larger the size, the more important the factor. This table contains three columns: the names of variables, the size of the regression coefficient, the direction of the regression coefficient (positive or negative). We call it Table 3.

# Create a copy of the data frame to hold the standardized predictors
FAA_U_RmvNA_SD<-FAA_U_RmvNA
# scale() centres each variable at 0 and rescales it to unit standard deviation

# aircraft is a 0/1 factor and is left unscaled
FAA_U_RmvNA_SD$duration<-round(scale(FAA_U_RmvNA$duration),3)
FAA_U_RmvNA_SD$no_pasg<-round(scale(FAA_U_RmvNA$no_pasg),3)
FAA_U_RmvNA_SD$speed_ground<-round(scale(FAA_U_RmvNA$speed_ground),3)
FAA_U_RmvNA_SD$speed_air<-round(scale(FAA_U_RmvNA$speed_air),3)
FAA_U_RmvNA_SD$height<-round(scale(FAA_U_RmvNA$height),3)
FAA_U_RmvNA_SD$pitch<-round(scale(FAA_U_RmvNA$pitch),3)
# distance (the response) is left unscaled

fits1<-lm(distance~aircraft,FAA_U_RmvNA_SD)#not scaled
fits2<-lm(distance~duration,FAA_U_RmvNA_SD)
fits3<-lm(distance~no_pasg,FAA_U_RmvNA_SD)
fits4<-lm(distance~speed_ground,FAA_U_RmvNA_SD)
fits5<-lm(distance~speed_air,FAA_U_RmvNA_SD)
fits6<-lm(distance~height,FAA_U_RmvNA_SD)
fits7<-lm(distance~pitch,FAA_U_RmvNA_SD)

intes <- vector("numeric")
pvalues <- vector("numeric")
coeffs<-vector("numeric")
icoeffs<-vector("numeric")

#Intercept:
intes[1]<-summary(fits1)$coefficients[1,4]
intes[2]<-summary(fits2)$coefficients[1,4]
intes[3]<-summary(fits3)$coefficients[1,4] 
intes[4]<-summary(fits4)$coefficients[1,4] 
intes[5]<-summary(fits5)$coefficients[1,4] 
intes[6]<-summary(fits6)$coefficients[1,4] 
intes[7]<-summary(fits7)$coefficients[1,4] 

pvalues[1]<-summary(fits1)$coefficients[2,4]
pvalues[2]<-summary(fits2)$coefficients[2,4]
pvalues[3]<-summary(fits3)$coefficients[2,4] 
pvalues[4]<-summary(fits4)$coefficients[2,4] 
pvalues[5]<-summary(fits5)$coefficients[2,4] 
pvalues[6]<-summary(fits6)$coefficients[2,4] 
pvalues[7]<-summary(fits7)$coefficients[2,4] 

coeffs[1]<-summary(fits1)$coefficients[2,1]
coeffs[2]<-summary(fits2)$coefficients[2,1]
coeffs[3]<-summary(fits3)$coefficients[2,1] 
coeffs[4]<-summary(fits4)$coefficients[2,1] 
coeffs[5]<-summary(fits5)$coefficients[2,1] 
coeffs[6]<-summary(fits6)$coefficients[2,1] 
coeffs[7]<-summary(fits7)$coefficients[2,1] 

icoeffs[1]<-summary(fits1)$coefficients[1,1]
icoeffs[2]<-summary(fits2)$coefficients[1,1]
icoeffs[3]<-summary(fits3)$coefficients[1,1] 
icoeffs[4]<-summary(fits4)$coefficients[1,1] 
icoeffs[5]<-summary(fits5)$coefficients[1,1] 
icoeffs[6]<-summary(fits6)$coefficients[1,1] 
icoeffs[7]<-summary(fits7)$coefficients[1,1] 


sign_rcs<-vector("character")
sign_rcs<-ifelse(sign(coeffs)==-1,"Negative","Positive")

# Compare the raw p-values with the 0.0001 threshold (round() without digits would round them to 0 or 1)
df3<-data.frame(colnames(FAA_U_RmvNA)[1:7],ifelse(intes<.0001,'<.0001',round(intes,5)),ifelse(pvalues<0.0001,'<.0001',round(pvalues,5)),icoeffs,round(coeffs,2),sign_rcs)
colnames(df3)[1] <- "Attribute"
colnames(df3)[2] <- "Beta0Pvl"
colnames(df3)[3] <- "Beta1Pvl"
colnames(df3)[4] <- "Beta0"
colnames(df3)[5] <- "Beta1"
colnames(df3)[6] <- "Sign_Beta1"

arrange(df3,desc(Beta1))
##      Attribute Beta0Pvl Beta1Pvl    Beta0   Beta1 Sign_Beta1
## 1    speed_air   <.0001   <.0001 2784.512  782.94   Positive
## 2 speed_ground   <.0001   <.0001 1540.318  779.00   Positive
## 3       height   <.0001  0.00348 1540.335   94.24   Positive
## 4        pitch   <.0001  0.05743 1540.333   61.38   Positive
## 5      no_pasg   <.0001  0.64815 1540.333  -14.76   Negative
## 6     duration   <.0001  0.14569 1540.334  -47.03   Negative
## 7     aircraft   <.0001   <.0001 1748.152 -413.00   Negative

Insight:

  • In the standardized analysis, speed_air, speed_ground and height are significant, and pitch is borderline (p ≈ 0.057). aircraft, which is not standardized, is also significant.
  • This is in line with the earlier analysis of the unscaled variables. Standardizing a predictor does not change the p-value of its slope; the small differences seen here come from rounding the scaled values to three decimals.
  • After scaling, each coefficient is the expected change in landing distance (in feet) per one standard deviation of the predictor, so the coefficients are comparable across predictors. For example, the speed_ground slope changes from 40.89 to 779.00.
  • Because pitch is borderline, we treat it as insignificant for this analysis. More data would help to settle its relevance to landing distance.
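
The relationship between raw and standardized slopes can be checked directly: the standardized slope equals the raw slope multiplied by sd(X), and the intercept becomes the mean of Y over the rows used. The sketch below uses simulated data (x, y and the coefficients are assumptions for illustration).

```r
set.seed(7)
x <- rnorm(300, mean = 80, sd = 19)          # simulated predictor
y <- -1700 + 40 * x + rnorm(300, sd = 450)   # simulated response

raw_fit <- lm(y ~ x)

# Standardize: X' = (X - mean(X)) / sd(X)
z <- as.numeric(scale(x))
std_fit <- lm(y ~ z)

raw_slope <- unname(coef(raw_fit)["x"])
std_slope <- unname(coef(std_fit)["z"])

# The standardized slope is the raw slope times sd(x)
print(c(std_slope = std_slope, raw_times_sd = raw_slope * sd(x)))

# With a centred predictor, the intercept is the mean of the response
print(c(intercept = unname(coef(std_fit)[1]), mean_y = mean(y)))
```

By the same identity, the ratio 779.00 / 40.89 in the tables above suggests a standard deviation of about 19 mph for speed_ground. It would also explain why Beta0 for speed_air (2784.512) differs from the other intercepts: that fit uses only the rows where speed_air is observed.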

Analysis table

STEP 15

Compare Tables 1, 2 and 3. Are the results consistent? At this point, you will meet with an FAA agent again. Please provide a single table that ranks all the factors based on their relative importance in determining the landing distance. We call it Table 0.

# Table 0
# Rank the factors by the p-value of the slope (most significant first)
df4<-arrange(df3,Beta1Pvl)
# Keep the attribute name, the intercept (Beta0) and the slope (Beta1)
df0<-select(df4,Attribute,Beta0,Beta1)

4.4 Check collinearity

STEP 16

Compare the regression coefficients of the three models below:

  • Model 1: LD ~ Speed_ground
  • Model 2: LD ~ Speed_air
  • Model 3: LD ~ Speed_ground + Speed_air

model1<-lm(FAA_U_RmvNA$distance~FAA_U_RmvNA$speed_ground)
model2<-lm(FAA_U_RmvNA$distance~FAA_U_RmvNA$speed_air)
model3<-lm(FAA_U_RmvNA$distance~FAA_U_RmvNA$speed_ground+FAA_U_RmvNA$speed_air) # speed_ground is not significant once speed_air is included
summary(model1)
## 
## Call:
## lm(formula = FAA_U_RmvNA$distance ~ FAA_U_RmvNA$speed_ground)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -918.86 -322.87  -75.58  209.68 1900.61 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -1711.1215    70.3255  -24.33   <2e-16 ***
## FAA_U_RmvNA$speed_ground    40.8941     0.8602   47.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 458.2 on 781 degrees of freedom
## Multiple R-squared:  0.7432, Adjusted R-squared:  0.7429 
## F-statistic:  2260 on 1 and 781 DF,  p-value: < 2.2e-16
summary(model2)
## 
## Call:
## lm(formula = FAA_U_RmvNA$distance ~ FAA_U_RmvNA$speed_air)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -783.22 -189.61    2.73  215.76  623.27 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -5417.607    208.860  -25.94   <2e-16 ***
## FAA_U_RmvNA$speed_air    79.244      2.009   39.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.4 on 193 degrees of freedom
##   (588 observations deleted due to missingness)
## Multiple R-squared:  0.8897, Adjusted R-squared:  0.8891 
## F-statistic:  1556 on 1 and 193 DF,  p-value: < 2.2e-16
summary(model3)
## 
## Call:
## lm(formula = FAA_U_RmvNA$distance ~ FAA_U_RmvNA$speed_ground + 
##     FAA_U_RmvNA$speed_air)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -820.6 -182.0    7.7  204.2  633.0 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -5425.49     209.08 -25.950  < 2e-16 ***
## FAA_U_RmvNA$speed_ground   -12.32      12.98  -0.949    0.344    
## FAA_U_RmvNA$speed_air       91.63      13.20   6.941 5.82e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.5 on 192 degrees of freedom
##   (588 observations deleted due to missingness)
## Multiple R-squared:  0.8902, Adjusted R-squared:  0.889 
## F-statistic: 778.1 on 2 and 192 DF,  p-value: < 2.2e-16
# Collinearity check between speed_ground and speed_air
vif(model3)
## FAA_U_RmvNA$speed_ground    FAA_U_RmvNA$speed_air 
##                  43.1606                  43.1606
faa<-na.omit(FAA_U_RmvNA)
#correlation analysis
cor(faa$speed_ground,faa$speed_air)#99 percent
## [1] 0.9883475
cor(FAA_U_RmvNA$speed_ground,FAA_U_RmvNA$distance)#86
## [1] 0.8620846
cor(faa$distance,faa$speed_air)#94
## [1] 0.943219
Model_lm_sa <- lm(distance ~ aircraft+speed_ground+speed_air+height, data=FAA_U_RmvNA) 
summary(Model_lm_sa) # speed_ground is insignificant here
## 
## Call:
## lm(formula = distance ~ aircraft + speed_ground + speed_air + 
##     height, data = FAA_U_RmvNA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -288.21  -94.55   12.00   85.48  336.21 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -5955.217    109.297 -54.486   <2e-16 ***
## aircraft1     -433.427     19.806 -21.884   <2e-16 ***
## speed_ground    -3.791      6.327  -0.599     0.55    
## speed_air       85.846      6.430  13.351   <2e-16 ***
## height          13.750      1.036  13.268   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 134.5 on 190 degrees of freedom
##   (588 observations deleted due to missingness)
## Multiple R-squared:  0.9743, Adjusted R-squared:  0.9737 
## F-statistic:  1800 on 4 and 190 DF,  p-value: < 2.2e-16
Model_lm <- lm(distance ~ aircraft+speed_ground+height, data=FAA_U_RmvNA) 
summary(Model_lm)
## 
## Call:
## lm(formula = distance ~ aircraft + speed_ground + height, data = FAA_U_RmvNA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -718.73 -228.46  -93.75  124.88 1785.42 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1965.076     70.548  -27.85   <2e-16 ***
## aircraft1     -503.901     25.791  -19.54   <2e-16 ***
## speed_ground    41.941      0.678   61.86   <2e-16 ***
## height          13.938      1.325   10.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 360.2 on 779 degrees of freedom
## Multiple R-squared:  0.8417, Adjusted R-squared:  0.8411 
## F-statistic:  1380 on 3 and 779 DF,  p-value: < 2.2e-16
Model_lm_sg <- lm(distance ~ aircraft+speed_air+height, data=FAA_U_RmvNA) 
summary(Model_lm_sg)
## 
## Call:
## lm(formula = distance ~ aircraft + speed_air + height, data = FAA_U_RmvNA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -293.22  -93.83   15.35   90.05  332.84 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5954.3835   109.1050  -54.58   <2e-16 ***
## aircraft1    -433.7406    19.7657  -21.94   <2e-16 ***
## speed_air      82.0393     0.9827   83.48   <2e-16 ***
## height         13.7913     1.0324   13.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 134.3 on 191 degrees of freedom
##   (588 observations deleted due to missingness)
## Multiple R-squared:  0.9742, Adjusted R-squared:  0.9738 
## F-statistic:  2408 on 3 and 191 DF,  p-value: < 2.2e-16

Insight:

  • Fitted alone, the coefficient for speed_ground (40.89) is approximately half the coefficient for speed_air (79.24).
  • When both predictors are included together, the coefficient of speed_ground turns negative (-12.32, p = 0.344) while speed_air stays positive and significant.
  • vif(): both speeds have a VIF of about 43, which indicates severe multicollinearity.
  • cor(): the two speeds are almost perfectly correlated (0.99).
  • In the model with the predictors aircraft, speed_ground, speed_air and height, speed_ground is insignificant (p = 0.55).
  • Removing speed_ground from that model raises adjusted R2 slightly (0.9737 to 0.9738). So we can drop speed_ground from the analysis on two grounds: 1) it is almost perfectly correlated with speed_air, which is itself more strongly correlated with landing distance (0.94, against 0.86 for speed_ground); 2) its p-value is insignificant in the full model.
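With only two predictors, the VIF reported by vif() can be checked by hand, because each predictor's VIF equals 1 / (1 - r^2), where r is the correlation between the two. A minimal sketch, using the correlation printed by cor() above; the names r and vif_by_hand are illustrative only:

```r
# Correlation between speed_ground and speed_air, copied from the cor() output
r <- 0.9883475

# With exactly two predictors, regressing one on the other gives R-squared = r^2
vif_by_hand <- 1 / (1 - r^2)
round(vif_by_hand, 2)
```

This comes out at roughly 43.16, in line with the vif(model3) output.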

4.5 Final model

based on significance test

Suppose in Table 0, the variable ranking is as follows: X1, X2, X3… Please fit the following six models:

Model 1: LD ~ X1
Model 2: LD ~ X1 + X2
Model 3: LD ~ X1 + X2 + X3

Calculate the R-squared for each model. Plot these R-squared values versus the number of variables p. What patterns do you observe?

Step 18. Repeat Step 17 but use adjusted R-squared values instead.

Step 19. Repeat Step 17 but use AIC values instead.

Step 20. Compare the results in Steps 18-19. What variables would you select to build a predictive model for LD?

STEP 17-20

ModelCheck1<-lm(distance ~ aircraft, data=FAA_U_RmvNA) 
ModelCheck2<-lm(distance ~ aircraft+speed_ground, data=FAA_U_RmvNA) 
ModelCheck3<-lm(distance ~ aircraft+speed_ground+speed_air, data=FAA_U_RmvNA) 
ModelCheck4<-lm(distance ~ aircraft+speed_ground+speed_air+height, data=FAA_U_RmvNA) 
ModelCheck5<-lm(distance ~ aircraft+speed_ground+speed_air+height+pitch, data=FAA_U_RmvNA) 
ModelCheck6<-lm(distance ~ aircraft+speed_ground+speed_air+height+pitch+duration, data=FAA_U_RmvNA) 
ModelCheck7<-lm(distance ~ aircraft+speed_ground+speed_air+height+pitch+duration+no_pasg, data=FAA_U_RmvNA) 

r2<-vector("numeric")
r2[1]<-summary(ModelCheck1)$r.squared
r2[2]<-summary(ModelCheck2)$r.squared
r2[3]<-summary(ModelCheck3)$r.squared
r2[4]<-summary(ModelCheck4)$r.squared
r2[5]<-summary(ModelCheck5)$r.squared
r2[6]<-summary(ModelCheck6)$r.squared
r2[7]<-summary(ModelCheck7)$r.squared
round(r2,3)
## [1] 0.052 0.819 0.950 0.974 0.974 0.974 0.975
nv<-c(1,2,3,4,5,6,7)

ar2<-vector("numeric")
ar2[1]<-summary(ModelCheck1)$adj.r.squared
ar2[2]<-summary(ModelCheck2)$adj.r.squared
ar2[3]<-summary(ModelCheck3)$adj.r.squared
ar2[4]<-summary(ModelCheck4)$adj.r.squared
ar2[5]<-summary(ModelCheck5)$adj.r.squared
ar2[6]<-summary(ModelCheck6)$adj.r.squared
ar2[7]<-summary(ModelCheck7)$adj.r.squared

aic<-vector("numeric")
aic[1]<-AIC(ModelCheck1)
aic[2]<-AIC(ModelCheck2)
aic[3]<-AIC(ModelCheck3)
aic[4]<-AIC(ModelCheck4)
aic[5]<-AIC(ModelCheck5)
aic[6]<-AIC(ModelCheck6)
aic[7]<-AIC(ModelCheck7)



paste(round(r2,5)*100,'%')
## [1] "5.229 %"  "81.919 %" "95.046 %" "97.429 %" "97.436 %" "97.443 %"
## [7] "97.471 %"
paste(round(ar2,5)*100,'%')
## [1] "5.108 %"  "81.873 %" "94.968 %" "97.375 %" "97.368 %" "97.362 %"
## [7] "97.376 %"
paste(round(aic,3))
## [1] "12843.84"  "11548.698" "2597.804"  "2471.938"  "2473.383"  "2474.836" 
## [7] "2474.692"
par(mfrow=c(1,3))
# Number of predictors on the x-axis, the fit statistic on the y-axis
plot(nv,r2,col="blue",xlab="No. of predictors",ylab="R2",main="R2 versus number of predictors",pch = 19, cex = 1)
plot(nv,ar2,col="blue",xlab="No. of predictors",ylab="Adj R2",main="Adjusted R2 versus number of predictors",pch = 19, cex = 1)
plot(nv,aic,col="blue",xlab="No. of predictors",ylab="AIC",main="AIC versus number of predictors",pch = 19, cex = 1)
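Adjusted R-squared can be recomputed from R-squared as 1 - (1 - R2)(n - 1)/(n - p - 1), which shows why it can fall when a weak predictor is added. A small sketch using the R-squared values printed above for models 4 and 5, and assuming n = 195 (the rows left after the 588 deletions reported in the summaries); adj_r2 is a helper defined here for illustration:

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 195  # assumed: rows with non-missing speed_air

adj_model4 <- adj_r2(0.97429, n, p = 4)
adj_model5 <- adj_r2(0.97436, n, p = 5)

round(c(model4 = adj_model4, model5 = adj_model5), 5)
```

R-squared rises from model 4 to model 5, yet the penalty for the extra predictor makes the adjusted value slightly lower, matching the printed 97.375 % and 97.368 %.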

Insight:

R-squared never decreases as predictors are added. It rises steeply up to model 4 (97.43%) and is essentially flat afterwards, which gives the curve seen in the graph.

Adjusted R-squared follows the same curve up to model 4 (97.375%), where all relevant and significant predictors are included. After that it dips slightly for models 5 and 6 and ends at 97.376% for model 7, so the additional predictors bring practically no improvement.

AIC: AIC decreases considerably as the first predictors are added and reaches its minimum at model 4 (2471.9). It rises again for models 5 to 7. A lower AIC signifies a better model, so this is in line with our analysis. Note that models 1 and 2 are fitted on all 783 rows, whereas models 3 to 7 use only the 195 rows where speed_air is available, so the large drop in AIC between model 2 and model 3 mostly reflects the smaller sample rather than a better fit.

Conclusion: After our analysis using correlation, R2, adjusted R2 and AIC on the FAA data, the predictors selected are:

aircraft , speed_ground , speed_air , height

We should also mention the interim analysis of speed_ground and speed_air, which found that:

In a model with the predictors aircraft, speed_ground, speed_air and height, speed_ground is insignificant. Removing speed_ground from the model slightly improves adjusted R2. So we can remove speed_ground from our analysis on two grounds: 1) it is almost perfectly correlated with speed_air (0.99), which is itself more strongly correlated with landing distance; 2) its p-value is insignificant in the full model.

Using speed_air as the predictor is a judgement call. Because speed_air has many missing values, using speed_ground instead is also acceptable.

Hence, our final model is :

aircraft , speed_air , height
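One thing to keep in mind when comparing the ModelCheck fits: lm() drops rows with missing predictors, so models with and without speed_air are fitted on different numbers of rows and their AIC values are not directly comparable. The sketch below illustrates this on synthetic data. The data frame sim, its coefficients and the rule that makes speed_air missing are all assumptions made up for the illustration, not the FAA data.

```r
set.seed(1)

# Synthetic stand-in for the FAA data (all values are made up)
n <- 783
sim <- data.frame(speed_ground = rnorm(n, mean = 80, sd = 19))
sim$speed_air <- sim$speed_ground + rnorm(n, sd = 1.5)
sim$speed_air[sim$speed_air < 90] <- NA   # assumed missingness pattern
sim$distance <- -1700 + 41 * sim$speed_ground + rnorm(n, sd = 450)

# Same formula, different rows: all rows versus complete cases only
m_all <- lm(distance ~ speed_ground, data = sim)
sim_cc <- na.omit(sim)
m_cc <- lm(distance ~ speed_ground, data = sim_cc)

# Adding speed_air silently restricts the fit to the complete cases
m_air <- lm(distance ~ speed_ground + speed_air, data = sim)

c(all = nobs(m_all), complete = nobs(m_cc), with_air = nobs(m_air))
c(all = AIC(m_all), complete = AIC(m_cc))
```

The same formula gives a much smaller AIC on the complete cases simply because fewer rows enter the likelihood. Fitting every candidate on the na.omit() data keeps the comparison like for like.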

Automatic Variable selection

This mechanism requires removing all observations with missing values. Because the speed_air column is missing for about 75% of observations (588 of 783), all of those rows are removed from the analysis.

Had we chosen to drop the speed_air column after the collinearity analysis and used speed_ground in its place, this decrease in sample size would not occur.

Remember, we chose speed_air because it is more highly correlated with landing distance.

STEP 21

Use the R function “StepAIC” to perform forward variable selection. Compare the result with that in Step 19.

Using both speed ground & speed air

# Drop rows with missing values so that every candidate model is fitted on the same rows
model_data <- drop_na(FAA_U_RmvNA) 
# Null model - regress landing distance on only the intercept
nullmodel=lm(distance~1, data=model_data)
# Full model - regress landing distance on all predictor variables
fullmodel=lm(distance~aircraft+speed_ground+speed_air+height+pitch+duration+no_pasg, data=model_data) 
# Final model built using stepwise variable selection (terms can be added or dropped at each step)
model_pred_FAA <- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel),
                         direction='both')
## Start:  AIC=2622.4
## distance ~ 1
## 
##                Df Sum of Sq       RSS    AIC
## + speed_air     1 118926290  14749519 2194.6
## + speed_ground  1 115311069  18364740 2237.3
## + aircraft      1   3978580 129697229 2618.5
## <none>                      133675809 2622.4
## + height        1    445916 133229893 2623.7
## + duration      1    367280 133308529 2623.9
## + pitch         1    154735 133521074 2624.2
## + no_pasg       1    141913 133533895 2624.2
## 
## Step:  AIC=2194.58
## distance ~ speed_air
## 
##                Df Sum of Sq       RSS    AIC
## + aircraft      1   8088045   6661474 2041.6
## + height        1   2623377  12126142 2158.4
## + pitch         1    847903  13901616 2185.0
## <none>                       14749519 2194.6
## + no_pasg       1    142098  14607421 2194.7
## + speed_ground  1     68916  14680603 2195.7
## + duration      1     14495  14735024 2196.4
## - speed_air     1 118926290 133675809 2622.4
## 
## Step:  AIC=2041.58
## distance ~ speed_air + aircraft
## 
##                Df Sum of Sq       RSS    AIC
## + height        1   3217699   3443775 1914.9
## <none>                        6661474 2041.6
## + duration      1     61424   6600051 2041.8
## + no_pasg       1     47410   6614064 2042.2
## + speed_ground  1     39437   6622037 2042.4
## + pitch         1     14544   6646931 2043.2
## - aircraft      1   8088045  14749519 2194.6
## - speed_air     1 123035754 129697229 2618.5
## 
## Step:  AIC=1914.92
## distance ~ speed_air + aircraft + height
## 
##                Df Sum of Sq       RSS    AIC
## + no_pasg       1     39886   3403889 1914.7
## <none>                        3443775 1914.9
## + duration      1     12694   3431082 1916.2
## + pitch         1      8137   3435638 1916.5
## + speed_ground  1      6494   3437282 1916.5
## - height        1   3217699   6661474 2041.6
## - aircraft      1   8682367  12126142 2158.4
## - speed_air     1 125653520 129097295 2619.6
## 
## Step:  AIC=1914.65
## distance ~ speed_air + aircraft + height + no_pasg
## 
##                Df Sum of Sq       RSS    AIC
## <none>                        3403889 1914.7
## - no_pasg       1     39886   3443775 1914.9
## + duration      1      9734   3394155 1916.1
## + pitch         1      8832   3395057 1916.1
## + speed_ground  1      5824   3398066 1916.3
## - height        1   3210175   6614064 2042.2
## - aircraft      1   8588152  11992041 2158.2
## - speed_air     1 125626785 129030674 2621.5

INSIGHT:

The automatic method selects: distance ~ speed_air + aircraft + height + no_pasg. Adding no_pasg lowers the AIC only marginally (1914.9 to 1914.7).
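Step 21 asks for forward selection, whereas the call above uses direction='both'. A minimal sketch of a forward-only search with step() follows. It runs on synthetic data: the data frame sim, its coefficients and the factor levels are made up for illustration and are not the FAA data.

```r
set.seed(42)

# Synthetic data: distance depends on speed_air, aircraft and height only;
# pitch and no_pasg are pure noise (all values are made up)
n <- 200
sim <- data.frame(
  aircraft  = factor(sample(c("airbus", "boeing"), n, replace = TRUE)),
  speed_air = rnorm(n, mean = 104, sd = 10),
  height    = rnorm(n, mean = 30, sd = 10),
  pitch     = rnorm(n, mean = 4, sd = 0.5),
  no_pasg   = round(rnorm(n, mean = 60, sd = 7))
)
sim$distance <- -6000 + 82 * sim$speed_air + 14 * sim$height +
  430 * (sim$aircraft == "boeing") + rnorm(n, sd = 135)

null_fit <- lm(distance ~ 1, data = sim)

# Forward selection: start from the intercept and only ever add terms
fwd <- step(null_fit,
            scope = list(lower = ~ 1,
                         upper = ~ aircraft + speed_air + height + pitch + no_pasg),
            direction = "forward", trace = 0)

selected <- attr(terms(fwd), "term.labels")
selected
```

MASS::stepAIC accepts the same object, scope and direction arguments, so the call can be swapped in if the MASS package is preferred.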

Dropping speed air

# speed_air is excluded from the candidate set, so the full data set is used without dropping rows
model_data <- FAA_U_RmvNA
# Null model - regress landing distance on only the intercept
nullmodel=lm(distance~1, data=model_data)
# Full model - regress landing distance on all predictor variables except speed_air
fullmodel=lm(distance~aircraft+speed_ground+height+pitch+duration+no_pasg, data=model_data) 
# Final model built using stepwise variable selection (terms can be added or dropped at each step)
model_pred_FAA <- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel),
                         direction='both')
## Start:  AIC=10659.83
## distance ~ 1
## 
##                Df Sum of Sq       RSS     AIC
## + speed_ground  1 474544306 163979281  9597.4
## + aircraft      1  33387572 605136015 10619.8
## + height        1   6947258 631576329 10653.3
## + pitch         1   2946920 635576667 10658.2
## + duration      1   1728559 636795028 10659.7
## <none>                      638523587 10659.8
## + no_pasg       1    170040 638353546 10661.6
## 
## Step:  AIC=9597.41
## distance ~ speed_ground
## 
##                Df Sum of Sq       RSS     AIC
## + aircraft      1  48531001 115448279  9324.6
## + height        1  13348111 150631170  9532.9
## + pitch         1   8648983 155330297  9557.0
## <none>                      163979281  9597.4
## + no_pasg       1    263028 163716252  9598.2
## + duration      1     36384 163942897  9599.2
## - speed_ground  1 474544306 638523587 10659.8
## 
## Step:  AIC=9324.64
## distance ~ speed_ground + aircraft
## 
##                Df Sum of Sq       RSS     AIC
## + height        1  14355255 101093024  9222.7
## <none>                      115448279  9324.6
## + pitch         1    207422 115240857  9325.2
## + no_pasg       1     94868 115353412  9326.0
## + duration      1     16936 115431343  9326.5
## - aircraft      1  48531001 163979281  9597.4
## - speed_ground  1 489687736 605136015 10619.8
## 
## Step:  AIC=9222.67
## distance ~ speed_ground + aircraft + height
## 
##                Df Sum of Sq       RSS     AIC
## <none>                      101093024  9222.7
## + no_pasg       1    205566 100887458  9223.1
## + pitch         1     90919 101002105  9224.0
## + duration      1     10794 101082231  9224.6
## - height        1  14355255 115448279  9324.6
## - aircraft      1  49538146 150631170  9532.9
## - speed_ground  1 496573808 597666832 10612.1

INSIGHT:

The automatic method selects: distance ~ speed_ground + aircraft + height. Adding no_pasg would raise the AIC from 9222.7 to 9223.1, so it is not included.
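The AIC values printed by step() are on a different scale from those returned by AIC() in Steps 17-20. For linear models, step() relies on extractAIC(), which computes n * log(RSS / n) + 2 * k and leaves out the constant part of the log-likelihood (k is the number of estimated coefficients). A short sketch that reproduces printed values from the RSS column; step_aic is an illustrative helper, and n = 195 and n = 783 are the row counts implied by the outputs above:

```r
# AIC as reported in step() traces for a linear model
step_aic <- function(rss, n, k) n * log(rss / n) + 2 * k

# Intercept-only model on the 195 complete cases
aic_null_cc <- step_aic(133675809, n = 195, k = 1)

# speed_air + aircraft + height (intercept plus three coefficients)
aic_final_cc <- step_aic(3443775, n = 195, k = 4)

# Intercept-only model on all 783 rows
aic_null_all <- step_aic(638523587, n = 783, k = 1)

round(c(aic_null_cc, aic_final_cc, aic_null_all), 1)
```

These come out at about 2622.4, 1914.9 and 10659.8, the values shown in the two traces. Because n enters the formula directly, AIC values from the 195-row and 783-row runs should not be compared with each other.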

5.0 Conclusion

The manual selection process suggested that the significant predictors are:

aircraft , speed_air , height

The automatic method gives:

distance ~ speed_air + aircraft + height + no_pasg

The two disagree on no_pasg, so let's fit this new model and check the effect of adding the no_pasg variable.

model_corr<-lm(distance ~ speed_air + aircraft + height + no_pasg,data=FAA_U_RmvNA)
summary(model_corr)
## 
## Call:
## lm(formula = distance ~ speed_air + aircraft + height + no_pasg, 
##     data = FAA_U_RmvNA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -290.38  -89.86    8.45   85.94  358.59 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5831.6802   136.3471 -42.771   <2e-16 ***
## speed_air      82.0317     0.9796  83.739   <2e-16 ***
## aircraft1    -432.0737    19.7342 -21.895   <2e-16 ***
## height         13.7758     1.0291  13.386   <2e-16 ***
## no_pasg        -2.0410     1.3679  -1.492    0.137    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 133.8 on 190 degrees of freedom
##   (588 observations deleted due to missingness)
## Multiple R-squared:  0.9745, Adjusted R-squared:  0.974 
## F-statistic:  1818 on 4 and 190 DF,  p-value: < 2.2e-16

Insight:

no_pasg is not significant (p = 0.137), and adding it raises adjusted R-squared only slightly (0.9738 to 0.974). Hence it is better not to include this predictor in our model. The final model is: Landing Distance ~ aircraft, speed_air, height

Model_lm_sg <- lm(distance ~ aircraft+speed_air+height, data=FAA_U_RmvNA) 
summary(Model_lm_sg)
## 
## Call:
## lm(formula = distance ~ aircraft + speed_air + height, data = FAA_U_RmvNA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -293.22  -93.83   15.35   90.05  332.84 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5954.3835   109.1050  -54.58   <2e-16 ***
## aircraft1    -433.7406    19.7657  -21.94   <2e-16 ***
## speed_air      82.0393     0.9827   83.48   <2e-16 ***
## height         13.7913     1.0324   13.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 134.3 on 191 degrees of freedom
##   (588 observations deleted due to missingness)
## Multiple R-squared:  0.9742, Adjusted R-squared:  0.9738 
## F-statistic:  2408 on 3 and 191 DF,  p-value: < 2.2e-16
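The check on no_pasg can also be phrased as a comparison of two nested models with anova(). When a single numeric term is added, the partial F-test gives the same p-value as the t-test in the summary of the larger model. A minimal sketch on synthetic data; the data frame sim and its coefficients are made up for illustration:

```r
set.seed(7)

# Synthetic data in which no_pasg has no effect on distance (made-up values)
n <- 195
sim <- data.frame(
  speed_air = rnorm(n, mean = 104, sd = 10),
  height    = rnorm(n, mean = 30, sd = 10),
  no_pasg   = round(rnorm(n, mean = 60, sd = 7))
)
sim$distance <- -6000 + 82 * sim$speed_air + 14 * sim$height + rnorm(n, sd = 135)

small <- lm(distance ~ speed_air + height, data = sim)
big   <- lm(distance ~ speed_air + height + no_pasg, data = sim)

# Partial F-test for the single added term
p_f <- anova(small, big)[["Pr(>F)"]][2]

# t-test p-value for the same term in the larger model
p_t <- summary(big)$coefficients["no_pasg", "Pr(>|t|)"]

c(partial_F = p_f, t_test = p_t)
```

Both models must be fitted on the same rows for anova() to compare them, which is another reason to remove missing values before fitting.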