Background: Flight landing.

Motivation: To reduce the risk of landing overrun.

Goal: To study which factors affect the landing distance of a commercial flight, and how.

Data: Landing data (landing distance and other parameters) from 950 commercial flights (not a real data set, but simulated from statistical models). See the two Excel files ‘FAA-1.xls’ (800 flights) and ‘FAA-2.xls’ (150 flights).

Variable dictionary:

Aircraft: The make of an aircraft (Boeing or Airbus).

Duration (in minutes): Flight duration between taking off and landing. The duration of a normal flight should always be greater than 40min.

No_pasg: The number of passengers in a flight.

Speed_ground (in miles per hour): The ground speed of an aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal.

Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal.

Height (in meters): The height of an aircraft when it is passing over the threshold of the runway. The landing aircraft is required to be at least 6 meters high at the threshold of the runway.

Pitch (in degrees): Pitch angle of an aircraft when it is passing over the threshold of the runway.

Distance (in feet): The landing distance of an aircraft. More specifically, it refers to the distance between the threshold of the runway and the point where the aircraft can be fully stopped. The length of the airport runway is typically less than 6000 feet.

Part 1. Practice of modeling the landing distance using linear regression.

Please write R programs to complete the following steps. In each step, provide

o The R code (how do you realize it?)

o The R output (Copy and paste only those relevant)

o Your observations (What do you observe from the output?)

o Your conclusion/decision

Initial exploration of the data

Step 1. Read the two files ‘FAA-1.xls’ (800 flights) and ‘FAA-2.xls’ into your R system. Please search “Read Excel files from R” in Google in case you do not know how to do that.

The code reads CSV versions of the two files (FAA1.csv and FAA2.csv) with read.csv().

setwd("C:/Users/yizha/Desktop/Spring 2020/Statistical Modeling/Week-1")
FAA1 <- read.csv('FAA1.csv',header = T)
FAA2 <- read.csv('FAA2.csv',header = T)
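
The column name `ï..aircraft` that appears in the output of the next steps is what a UTF-8 byte-order mark at the start of a CSV file typically looks like when the encoding is not declared. The sketch below is only an illustration under that assumption: the temporary file and its two rows are made up, and `fileEncoding = "UTF-8-BOM"` is the standard `read.csv()` argument for this case.

```r
# Hypothetical demo file: a tiny CSV that starts with a UTF-8 byte-order mark (BOM)
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
csv_text <- "aircraft,distance\nboeing,3370\nairbus,1145\n"
tmp <- tempfile(fileext = ".csv")
writeBin(c(bom, charToRaw(csv_text)), tmp)

# Declaring the encoding lets R drop the BOM, so the first header stays clean
demo <- read.csv(tmp, header = TRUE, fileEncoding = "UTF-8-BOM")
print(names(demo))
```

With the real files, the same argument would be added to the two read.csv() calls, and the first column could then be referred to simply as aircraft.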

Step 2. Check the structure of each data set using the “str” function. For each data set, what is the sample size and how many variables? Is there any difference between the two data sets?

str(FAA1)
## 'data.frame':    800 obs. of  8 variables:
##  $ ï..aircraft : Factor w/ 2 levels "airbus","boeing": 2 2 2 2 2 2 2 2 2 2 ...
##  $ duration    : num  98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : int  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...
str(FAA2)
## 'data.frame':    150 obs. of  7 variables:
##  $ ï..aircraft : Factor w/ 2 levels "airbus","boeing": 2 2 2 2 2 2 2 2 2 2 ...
##  $ no_pasg     : int  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...

FAA1 has a sample size of 800 and 8 variables; FAA2 has a sample size of 150 and 7 variables. The difference between the two data sets is that FAA2 does not have the variable ‘duration’.

Step 3. Merge the two data sets. Are there any duplications? Search “check duplicates in r” if you do not know how to check duplications. If the answer is “Yes”, what action would you take?

I use merge() with all = TRUE (a full outer join) to merge the two data sets.

FAA <- merge(FAA1,FAA2,all = TRUE)
str(FAA)
## 'data.frame':    850 obs. of  8 variables:
##  $ ï..aircraft : Factor w/ 2 levels "airbus","boeing": 1 1 1 1 1 1 1 1 1 1 ...
##  $ no_pasg     : int  36 38 40 41 43 44 45 45 45 45 ...
##  $ speed_ground: num  47.5 85.2 80.6 97.6 82.5 ...
##  $ speed_air   : num  NA NA NA 97 NA ...
##  $ height      : num  14 37 28.6 38.4 30.1 ...
##  $ pitch       : num  4.3 4.12 3.62 3.53 4.09 ...
##  $ distance    : num  251 1257 1021 2168 1321 ...
##  $ duration    : num  172 188 93.5 123.3 109.2 ...

The two files contain 800 + 150 = 950 rows, but the merged data set FAA has 850 observations and 8 variables, so 100 observations were duplicated across the files. merge() collapsed these duplicates while joining.

sum(duplicated(FAA))
## [1] 0

There are no duplicates in the merged data set FAA; the duplicates were removed during the merge.
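
Because merge() joins on all shared columns, rows that agree on those columns are collapsed before duplicated() is ever called, so the count after the merge is always 0. A minimal sketch of how the duplicates could be counted explicitly instead, using two small made-up data frames (`a`, `b` and their values are hypothetical stand-ins for FAA1 and FAA2):

```r
# Hypothetical stand-ins: 'a' has a duration column, 'b' does not
a <- data.frame(aircraft = c("boeing", "airbus", "airbus"),
                duration = c(98.5, 125.7, 112.0),
                distance = c(3370, 2988, 1145))
b <- data.frame(aircraft = c("boeing", "airbus"),
                distance = c(3370, 900))

# Columns present in both data sets define what counts as "the same flight"
shared <- intersect(names(a), names(b))

# Stack the two data sets, filling the column missing from 'b' with NA
b$duration <- NA
stacked <- rbind(a, b[names(a)])

# Flag rows that repeat an earlier row on the shared columns
dup <- duplicated(stacked[shared])
n_dup <- sum(dup)
combined <- stacked[!dup, ]
print(n_dup)
```

This keeps the first copy of each flight (the one from `a`, which carries duration) and reports how many rows were dropped.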

Step 4. Check the structure of the combined data set. What is the sample size and how many variables? Provide summary statistics for each variable.

str(FAA)
## 'data.frame':    850 obs. of  8 variables:
##  $ ï..aircraft : Factor w/ 2 levels "airbus","boeing": 1 1 1 1 1 1 1 1 1 1 ...
##  $ no_pasg     : int  36 38 40 41 43 44 45 45 45 45 ...
##  $ speed_ground: num  47.5 85.2 80.6 97.6 82.5 ...
##  $ speed_air   : num  NA NA NA 97 NA ...
##  $ height      : num  14 37 28.6 38.4 30.1 ...
##  $ pitch       : num  4.3 4.12 3.62 3.53 4.09 ...
##  $ distance    : num  251 1257 1021 2168 1321 ...
##  $ duration    : num  172 188 93.5 123.3 109.2 ...

The combined data set has 850 observations and 8 variables. Summary statistics for each variable are as follows:

summary(FAA)
##  ï..aircraft     no_pasg      speed_ground      speed_air          height      
##  airbus:450   Min.   :29.0   Min.   : 27.74   Min.   : 90.00   Min.   :-3.546  
##  boeing:400   1st Qu.:55.0   1st Qu.: 65.90   1st Qu.: 96.25   1st Qu.:23.314  
##               Median :60.0   Median : 79.64   Median :101.15   Median :30.093  
##               Mean   :60.1   Mean   : 79.45   Mean   :103.80   Mean   :30.144  
##               3rd Qu.:65.0   3rd Qu.: 92.06   3rd Qu.:109.40   3rd Qu.:36.993  
##               Max.   :87.0   Max.   :141.22   Max.   :141.72   Max.   :59.946  
##                                               NA's   :642                      
##      pitch          distance          duration     
##  Min.   :2.284   Min.   :  34.08   Min.   : 14.76  
##  1st Qu.:3.642   1st Qu.: 883.79   1st Qu.:119.49  
##  Median :4.008   Median :1258.09   Median :153.95  
##  Mean   :4.009   Mean   :1526.02   Mean   :154.01  
##  3rd Qu.:4.377   3rd Qu.:1936.95   3rd Qu.:188.91  
##  Max.   :5.927   Max.   :6533.05   Max.   :305.62  
##                                    NA's   :50

Step 5. By now, if you are asked to prepare ONE presentation slide to summarize your findings, what observations will you bring to the attention of FAA agents? Please list no more than five using “bullet statements”, from the most important to the least important.

  1. The merged data set contains 850 unique observations and 8 variables, but it has many missing values (642 in speed_air and 50 in duration).

  2. The data cover 450 Airbus and 400 Boeing flights; the maximum landing distance is 6533.05 feet, which exceeds the typical runway length of 6000 feet.

  3. Flight duration should be greater than 40 minutes, but the minimum is only 14.76 minutes; 50 values are missing.

  4. speed_ground and speed_air should lie between 30 MPH and 140 MPH, but speed_ground ranges from 27.74 to 141.22 and the maximum speed_air is 141.72; speed_air has 642 missing values.

  5. The minimum height is -3.546 meters, which is an abnormal value for an aircraft at the threshold (the required minimum is 6 meters).

Data Cleaning and further exploration

Step 6. Are there abnormal values in the data set? Please refer to the variable dictionary for criteria defining “normal/abnormal” values. Remove the rows that contain any “abnormal values” and report how many rows you have removed.

There are abnormal values in the data set. I use the which() function to identify the abnormal rows:

which(FAA$duration<40)
## [1]  97 242 505 629 706
which(FAA$speed_ground<30)
## [1] 459 658
which(FAA$speed_ground>140)
## [1] 547
which(FAA$speed_air<30)
## integer(0)
which(FAA$speed_air>140)
## [1] 547
which(FAA$height<6)
##  [1] 164 260 377 431 688 726 794 806 828 834
which(FAA$distance>6000)
## [1] 547 743
# Drop the row indices flagged by the checks above (each index listed once)
FAA_Normal_temp<-FAA[-c(97,242,505,629,706,459,658,547,164,260,377,431,688,726,794,806,828,834,743), ]
str(FAA_Normal_temp)
## 'data.frame':    831 obs. of  8 variables:
##  $ ï..aircraft : Factor w/ 2 levels "airbus","boeing": 1 1 1 1 1 1 1 1 1 1 ...
##  $ no_pasg     : int  36 38 40 41 43 44 45 45 45 45 ...
##  $ speed_ground: num  47.5 85.2 80.6 97.6 82.5 ...
##  $ speed_air   : num  NA NA NA 97 NA ...
##  $ height      : num  14 37 28.6 38.4 30.1 ...
##  $ pitch       : num  4.3 4.12 3.62 3.53 4.09 ...
##  $ distance    : num  251 1257 1021 2168 1321 ...
##  $ duration    : num  172 188 93.5 123.3 109.2 ...

19 abnormal rows were removed from the sample.
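
Typing row indices by hand is easy to get wrong, so the same rules can also be written as one logical filter. The following is a minimal sketch on a made-up five-row data frame (`flights` and the helper `violates()` are hypothetical names); missing values are treated as "not abnormal" so that rows with NA in duration or speed_air are kept, as in the analysis above.

```r
# Hypothetical stand-in for the merged data set
flights <- data.frame(
  duration     = c(98.5, 35.0, NA, 120.0, 150.2),
  speed_ground = c(107.9, 80.0, 71.1, 145.0, 85.8),
  speed_air    = c(109.0, NA, NA, 146.0, NA),
  height       = c(27.4, 30.0, 18.6, 25.0, 4.0),
  distance     = c(3370, 1500, 1145, 6500, 1664)
)

# TRUE only when a rule is violated; a missing value is not counted as abnormal
violates <- function(cond) !is.na(cond) & cond

abnormal <- with(flights,
  violates(duration < 40) |
  violates(speed_ground < 30 | speed_ground > 140) |
  violates(speed_air < 30 | speed_air > 140) |
  violates(height < 6) |
  violates(distance > 6000))

flights_clean <- flights[!abnormal, ]
n_removed <- sum(abnormal)
print(n_removed)
```

The count of removed rows comes directly from the filter, so it cannot drift out of sync with the list of indices.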

Step 7. Repeat Step 4: check the structure of the combined data set. What is the sample size and how many variables? Provide summary statistics for each variable.

str(FAA_Normal_temp)
## 'data.frame':    831 obs. of  8 variables:
##  $ ï..aircraft : Factor w/ 2 levels "airbus","boeing": 1 1 1 1 1 1 1 1 1 1 ...
##  $ no_pasg     : int  36 38 40 41 43 44 45 45 45 45 ...
##  $ speed_ground: num  47.5 85.2 80.6 97.6 82.5 ...
##  $ speed_air   : num  NA NA NA 97 NA ...
##  $ height      : num  14 37 28.6 38.4 30.1 ...
##  $ pitch       : num  4.3 4.12 3.62 3.53 4.09 ...
##  $ distance    : num  251 1257 1021 2168 1321 ...
##  $ duration    : num  172 188 93.5 123.3 109.2 ...
summary(FAA_Normal_temp)
##  ï..aircraft     no_pasg       speed_ground      speed_air     
##  airbus:444   Min.   :29.00   Min.   : 33.57   Min.   : 90.00  
##  boeing:387   1st Qu.:55.00   1st Qu.: 66.20   1st Qu.: 96.23  
##               Median :60.00   Median : 79.79   Median :101.12  
##               Mean   :60.06   Mean   : 79.54   Mean   :103.48  
##               3rd Qu.:65.00   3rd Qu.: 91.91   3rd Qu.:109.36  
##               Max.   :87.00   Max.   :132.78   Max.   :132.91  
##                                                NA's   :628     
##      height           pitch          distance          duration     
##  Min.   : 6.228   Min.   :2.284   Min.   :  41.72   Min.   : 41.95  
##  1st Qu.:23.530   1st Qu.:3.640   1st Qu.: 893.28   1st Qu.:119.63  
##  Median :30.167   Median :4.001   Median :1262.15   Median :154.28  
##  Mean   :30.458   Mean   :4.005   Mean   :1522.48   Mean   :154.78  
##  3rd Qu.:37.004   3rd Qu.:4.370   3rd Qu.:1936.63   3rd Qu.:189.66  
##  Max.   :59.946   Max.   :5.927   Max.   :5381.96   Max.   :305.62  
##                                                     NA's   :50

After removing all abnormal rows, the new data set contains 831 observations of 8 variables; 19 abnormal rows were removed.

Step 8. Since you have a small set of variables, you may want to show histograms for all of them.
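
No code or figure is recorded for this step. A minimal sketch of how the histograms could be drawn is given here; the data frame `flights` is simulated purely so that the example runs on its own, and with the real data it would be replaced by FAA_Normal_temp.

```r
# Simulated stand-in data (made-up distributions, for illustration only)
set.seed(1)
flights <- data.frame(
  speed_ground = rnorm(200, mean = 80, sd = 19),
  height       = rnorm(200, mean = 30, sd = 10),
  pitch        = rnorm(200, mean = 4, sd = 0.5),
  distance     = rexp(200, rate = 1 / 1500)
)

# One histogram per numeric column, arranged in a 2 x 2 grid
numeric_cols <- names(flights)[sapply(flights, is.numeric)]
old_par <- par(mfrow = c(2, 2))
for (col in numeric_cols) {
  hist(flights[[col]], main = col, xlab = col)
}
par(old_par)
```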

Step 9. Prepare another presentation slide to summarize your findings drawn from the cleaned data set, using no more than five “bullet statements”.

  1. The cleaned data set contains 831 unique observations and 8 variables.

  2. The mean landing distance in the cleaned data set is 1522 feet; the minimum is 41.72 and the maximum is 5382.

  3. Duration ranges from 41.95 to 305.62 minutes, with 50 missing values.

  4. The means of speed_ground and speed_air are 79.54 and 103.48 MPH; speed_air has 628 missing values.

  5. All height values are now within the normal range (minimum 6.228 meters).

Initial analysis for identifying important factors that impact the response variable “landing distance”

Step 10. Compute the pairwise correlation between the landing distance and each factor X. Provide a table that ranks the factors based on the size (absolute value) of the correlation. This table contains three columns: the names of variables, the size of the correlation, the direction of the correlation (positive or negative). We call it Table 1, which will be used for comparison with our analysis later.

duration has 50 missing values; I impute them with the mean of the observed values.

FAA_Normal_temp$duration[is.na(FAA_Normal_temp$duration)] <- mean(FAA_Normal_temp$duration[!is.na(FAA_Normal_temp$duration)])

speed_air has 628 missing values. It is highly correlated with speed_ground: cor(speed_ground, speed_air, use = "complete.obs") = 0.9879383. The missing values are therefore imputed by regressing speed_air on speed_ground.

# Indicator: 1 where the value is observed, 0 where it is missing
Ind <- function(t)
{
  as.numeric(!is.na(t))
}

FAA_Normal_temp$imp <- Ind(FAA_Normal_temp$speed_air)
impute <- lm(FAA_Normal_temp$speed_air~FAA_Normal_temp$speed_ground)
summary(impute)
## 
## Call:
## lm(formula = FAA_Normal_temp$speed_air ~ FAA_Normal_temp$speed_ground)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2630 -0.9407  0.0009  1.1697  3.7466 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   2.92615    1.11678    2.62  0.00946 ** 
## FAA_Normal_temp$speed_ground  0.97242    0.01075   90.45  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.511 on 201 degrees of freedom
##   (628 observations deleted due to missingness)
## Multiple R-squared:  0.976,  Adjusted R-squared:  0.9759 
## F-statistic:  8182 on 1 and 201 DF,  p-value: < 2.2e-16

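The fitted regression is speed_air = 2.92615 + 0.97242 * speed_ground. The loop that follows fills in the missing speed_air values with the hand-typed coefficients 2.93 and 0.98 (the fitted slope, 0.97242, would round to 0.97).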

for (i in 1:nrow(FAA_Normal_temp))
{
  if (FAA_Normal_temp$imp[i] == 0)
  {
    FAA_Normal_temp$speed_air[i] = 2.93 +0.98*FAA_Normal_temp$speed_ground[i]
  }
}
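
An alternative that avoids copying coefficients by hand is to let predict() fill in the missing values from the fitted model. The sketch below is an illustration on simulated data, not the code used for the results in this report: `flights`, the simulated coefficients and the rule that hides speed_air below 90 MPH are all assumptions made for the demo.

```r
# Simulated stand-in data: speed_air is only observed at higher ground speeds
set.seed(2)
n <- 100
speed_ground <- runif(n, min = 40, max = 130)
speed_air <- 3 + 0.97 * speed_ground + rnorm(n, sd = 1.5)
speed_air[speed_ground < 90] <- NA
flights <- data.frame(speed_ground, speed_air)
observed_before <- flights$speed_air

# Fit on the complete cases (lm drops rows with NA by default)
fit <- lm(speed_air ~ speed_ground, data = flights)

# Replace only the missing entries with the fitted values
missing <- is.na(flights$speed_air)
flights$speed_air[missing] <- predict(fit, newdata = flights[missing, ])
print(sum(missing))
```

Because the full-precision coefficients are used, the imputed values lie exactly on the fitted line.
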
aircraft <- as.numeric(FAA_Normal_temp$ï..aircraft)
cor_aircraft <- cor(FAA_Normal_temp$distance,aircraft)
cor_duration<-cor(FAA_Normal_temp$distance,FAA_Normal_temp$duration)
cor_speedground<-cor(FAA_Normal_temp$distance,FAA_Normal_temp$speed_ground)
cor_pasg<-cor(FAA_Normal_temp$distance,FAA_Normal_temp$no_pasg)
cor_speedair<-cor(FAA_Normal_temp$distance,FAA_Normal_temp$speed_air)
cor_heigt<-cor(FAA_Normal_temp$distance,FAA_Normal_temp$height)
cor_pitch<-cor(FAA_Normal_temp$distance,FAA_Normal_temp$pitch)
Size <- c(cor_aircraft,cor_duration,cor_speedground,cor_pasg,cor_speedair,cor_heigt,cor_pitch)
Direction <-c("positive","negative","positive","negative","positive","positive","positive")
names <- c("aircraft","duration","speed_ground","no_pasg","speed_air","height","pitch")
table_0=data.frame(names,Size,Direction)
table1 <-table_0[order(-abs(table_0$Size)), , drop = FALSE]

table1

##          names        Size Direction
## 3 speed_ground  0.86624383  positive
## 5    speed_air  0.86513096  positive
## 1     aircraft  0.23814452  positive
## 6       height  0.09941121  positive
## 7        pitch  0.08702846  positive
## 2     duration -0.05026941  negative
## 4      no_pasg -0.01775663  negative
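
The direction column above is typed by hand. As a hedged sketch, the whole table could also be derived from the data so that size and direction always agree; `flights` is simulated here with made-up effects and `table1_sketch` is a hypothetical name.

```r
# Simulated stand-in data (effects are invented for the demo)
set.seed(3)
n <- 200
flights <- data.frame(
  speed_ground = rnorm(n, mean = 80, sd = 19),
  height       = rnorm(n, mean = 30, sd = 10),
  no_pasg      = round(rnorm(n, mean = 60, sd = 7))
)
flights$distance <- -1800 + 42 * flights$speed_ground +
  14 * flights$height + rnorm(n, sd = 400)

# Correlation of every predictor with the response
predictors <- setdiff(names(flights), "distance")
r <- sapply(flights[predictors], cor, y = flights$distance)

# Size is the absolute correlation; direction is read off the sign
table1_sketch <- data.frame(
  names     = predictors,
  Size      = abs(r),
  Direction = ifelse(r >= 0, "positive", "negative"),
  row.names = NULL
)
table1_sketch <- table1_sketch[order(-table1_sketch$Size), ]
print(table1_sketch)
```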

Step 11. Show X-Y scatter plots. Do you think the correlation strength observed in these plots is consistent with the values computed in Step 10?

Yes. The correlation strength seen in the scatter plots is consistent with the values computed in Step 10.
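
The plotting code is not shown in this report. A minimal sketch of how the X-Y scatter plots could be produced, with the correlation printed in each panel title for comparison with Table 1, follows; `flights` is simulated stand-in data with made-up effects.

```r
# Simulated stand-in data (for illustration only)
set.seed(4)
n <- 200
flights <- data.frame(
  speed_ground = rnorm(n, mean = 80, sd = 19),
  pitch        = rnorm(n, mean = 4, sd = 0.5)
)
flights$distance <- -1800 + 42 * flights$speed_ground + rnorm(n, sd = 400)

# One scatter plot of distance against each predictor
predictors <- setdiff(names(flights), "distance")
old_par <- par(mfrow = c(1, 2))
for (col in predictors) {
  r <- cor(flights[[col]], flights$distance)
  plot(flights[[col]], flights$distance,
       xlab = col, ylab = "distance",
       main = sprintf("r = %.2f", r))
}
par(old_par)
```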

Step 12. Have you included the airplane make as a possible factor in Steps 10-11? You can code this character variable as 0/1.

Yes, the airplane make was included as a factor in Steps 10-11. With this character variable coded as 0/1, the computation for Step 10 is as follows:

aircraft_01 <- (as.numeric(FAA_Normal_temp$ï..aircraft)-1)
cor_aircraft <- cor(FAA_Normal_temp$distance,aircraft_01)
cor_duration<-cor(FAA_Normal_temp$distance,FAA_Normal_temp$duration,use="complete.obs")
cor_speedground<-cor(FAA_Normal_temp$distance,FAA_Normal_temp$speed_ground,use="complete.obs")
cor_pasg<-cor(FAA_Normal_temp$distance,FAA_Normal_temp$no_pasg,use="complete.obs")
cor_speedair<-cor(FAA_Normal_temp$distance,FAA_Normal_temp$speed_air,use="complete.obs")
cor_heigt<-cor(FAA_Normal_temp$distance,FAA_Normal_temp$height,use="complete.obs")
cor_pitch<-cor(FAA_Normal_temp$distance,FAA_Normal_temp$pitch,use="complete.obs")
Size <- c(cor_aircraft,cor_duration,cor_speedground,cor_pasg,cor_speedair,cor_heigt,cor_pitch)
Direction <-c("positive","negative","positive","negative","positive","positive","positive")
names <- c("aircraft","duration","speed_ground","no_pasg","speed_air","height","pitch")
table_00=data.frame(names,Size,Direction)
table1_1<-table_00[order(-abs(table_00$Size)), , drop = FALSE]

table1_1
##          names        Size Direction
## 3 speed_ground  0.86624383  positive
## 5    speed_air  0.86513096  positive
## 1     aircraft  0.23814452  positive
## 6       height  0.09941121  positive
## 7        pitch  0.08702846  positive
## 2     duration -0.05026941  negative
## 4      no_pasg -0.01775663  negative

Step 13. Regress Y (landing distance) on each of the X variables. Provide a table that ranks the factors based on its significance. The smaller the p-value, the more significant the factor. This table contains three columns: the names of variables, the size of the p-value, the direction of the regression coefficient (positive or negative). We call it Table 2.

reg1 <- lm(FAA_Normal_temp$distance~FAA_Normal_temp$duration)
Pr1_duration <- summary(reg1)$coefficients[2,4]
reg2 <- lm(FAA_Normal_temp$distance~FAA_Normal_temp$no_pasg)
Pr2_pasg <- summary(reg2)$coefficients[2,4]
reg3 <- lm(FAA_Normal_temp$distance~FAA_Normal_temp$speed_ground)
Pr3_speedground <- summary(reg3)$coefficients[2,4]
reg4 <- lm(FAA_Normal_temp$distance~FAA_Normal_temp$speed_air)
Pr4_speedair <- summary(reg4)$coefficients[2,4]
reg5 <-lm(FAA_Normal_temp$distance~FAA_Normal_temp$height)
Pr5_height <- summary(reg5)$coefficients[2,4]
reg6 <- lm(FAA_Normal_temp$distance~FAA_Normal_temp$pitch)
Pr6_pitch <- summary(reg6)$coefficients[2,4]
reg7 <- lm(FAA_Normal_temp$distance~aircraft)
Pr7_aircraft <- summary(reg7)$coefficients[2,4]
names <- c("duration","no_pasg","speed_ground","speed_air","height","pitch","aircraft")
size_p <- c(Pr1_duration,Pr2_pasg,Pr3_speedground,Pr4_speedair,Pr5_height,Pr6_pitch,Pr7_aircraft)
# Signs of the fitted slopes; duration and no_pasg are negative, matching the negative correlations in Table 1
direction_p <- c("negative","negative","positive","positive","positive","positive","positive")
table_000=data.frame(names,size_p,direction_p)
table2<-table_000[order(table_000$size_p), , drop = FALSE]

table2

##          names        size_p direction_p
## 3 speed_ground 4.766371e-252    positive
## 4    speed_air 1.155925e-250    positive
## 7     aircraft  3.526194e-12    positive
## 5       height  4.123860e-03    positive
## 6        pitch  1.208124e-02    positive
## 1     duration  1.476579e-01    negative
## 2      no_pasg  6.092520e-01    negative
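
Since the direction in Table 2 should be the sign of each fitted slope, it can be read from the model rather than typed in. This is a sketch on simulated data: `flights`, the negative duration effect and the helper `p_and_sign()` are all invented for the demo.

```r
# Simulated stand-in data; the negative duration effect is made up
set.seed(5)
n <- 300
flights <- data.frame(
  speed_ground = rnorm(n, mean = 80, sd = 19),
  duration     = rnorm(n, mean = 150, sd = 50)
)
flights$distance <- -1000 + 42 * flights$speed_ground -
  8 * flights$duration + rnorm(n, sd = 300)

# Simple regression of distance on one predictor: p-value and slope sign
p_and_sign <- function(col) {
  fit <- lm(reformulate(col, response = "distance"), data = flights)
  coefs <- summary(fit)$coefficients
  data.frame(
    names       = col,
    size_p      = coefs[2, 4],
    direction_p = ifelse(coefs[2, 1] >= 0, "positive", "negative")
  )
}

predictors <- setdiff(names(flights), "distance")
table2_sketch <- do.call(rbind, lapply(predictors, p_and_sign))
table2_sketch <- table2_sketch[order(table2_sketch$size_p), ]
print(table2_sketch)
```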

Step 14. Standardize each X variable. In other words, create a new variable X’={X-mean(X)}/sd(X). Regress Y (landing distance) on each of the X’ variables. Provide a table that ranks the factors based on the size of the regression coefficient. The larger the size, the more important the factor. This table contains three columns: the names of variables, the size of the regression coefficient, the direction of the regression coefficient (positive or negative). We call it Table 3.

aircraft_cof <- (aircraft_01-mean(aircraft_01))/sd(aircraft_01)
reg0_0 <- lm(FAA_Normal_temp$distance~aircraft_cof)
Pr0_aircraft_cof <-summary(reg0_0)$coefficients[2,1]
duration1<-(FAA_Normal_temp$duration-mean(FAA_Normal_temp$duration))/sd(FAA_Normal_temp$duration)
reg1_1 <- lm(FAA_Normal_temp$distance~duration1)
Pr1_dur1 <- summary(reg1_1)$coefficients[2,1]
no_pasg1<-(FAA_Normal_temp$no_pasg-mean(FAA_Normal_temp$no_pasg))/sd(FAA_Normal_temp$no_pasg)
reg2_1 <- lm(FAA_Normal_temp$distance~no_pasg1)
Pr2_pasg1 <- summary(reg2_1)$coefficients[2,1]
speed_ground1<-(FAA_Normal_temp$speed_ground-mean(FAA_Normal_temp$speed_ground))/sd(FAA_Normal_temp$speed_ground)
reg3_1 <- lm(FAA_Normal_temp$distance~speed_ground1)
Pr3_spg1 <- summary(reg3_1)$coefficients[2,1]
speed_air1<-(FAA_Normal_temp$speed_air-mean(FAA_Normal_temp$speed_air))/sd(FAA_Normal_temp$speed_air)
reg4_1 <- lm(FAA_Normal_temp$distance~speed_air1)
Pr4_spa1 <- summary(reg4_1)$coefficients[2,1]
height1<-(FAA_Normal_temp$height-mean(FAA_Normal_temp$height))/sd(FAA_Normal_temp$height)
reg5_1 <-lm(FAA_Normal_temp$distance ~height1)
Pr5_hegt1 <- summary(reg5_1)$coefficients[2,1]
pitch1<-(FAA_Normal_temp$pitch-mean(FAA_Normal_temp$pitch))/sd(FAA_Normal_temp$pitch)
reg6_1 <- lm(FAA_Normal_temp$distance~pitch1)
Pr6_pit1 <- summary(reg6_1)$coefficients[2,1]
names_1 <- c("duration1","no_pasg1","speed_ground1","speed_air1","height1","pitch1","aircraft1")
size_1 <- c(Pr1_dur1,Pr2_pasg1,Pr3_spg1,Pr4_spa1,Pr5_hegt1,Pr6_pit1,Pr0_aircraft_cof)
direction_1 <- c("negative","negative","positive","positive","positive","positive","positive")
table_01=data.frame(names_1,size_1,direction_1)
table3<-table_01[order(-abs(table_01$size_1)), , drop = FALSE]

table3

##         names_1    size_1 direction_1
## 3 speed_ground1 776.44740    positive
## 4    speed_air1 775.44988    positive
## 7     aircraft1 213.45802    positive
## 5       height1  89.10606    positive
## 6        pitch1  78.00693    positive
## 1     duration1 -45.05839    negative
## 2      no_pasg1 -15.91595    negative
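
In a simple regression on a standardized predictor, the slope equals the correlation multiplied by the standard deviation of the response, which is why Table 3 ranks the factors in the same order as Table 1. A short sketch that checks this identity on simulated data (`flights` and its coefficients are made up):

```r
# Simulated stand-in data (for illustration only)
set.seed(6)
n <- 200
flights <- data.frame(
  speed_ground = rnorm(n, mean = 80, sd = 19),
  height       = rnorm(n, mean = 30, sd = 10)
)
flights$distance <- -1800 + 42 * flights$speed_ground +
  14 * flights$height + rnorm(n, sd = 400)
predictors <- setdiff(names(flights), "distance")

# Slope of distance on each standardized predictor
std_coef <- sapply(predictors, function(col) {
  z <- as.numeric(scale(flights[[col]]))   # (x - mean(x)) / sd(x)
  unname(coef(lm(flights$distance ~ z))[2])
})

# The same quantity obtained from the correlation
from_cor <- sapply(flights[predictors], cor, y = flights$distance) *
  sd(flights$distance)

print(cbind(std_coef, from_cor))
```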

Step 15. Compare Tables 1, 2, 3. Are the results consistent? At this point, you will meet with a FAA agent again. Please provide a single table that ranks all the factors based on their relative importance in determining the landing distance. We call it Table 0.

table1
##          names        Size Direction
## 3 speed_ground  0.86624383  positive
## 5    speed_air  0.86513096  positive
## 1     aircraft  0.23814452  positive
## 6       height  0.09941121  positive
## 7        pitch  0.08702846  positive
## 2     duration -0.05026941  negative
## 4      no_pasg -0.01775663  negative
table2
##          names        size_p direction_p
## 3 speed_ground 4.766371e-252    positive
## 4    speed_air 1.155925e-250    positive
## 7     aircraft  3.526194e-12    positive
## 5       height  4.123860e-03    positive
## 6        pitch  1.208124e-02    positive
## 1     duration  1.476579e-01    negative
## 2      no_pasg  6.092520e-01    negative
table3
##         names_1    size_1 direction_1
## 3 speed_ground1 776.44740    positive
## 4    speed_air1 775.44988    positive
## 7     aircraft1 213.45802    positive
## 5       height1  89.10606    positive
## 6        pitch1  78.00693    positive
## 1     duration1 -45.05839    negative
## 2      no_pasg1 -15.91595    negative

The three tables are consistent: they rank the factors in the same order. The ranking of the factors by their importance in determining the landing distance is as follows:

table0

## [1] "speed_ground" "speed_air"    "aircraft"     "height"       "pitch"       
## [6] "duration"     "no_pasg"

Check collinearity

Step 16. Compare the regression coefficients of the three models below:

Model 1: LD ~ Speed_ground

Model 2: LD ~ Speed_air

Model 3: LD ~ Speed_ground + Speed_air

Do you observe any significance change and sign change? Check the correlation between Speed_ground and Speed_air. You may want to keep one of them in the model selection. Which one would you pick? Why?

reg3 <- lm(FAA_Normal_temp$distance~FAA_Normal_temp$speed_ground)
summary(reg3)
## 
## Call:
## lm(formula = FAA_Normal_temp$distance ~ FAA_Normal_temp$speed_ground)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -897.09 -319.16  -72.09  210.83 1798.88 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -1773.9407    67.8388  -26.15   <2e-16 ***
## FAA_Normal_temp$speed_ground    41.4422     0.8302   49.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 448.1 on 829 degrees of freedom
## Multiple R-squared:  0.7504, Adjusted R-squared:  0.7501 
## F-statistic:  2492 on 1 and 829 DF,  p-value: < 2.2e-16
reg4 <- lm(FAA_Normal_temp$distance~FAA_Normal_temp$speed_air)
summary(reg4)
## 
## Call:
## lm(formula = FAA_Normal_temp$distance ~ FAA_Normal_temp$speed_air)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -903.06 -320.50  -70.28  215.92 1817.21 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -1930.1608    71.2487  -27.09   <2e-16 ***
## FAA_Normal_temp$speed_air    42.7893     0.8616   49.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 449.8 on 829 degrees of freedom
## Multiple R-squared:  0.7485, Adjusted R-squared:  0.7481 
## F-statistic:  2467 on 1 and 829 DF,  p-value: < 2.2e-16
reg34 <-lm(FAA_Normal_temp$distance~FAA_Normal_temp$speed_ground+FAA_Normal_temp$speed_air)
summary(reg34)
## 
## Call:
## lm(formula = FAA_Normal_temp$distance ~ FAA_Normal_temp$speed_ground + 
##     FAA_Normal_temp$speed_air)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -896.00 -317.85  -72.66  207.47 1796.14 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -1741.513    102.261 -17.030   <2e-16 ***
## FAA_Normal_temp$speed_ground    49.644     19.365   2.564   0.0105 *  
## FAA_Normal_temp$speed_air       -8.488     20.020  -0.424   0.6717    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 448.3 on 828 degrees of freedom
## Multiple R-squared:  0.7504, Adjusted R-squared:  0.7498 
## F-statistic:  1245 on 2 and 828 DF,  p-value: < 2.2e-16
cor(FAA_Normal_temp$speed_ground,FAA_Normal_temp$speed_air)
## [1] 0.9990797

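In Model 3 (LD ~ Speed_ground + Speed_air), the coefficient of speed_air changes sign from positive (42.79 in Model 2) to negative (-8.49) and is no longer significant (p = 0.6717); speed_ground stays positive, but its p-value rises from < 2e-16 to 0.0105. The correlation between speed_ground and speed_air is 0.9990797, so the two variables are collinear and only one of them should be kept. Model 1 has a slightly higher adjusted R-squared than Model 2 (0.7501 vs. 0.7481), and speed_ground is fully observed whereas 628 of the 831 speed_air values were imputed from it, so I keep Model 1 (LD ~ Speed_ground) for the later analysis.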
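
One way to quantify the collinearity is the variance inflation factor, which for two predictors is 1 / (1 - r^2). The sketch below simply plugs in the correlation reported above; it suggests the coefficient variances are inflated several hundred times, which is in line with the standard error of the speed_ground coefficient rising from 0.83 in Model 1 to 19.37 in Model 3.

```r
# Correlation between speed_ground and speed_air reported in this step
r <- 0.9990797

# Variance inflation factor for a model with exactly two predictors
vif <- 1 / (1 - r^2)

# Standard errors are inflated by roughly the square root of the VIF
se_inflation <- sqrt(vif)
print(c(vif = vif, se_inflation = se_inflation))
```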

Step 17. Suppose in Table 0, the variable ranking is as follows: X1, X2, X3, … Please fit the following six models:

Model 1: LD ~ X1

Model 2: LD ~ X1 + X2

Model 3: LD ~ X1 + X2 + X3

…

Calculate the R-squared for each model. Plot these R-squared values versus the number of variables p. What patterns do you observe?

## [1] "speed_ground" "speed_air"    "aircraft"     "height"       "pitch"       
## [6] "duration"     "no_pasg"

speed_air was dropped in Step 16, so only six variables are left: “speed_ground”, “aircraft”, “height”, “pitch”, “duration” and “no_pasg”.

model1 <- lm(FAA_Normal_temp$distance~FAA_Normal_temp$speed_ground)
model2 <- lm(FAA_Normal_temp$distance~FAA_Normal_temp$speed_ground+aircraft_01)
model3 <- lm(FAA_Normal_temp$distance~FAA_Normal_temp$speed_ground+aircraft_01+FAA_Normal_temp$height)
model4 <- lm(FAA_Normal_temp$distance~FAA_Normal_temp$speed_ground+aircraft_01+FAA_Normal_temp$height+FAA_Normal_temp$pitch)
model5 <- lm(FAA_Normal_temp$distance~FAA_Normal_temp$speed_ground+aircraft_01+FAA_Normal_temp$height+FAA_Normal_temp$pitch+FAA_Normal_temp$duration)
model6 <- lm(FAA_Normal_temp$distance~FAA_Normal_temp$speed_ground+aircraft_01+FAA_Normal_temp$height+FAA_Normal_temp$pitch+FAA_Normal_temp$duration+FAA_Normal_temp$no_pasg)
names_model<-c("model1","model2","model3","model4","model5","model6")
Rsquared <- c(summary(model1)$r.squared,summary(model2)$r.squared,summary(model3)$r.squared,summary(model4)$r.squared,summary(model5)$r.squared,summary(model6)$r.squared)
no_variable <-c(1,2,3,4,5,6)
table_temp=data.frame(names_model,Rsquared,no_variable)
table_model<-table_temp[order(-table_temp$Rsquared), , drop = FALSE]
table_model
##   names_model  Rsquared no_variable
## 6      model6 0.8497162           6
## 5      model5 0.8493818           5
## 4      model4 0.8493717           4
## 3      model3 0.8488989           3
## 2      model2 0.8251319           2
## 1      model1 0.7503784           1

table_model shows that R-squared increases every time an additional variable is added to the model, although the gain becomes negligible after the third variable (0.8489 for model3 versus 0.8497 for model6).
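
The requested plot of R-squared against the number of variables p is not included above. A minimal sketch of how the nested models could be fitted in a loop and plotted is shown here; `flights`, its effects and the `ranking` vector are simulated or hypothetical, and with the real data the ranking would come from Table 0.

```r
# Simulated stand-in data (effects are invented for the demo)
set.seed(7)
n <- 200
flights <- data.frame(
  speed_ground = rnorm(n, mean = 80, sd = 19),
  height       = rnorm(n, mean = 30, sd = 10),
  pitch        = rnorm(n, mean = 4, sd = 0.5),
  no_pasg      = round(rnorm(n, mean = 60, sd = 7))
)
flights$distance <- -1800 + 42 * flights$speed_ground +
  14 * flights$height + rnorm(n, sd = 400)

# Hypothetical variable ranking, most important first
ranking <- c("speed_ground", "height", "pitch", "no_pasg")

# Fit LD ~ X1, LD ~ X1 + X2, ... and collect R-squared
r2 <- sapply(seq_along(ranking), function(p) {
  fit <- lm(reformulate(ranking[1:p], response = "distance"), data = flights)
  summary(fit)$r.squared
})

plot(seq_along(ranking), r2, type = "b",
     xlab = "number of variables p", ylab = "R-squared")
```

R-squared can never decrease when a variable is added to a nested least-squares model, so the curve is always non-decreasing; what matters is where it flattens.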

Step 18. Repeat Step 17 but use adjusted R-squared values instead.

names_model<-c("model1","model2","model3","model4","model5","model6")
adj_Rsquared <- c(summary(model1)$adj.r.squared,summary(model2)$adj.r.squared,summary(model3)$adj.r.squared,summary(model4)$adj.r.squared,summary(model5)$adj.r.squared,summary(model6)$adj.r.squared)
no_variable <-c(1,2,3,4,5,6)
table_temp_adj=data.frame(names_model,adj_Rsquared,no_variable)
table_model_adj<-table_temp_adj[order(-table_temp_adj$adj_Rsquared), , drop = FALSE]
table_model_adj
##   names_model adj_Rsquared no_variable
## 4      model4    0.8486423           4
## 6      model6    0.8486219           6
## 5      model5    0.8484689           5
## 3      model3    0.8483508           3
## 2      model2    0.8247095           2
## 1      model1    0.7500773           1

Adjusted R-squared is highest for model4 (0.8486), which uses four variables: speed_ground, aircraft, height and pitch. The differences among models 3 to 6 are very small.

Step 19. Repeat Step 17 but use AIC values instead.

names_model<-c("model1","model2","model3","model4","model5","model6")
no_variable <-c(1,2,3,4,5,6)
##   names_model AIC_VALUE no_variable
## 4      model4  12095.05           4
## 3      model3  12095.65           3
## 5      model5  12096.99           5
## 6      model6  12097.14           6
## 2      model2  12215.05           2
## 1      model1  12508.81           1

For AIC, lower is better. model4 has the lowest AIC (12095.05), so we pick its four variables: speed_ground, aircraft, height and pitch.
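
The lines that compute AIC_VALUE are not shown above. The sketch below is one plausible way to build such a table, assuming the values come from AIC(); the data and the two models are simulated stand-ins. It also checks how AIC() relates to extractAIC(), the function step() relies on: for lm fits the two differ by the constant n * (1 + log(2 * pi)) + 2, which does not change the ranking of models fitted to the same data.

```r
# Simulated stand-in data (for illustration only)
set.seed(8)
n <- 200
flights <- data.frame(
  speed_ground = rnorm(n, mean = 80, sd = 19),
  height       = rnorm(n, mean = 30, sd = 10)
)
flights$distance <- -1800 + 42 * flights$speed_ground +
  14 * flights$height + rnorm(n, sd = 400)

models <- list(
  model1 = lm(distance ~ speed_ground, data = flights),
  model2 = lm(distance ~ speed_ground + height, data = flights)
)

# AIC table sorted from best (lowest) to worst
aic_value <- sapply(models, AIC)
aic_table <- data.frame(names_model = names(models),
                        AIC_VALUE   = aic_value,
                        no_variable = c(1, 2),
                        row.names   = NULL)
aic_table <- aic_table[order(aic_table$AIC_VALUE), ]
print(aic_table)

# Gap between AIC() and the AIC used by step(), for the same model
gap <- AIC(models$model1) - extractAIC(models$model1)[2]
expected_gap <- n * (1 + log(2 * pi)) + 2
print(c(gap = gap, expected_gap = expected_gap))
```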

Step 20. Compare the results in Steps 18-19, what variables would you select to build a predictive model for LD?

The results of Steps 18 and 19 are consistent: both select model4. We therefore use its four variables, speed_ground, aircraft, height and pitch, to build the predictive model for LD.

Variable selection based on an automated algorithm

Step 21. Use the R function “StepAIC” to perform forward variable selection. Compare the result with that in Step 19.

modelstart <-lm(FAA_Normal_temp$distance~1)
step(modelstart,direction = "forward",scope = formula(model6))
## Start:  AIC=11299.8
## FAA_Normal_temp$distance ~ 1
## 
##                                Df Sum of Sq       RSS   AIC
## + FAA_Normal_temp$speed_ground  1 500382567 166457762 10148
## + aircraft_01                   1  37818390 629021939 11253
## + FAA_Normal_temp$height        1   6590108 660250221 11294
## + FAA_Normal_temp$pitch         1   5050617 661789712 11296
## + FAA_Normal_temp$duration      1   1685114 665155215 11300
## <none>                                      666840329 11300
## + FAA_Normal_temp$no_pasg       1    210253 666630076 11302
## 
## Step:  AIC=10148.53
## FAA_Normal_temp$distance ~ FAA_Normal_temp$speed_ground
## 
##                            Df Sum of Sq       RSS     AIC
## + aircraft_01               1  49848656 116609106  9854.8
## + FAA_Normal_temp$height    1  14916377 151541385 10072.5
## + FAA_Normal_temp$pitch     1   9765095 156692668 10100.3
## <none>                                  166457762 10148.5
## + FAA_Normal_temp$no_pasg   1    207528 166250234 10149.5
## + FAA_Normal_temp$duration  1     51669 166406094 10150.3
## 
## Step:  AIC=9854.77
## FAA_Normal_temp$distance ~ FAA_Normal_temp$speed_ground + aircraft_01
## 
##                            Df Sum of Sq       RSS    AIC
## + FAA_Normal_temp$height    1  15848830 100760276 9735.4
## + FAA_Normal_temp$pitch     1    455453 116153654 9853.5
## <none>                                  116609106 9854.8
## + FAA_Normal_temp$no_pasg   1     87171 116521935 9856.1
## + FAA_Normal_temp$duration  1      8445 116600661 9856.7
## 
## Step:  AIC=9735.37
## FAA_Normal_temp$distance ~ FAA_Normal_temp$speed_ground + aircraft_01 + 
##     FAA_Normal_temp$height
## 
##                            Df Sum of Sq       RSS    AIC
## + FAA_Normal_temp$pitch     1    315259 100445017 9734.8
## <none>                                  100760276 9735.4
## + FAA_Normal_temp$no_pasg   1    232003 100528273 9735.5
## + FAA_Normal_temp$duration  1      3976 100756300 9737.3
## 
## Step:  AIC=9734.77
## FAA_Normal_temp$distance ~ FAA_Normal_temp$speed_ground + aircraft_01 + 
##     FAA_Normal_temp$height + FAA_Normal_temp$pitch
## 
##                            Df Sum of Sq       RSS    AIC
## <none>                                  100445017 9734.8
## + FAA_Normal_temp$no_pasg   1    225608 100219409 9734.9
## + FAA_Normal_temp$duration  1      6696 100438321 9736.7
## 
## Call:
## lm(formula = FAA_Normal_temp$distance ~ FAA_Normal_temp$speed_ground + 
##     aircraft_01 + FAA_Normal_temp$height + FAA_Normal_temp$pitch)
## 
## Coefficients:
##                  (Intercept)  FAA_Normal_temp$speed_ground  
##                     -2664.32                         42.43  
##                  aircraft_01        FAA_Normal_temp$height  
##                       481.27                         14.09  
##        FAA_Normal_temp$pitch  
##                        39.61

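Forward selection with step() picks the variables speed_ground, aircraft, height and pitch, with a final AIC of 9734.77. This value differs from the 12095.05 reported in Step 19 because step() computes AIC only up to an additive constant (via extractAIC()); the constant is the same for every model fitted to the same data, so the ranking is unaffected. The variables picked in Step 21 are consistent with Steps 18-20.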
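
The step() output above is hard to read because every term carries the FAA_Normal_temp$ prefix. A sketch of the same forward selection written with a data argument and plain column names follows; the data frame `flights` and its effects are simulated, and `trace = 0` only suppresses the step-by-step printout.

```r
# Simulated stand-in data (effects are invented for the demo)
set.seed(9)
n <- 200
flights <- data.frame(
  speed_ground = rnorm(n, mean = 80, sd = 19),
  height       = rnorm(n, mean = 30, sd = 10),
  pitch        = rnorm(n, mean = 4, sd = 0.5),
  no_pasg      = round(rnorm(n, mean = 60, sd = 7))
)
flights$distance <- -1800 + 42 * flights$speed_ground +
  14 * flights$height + rnorm(n, sd = 400)

# Start from the intercept-only model and let step() add terms
null_model <- lm(distance ~ 1, data = flights)
full_scope <- distance ~ speed_ground + height + pitch + no_pasg
selected <- step(null_model, direction = "forward",
                 scope = full_scope, trace = 0)
print(formula(selected))
```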