Background: Flight landing.
Motivation: To reduce the risk of landing overrun.
Goal: To study which factors affect the landing distance of a commercial flight, and how.
Data: Landing data (landing distance and other parameters) from 950 commercial flights (not a real data set, but simulated from statistical models). See the two Excel files ‘FAA-1.xls’ (800 flights) and ‘FAA-2.xls’ (150 flights).
Variable dictionary:
Aircraft: The make of an aircraft (Boeing or Airbus).
Duration (in minutes): Flight duration between taking off and landing. The duration of a normal flight should always be greater than 40min.
No_pasg: The number of passengers in a flight.
Speed_ground (in miles per hour): The ground speed of an aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal.
Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal.
Height (in meters): The height of an aircraft when it is passing over the threshold of the runway. The landing aircraft is required to be at least 6 meters high at the threshold of the runway.
Pitch (in degrees): Pitch angle of an aircraft when it is passing over the threshold of the runway.
Distance (in feet): The landing distance of an aircraft. More specifically, it refers to the distance between the threshold of the runway and the point where the aircraft can be fully stopped. The length of the airport runway is typically less than 6000 feet.
Part 1. Practice of modeling the landing distance using linear regression.
Please write R programs to complete the following steps. In each step, provide:
o The R code (how do you realize it?)
o The R output (copy and paste only the relevant parts)
o Your observations (what do you observe from the output?)
o Your conclusion/decision
Initial exploration of the data
Step 1. Read the two files ‘FAA-1.xls’ (800 flights) and ‘FAA-2.xls’ (150 flights) into your R system. Please search “Read Excel files from R” in Google in case you do not know how to do that.
setwd("C:/Users/yizha/Desktop/Spring 2020/Statistical Modeling/Week-1")
FAA1 <- read.csv('FAA1.csv',header = T)
FAA2 <- read.csv('FAA2.csv',header = T)
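The first column appears as `ï..aircraft` in the output that follows, which usually means the CSV file starts with a UTF-8 byte-order mark (BOM). This is an assumption about the files, not something confirmed here. A minimal sketch of reading such a file so that the column is named `aircraft`, using a small temporary file as a stand-in for FAA1.csv:

```r
# Toy file standing in for FAA1.csv: a UTF-8 CSV that starts with a byte-order mark
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)   # the three BOM bytes
writeBin(charToRaw("aircraft,distance\nboeing,3370\nairbus,1145\n"), con)
close(con)

# fileEncoding = "UTF-8-BOM" asks R to drop the mark while reading
demo <- read.csv(tmp, header = TRUE, fileEncoding = "UTF-8-BOM")
names(demo)
```

The rest of this report keeps the name `ï..aircraft`, because all the output shown was produced with it.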
Step 2. Check the structure of each data set using the “str” function. For each data set, what is the sample size and how many variables? Is there any difference between the two data sets?
str(FAA1)
## 'data.frame': 800 obs. of 8 variables:
## $ ï..aircraft : Factor w/ 2 levels "airbus","boeing": 2 2 2 2 2 2 2 2 2 2 ...
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : int 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
str(FAA2)
## 'data.frame': 150 obs. of 7 variables:
## $ ï..aircraft : Factor w/ 2 levels "airbus","boeing": 2 2 2 2 2 2 2 2 2 2 ...
## $ no_pasg : int 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
FAA1 has 800 observations of 8 variables, and FAA2 has 150 observations of 7 variables. The difference between the two data sets is that FAA2 does not have the variable ‘duration’.
Step 3. Merge the two data sets. Are there any duplications? Search “check duplicates in r” if you do not know how to check duplications. If the answer is “Yes”, what action would you take?
Use merge() with all = TRUE (a full outer join) to merge the two data sets.
FAA <- merge(FAA1,FAA2,all = TRUE)
str(FAA)
## 'data.frame': 850 obs. of 8 variables:
## $ ï..aircraft : Factor w/ 2 levels "airbus","boeing": 1 1 1 1 1 1 1 1 1 1 ...
## $ no_pasg : int 36 38 40 41 43 44 45 45 45 45 ...
## $ speed_ground: num 47.5 85.2 80.6 97.6 82.5 ...
## $ speed_air : num NA NA NA 97 NA ...
## $ height : num 14 37 28.6 38.4 30.1 ...
## $ pitch : num 4.3 4.12 3.62 3.53 4.09 ...
## $ distance : num 251 1257 1021 2168 1321 ...
## $ duration : num 172 188 93.5 123.3 109.2 ...
There are 100 duplicate observations (800 + 150 = 950 rows before merging, 850 after). merge() matches rows on all shared columns, so each duplicated flight appears only once in the merged data set. FAA has 850 observations and 8 variables.
sum(duplicated(FAA))
## [1] 0
The merged data set FAA contains no duplicates; they were removed during the merge.
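The count of 100 duplicates was inferred from the row totals. A minimal sketch of counting the duplicates directly before merging, by stacking the columns the two files share. The data frames `a` and `b` are hypothetical toy stand-ins for FAA1 and FAA2, and the sketch assumes neither file repeats a flight internally:

```r
# Toy stand-ins for FAA1 and FAA2 (hypothetical values, not the assignment data)
a <- data.frame(aircraft = c("boeing", "boeing", "airbus"),
                duration = c(98.5, 125.7, 112.0),
                distance = c(3370, 2988, 1145))
b <- data.frame(aircraft = c("boeing", "airbus"),
                distance = c(3370, 999))      # first row repeats a flight in `a`

# Compare only the columns that both files contain
common  <- intersect(names(a), names(b))
stacked <- rbind(a[common], b[common])
n_dup   <- sum(duplicated(stacked))           # rows of `b` already present in `a`

# The full outer join keeps one copy of each matched flight
merged <- merge(a, b, all = TRUE)
c(duplicates = n_dup, merged_rows = nrow(merged))
```

The merged row count should equal nrow(a) + nrow(b) minus the number of duplicates, which is the same arithmetic as 800 + 150 - 100 = 850 above.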
Step 4. Check the structure of the combined data set. What is the sample size and how many variables? Provide summary statistics for each variable.
str(FAA)
## 'data.frame': 850 obs. of 8 variables:
## $ ï..aircraft : Factor w/ 2 levels "airbus","boeing": 1 1 1 1 1 1 1 1 1 1 ...
## $ no_pasg : int 36 38 40 41 43 44 45 45 45 45 ...
## $ speed_ground: num 47.5 85.2 80.6 97.6 82.5 ...
## $ speed_air : num NA NA NA 97 NA ...
## $ height : num 14 37 28.6 38.4 30.1 ...
## $ pitch : num 4.3 4.12 3.62 3.53 4.09 ...
## $ distance : num 251 1257 1021 2168 1321 ...
## $ duration : num 172 188 93.5 123.3 109.2 ...
The combined data set has 850 observations and 8 variables. Summary statistics for each variable are as follows:
summary(FAA)
## ï..aircraft no_pasg speed_ground speed_air height
## airbus:450 Min. :29.0 Min. : 27.74 Min. : 90.00 Min. :-3.546
## boeing:400 1st Qu.:55.0 1st Qu.: 65.90 1st Qu.: 96.25 1st Qu.:23.314
## Median :60.0 Median : 79.64 Median :101.15 Median :30.093
## Mean :60.1 Mean : 79.45 Mean :103.80 Mean :30.144
## 3rd Qu.:65.0 3rd Qu.: 92.06 3rd Qu.:109.40 3rd Qu.:36.993
## Max. :87.0 Max. :141.22 Max. :141.72 Max. :59.946
## NA's :642
## pitch distance duration
## Min. :2.284 Min. : 34.08 Min. : 14.76
## 1st Qu.:3.642 1st Qu.: 883.79 1st Qu.:119.49
## Median :4.008 Median :1258.09 Median :153.95
## Mean :4.009 Mean :1526.02 Mean :154.01
## 3rd Qu.:4.377 3rd Qu.:1936.95 3rd Qu.:188.91
## Max. :5.927 Max. :6533.05 Max. :305.62
## NA's :50
Step 5. By now, if you are asked to prepare ONE presentation slide to summarize your findings, what observations will you bring to the attention of FAA agents? Please list no more than five using “bullet statements”, from the most important to the least important.
The merged data set contains 850 unique observations of 8 variables, but many values are missing.
There are 450 Airbus and 400 Boeing flights; the maximum landing distance is 6533.05 feet, which exceeds the typical runway length of 6000 feet.
A normal flight should last more than 40 minutes, but the minimum duration is only 14.76 minutes, and 50 duration values are missing.
speed_ground and speed_air should lie between 30 and 140 MPH, but the minimum speed_ground is 27.74 and both maxima exceed 140 (141.22 and 141.72); speed_air also has 642 missing values.
The minimum height is -3.546 meters, an abnormal value well below the required 6 meters.
Data Cleaning and further exploration
Step 6. Are there abnormal values in the data set? Please refer to the variable dictionary for criteria defining “normal/abnormal” values. Remove the rows that contain any “abnormal values” and report how many rows you have removed.
There are abnormal values in the data set. The which() function identifies the affected rows:
which(FAA$duration<40)
## [1] 97 242 505 629 706
which(FAA$speed_ground<30)
## [1] 459 658
which(FAA$speed_ground>140)
## [1] 547
which(FAA$speed_air<30)
## integer(0)
which(FAA$speed_air>140)
## [1] 547
which(FAA$height<6)
## [1] 164 260 377 431 688 726 794 806 828 834
which(FAA$distance>6000)
## [1] 547 743
# Drop the 19 distinct rows flagged above (row 547 fails several checks but is listed once)
FAA_Normal_temp<-FAA[-c(97,242,505,629,706,459,658,547,164,260,377,431,688,726,794,806,828,834,743), ]
str(FAA_Normal_temp)
## 'data.frame': 831 obs. of 8 variables:
## $ ï..aircraft : Factor w/ 2 levels "airbus","boeing": 1 1 1 1 1 1 1 1 1 1 ...
## $ no_pasg : int 36 38 40 41 43 44 45 45 45 45 ...
## $ speed_ground: num 47.5 85.2 80.6 97.6 82.5 ...
## $ speed_air : num NA NA NA 97 NA ...
## $ height : num 14 37 28.6 38.4 30.1 ...
## $ pitch : num 4.3 4.12 3.62 3.53 4.09 ...
## $ distance : num 251 1257 1021 2168 1321 ...
## $ duration : num 172 188 93.5 123.3 109.2 ...
19 abnormal rows were removed from the data set.
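Hard-coded row numbers break as soon as the data change. A minimal sketch of the same cleaning rules written as one logical filter, on a hypothetical toy data frame `toy` (the helper `bad()` is introduced here for illustration only):

```r
# Toy data (hypothetical values): rows 2 to 6 each break one rule, row 1 is normal
toy <- data.frame(
  duration     = c(120, 35, NA, 150, 200, 180),
  speed_ground = c(80, 90, 25, 95, 100, 145),
  speed_air    = c(NA, NA, NA, 96, 101, 146),
  height       = c(30, 25, 28, 4, 33, 31),
  distance     = c(1500, 1800, 900, 2000, 6500, 5000)
)

# TRUE when a value is present and breaks the rule; missing values are not flagged,
# which mirrors how which() ignores NA in the checks above
bad <- function(cond) !is.na(cond) & cond

abnormal <- with(toy,
  bad(duration < 40) |
  bad(speed_ground < 30 | speed_ground > 140) |
  bad(speed_air < 30 | speed_air > 140) |
  bad(height < 6) |
  bad(distance > 6000))

clean <- toy[!abnormal, ]
c(removed = sum(abnormal), kept = nrow(clean))
```

Because each rule is named in the filter, the number of removed rows can be reported with sum(abnormal) instead of being counted by hand.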
Step 7. Repeat Step 4. Check the structure of the combined data set. What is the sample size and how many variables? Provide summary statistics for each variable.
str(FAA_Normal_temp)
## 'data.frame': 831 obs. of 8 variables:
## $ ï..aircraft : Factor w/ 2 levels "airbus","boeing": 1 1 1 1 1 1 1 1 1 1 ...
## $ no_pasg : int 36 38 40 41 43 44 45 45 45 45 ...
## $ speed_ground: num 47.5 85.2 80.6 97.6 82.5 ...
## $ speed_air : num NA NA NA 97 NA ...
## $ height : num 14 37 28.6 38.4 30.1 ...
## $ pitch : num 4.3 4.12 3.62 3.53 4.09 ...
## $ distance : num 251 1257 1021 2168 1321 ...
## $ duration : num 172 188 93.5 123.3 109.2 ...
summary(FAA_Normal_temp)
## ï..aircraft no_pasg speed_ground speed_air
## airbus:444 Min. :29.00 Min. : 33.57 Min. : 90.00
## boeing:387 1st Qu.:55.00 1st Qu.: 66.20 1st Qu.: 96.23
## Median :60.00 Median : 79.79 Median :101.12
## Mean :60.06 Mean : 79.54 Mean :103.48
## 3rd Qu.:65.00 3rd Qu.: 91.91 3rd Qu.:109.36
## Max. :87.00 Max. :132.78 Max. :132.91
## NA's :628
## height pitch distance duration
## Min. : 6.228 Min. :2.284 Min. : 41.72 Min. : 41.95
## 1st Qu.:23.530 1st Qu.:3.640 1st Qu.: 893.28 1st Qu.:119.63
## Median :30.167 Median :4.001 Median :1262.15 Median :154.28
## Mean :30.458 Mean :4.005 Mean :1522.48 Mean :154.78
## 3rd Qu.:37.004 3rd Qu.:4.370 3rd Qu.:1936.63 3rd Qu.:189.66
## Max. :59.946 Max. :5.927 Max. :5381.96 Max. :305.62
## NA's :50
After removing all abnormal rows, the new data set contains 831 observations of 8 variables; 19 abnormal rows were removed.
Step 8. Since you have a small set of variables, you may want to show histograms for all of them.
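No code or plots are recorded for this step. A minimal sketch of one way to draw a histogram for every numeric variable, using simulated stand-in data (`toy` and its values are hypothetical, not the FAA data):

```r
# Simulated stand-in for the cleaned data (hypothetical values, numeric columns only)
set.seed(1)
toy <- data.frame(speed_ground = rnorm(200, 80, 19),
                  height       = rnorm(200, 30, 10),
                  pitch        = rnorm(200, 4, 0.5),
                  distance     = rexp(200, 1 / 1500))

# Keep only the numeric columns, then draw one histogram per column
num_cols <- names(toy)[sapply(toy, is.numeric)]
old_par  <- par(mfrow = c(2, 2))             # 2 x 2 grid of panels
for (v in num_cols) {
  hist(toy[[v]], main = v, xlab = v)
}
par(old_par)                                 # restore the previous layout
```

With the real data, `toy` would be replaced by FAA_Normal_temp and the grid enlarged to fit the seven numeric variables.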
Step 9. Prepare another presentation slide to summarize your findings drawn from the cleaned data set, using no more than five “bullet statements”.
The cleaned data set contains 831 unique observations of 8 variables.
The mean landing distance is 1522 feet, with a minimum of 41.72 and a maximum of 5382.
Duration ranges from 41.95 to 305.62 minutes, with 50 missing values.
The means of speed_ground and speed_air are 79.54 and 103.48 MPH; speed_air has 628 missing values.
All heights are within the normal range (minimum 6.228 meters).
Initial analysis for identifying important factors that impact the response variable “landing distance”
Step 10. Compute the pairwise correlation between the landing distance and each factor X. Provide a table that ranks the factors based on the size (absolute value) of the correlation. This table contains three columns: the names of variables, the size of the correlation, the direction of the correlation (positive or negative). We call it Table 1, which will be used for comparison with our analysis later.
Duration has 50 missing values; impute them with the mean of the observed durations.
FAA_Normal_temp$duration[is.na(FAA_Normal_temp$duration)] <- mean(FAA_Normal_temp$duration[!is.na(FAA_Normal_temp$duration)])
speed_air has 628 missing values. It is highly correlated with speed_ground: cor(speed_ground, speed_air, use = "complete.obs") = 0.9879383. The missing values are therefore imputed by regressing speed_air on speed_ground.
# Indicator: 1 where the value is observed, 0 where it is missing
Ind <- function(t)
{
  x <- numeric(length(t))
  x[!is.na(t)] <- 1
  return(x)
}
FAA_Normal_temp$imp <- Ind(FAA_Normal_temp$speed_air)
impute <- lm(FAA_Normal_temp$speed_air~FAA_Normal_temp$speed_ground)
summary(impute)
##
## Call:
## lm(formula = FAA_Normal_temp$speed_air ~ FAA_Normal_temp$speed_ground)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2630 -0.9407 0.0009 1.1697 3.7466
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.92615 1.11678 2.62 0.00946 **
## FAA_Normal_temp$speed_ground 0.97242 0.01075 90.45 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.511 on 201 degrees of freedom
## (628 observations deleted due to missingness)
## Multiple R-squared: 0.976, Adjusted R-squared: 0.9759
## F-statistic: 8182 on 1 and 201 DF, p-value: < 2.2e-16
The fitted regression is speed_air = 2.92615 + 0.97242 * speed_ground.
# Fill each missing speed_air from speed_ground.
# Note: the slope below is hard-coded as 0.98, whereas the fitted value 0.97242 rounds to 0.97.
for (i in 1:nrow(FAA_Normal_temp))
{
  if (FAA_Normal_temp$imp[i] == 0)
  {
    FAA_Normal_temp$speed_air[i] = 2.93 +0.98*FAA_Normal_temp$speed_ground[i]
  }
}
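Typing the coefficients by hand invites rounding slips. A minimal sketch of the same regression imputation using predict(), so the fitted coefficients are used at full precision. The data frame `toy` is simulated and hypothetical; it is not the FAA data and will not reproduce the numbers in this report:

```r
# Simulated stand-in (hypothetical values): speed_air is missing for slower landings
set.seed(2)
toy <- data.frame(speed_ground = runif(100, 40, 130))
toy$speed_air <- 3 + 0.97 * toy$speed_ground + rnorm(100, sd = 1.5)
toy$speed_air[toy$speed_ground < 90] <- NA

# Fit on the complete cases; lm() drops rows with NA by default
fit <- lm(speed_air ~ speed_ground, data = toy)

# Fill every missing speed_air with the fitted value from the model
miss <- is.na(toy$speed_air)
toy$speed_air[miss] <- predict(fit, newdata = toy[miss, , drop = FALSE])
```

The vector `miss` plays the role of the `imp` indicator column, and the assignment replaces the row-by-row loop.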
aircraft <- as.numeric(FAA_Normal_temp$ï..aircraft)
cor_aircraft <- cor(FAA_Normal_temp$distance,aircraft)
cor_duration<-cor(FAA_Normal_temp$distance,FAA_Normal_temp$duration)
cor_speedground<-cor(FAA_Normal_temp$distance,FAA_Normal_temp$speed_ground)
cor_pasg<-cor(FAA_Normal_temp$distance,FAA_Normal_temp$no_pasg)
cor_speedair<-cor(FAA_Normal_temp$distance,FAA_Normal_temp$speed_air)
cor_heigt<-cor(FAA_Normal_temp$distance,FAA_Normal_temp$height)
cor_pitch<-cor(FAA_Normal_temp$distance,FAA_Normal_temp$pitch)
Size <- c(cor_aircraft,cor_duration,cor_speedground,cor_pasg,cor_speedair,cor_heigt,cor_pitch)
Direction <-c("positive","negative","positive","negative","positive","positive","positive")
names <- c("aircraft","duration","speed_ground","no_pasg","speed_air","height","pitch")
table_0=data.frame(names,Size,Direction)
table1 <-table_0[order(-abs(table_0$Size)), , drop = FALSE]
table1
## names Size Direction
## 3 speed_ground 0.86624383 positive
## 5 speed_air 0.86513096 positive
## 1 aircraft 0.23814452 positive
## 6 height 0.09941121 positive
## 7 pitch 0.08702846 positive
## 2 duration -0.05026941 negative
## 4 no_pasg -0.01775663 negative
Step 11. Show X-Y scatter plots. Do you think the correlation strength observed in these plots is consistent with the values computed in Step 10?
Yes. The patterns in the scatter plots are consistent with the correlations computed in Step 10.
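The plots themselves are not reproduced in this report. A minimal sketch of drawing one X-Y scatter plot per factor, with the correlation printed in each panel title, on simulated stand-in data (`toy` is hypothetical):

```r
# Simulated stand-in data (hypothetical values)
set.seed(3)
toy <- data.frame(speed_ground = rnorm(150, 80, 19),
                  height       = rnorm(150, 30, 10))
toy$distance <- 40 * toy$speed_ground + 14 * toy$height + rnorm(150, sd = 400)

predictors <- setdiff(names(toy), "distance")
old_par <- par(mfrow = c(1, length(predictors)))
for (v in predictors) {
  r <- cor(toy[[v]], toy$distance)
  plot(toy[[v]], toy$distance, xlab = v, ylab = "distance",
       main = sprintf("r = %.2f", r))        # correlation shown in the panel title
}
par(old_par)
```

Printing r in the title makes it easy to compare each plot with the matching row of Table 1.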
Step 12. Have you included the airplane make as a possible factor in Steps 10-11? You can code this character variable as 0/1.
Yes, the airplane make was included as a factor in Steps 10-11. With the variable coded as 0/1 (airbus = 0, boeing = 1), Steps 10-11 become:
# Shifting the 1/2 coding to 0/1 does not change the correlation
aircraft_01 <- (as.numeric(FAA_Normal_temp$ï..aircraft)-1)
cor_aircraft <- cor(FAA_Normal_temp$distance,aircraft_01)
cor_duration<-cor(FAA_Normal_temp$distance,FAA_Normal_temp$duration,use="complete.obs")
cor_speedground<-cor(FAA_Normal_temp$distance,FAA_Normal_temp$speed_ground,use="complete.obs")
cor_pasg<-cor(FAA_Normal_temp$distance,FAA_Normal_temp$no_pasg,use="complete.obs")
cor_speedair<-cor(FAA_Normal_temp$distance,FAA_Normal_temp$speed_air,use="complete.obs")
cor_heigt<-cor(FAA_Normal_temp$distance,FAA_Normal_temp$height,use="complete.obs")
cor_pitch<-cor(FAA_Normal_temp$distance,FAA_Normal_temp$pitch,use="complete.obs")
Size <- c(cor_aircraft,cor_duration,cor_speedground,cor_pasg,cor_speedair,cor_heigt,cor_pitch)
Direction <-c("positive","negative","positive","negative","positive","positive","positive")
names <- c("aircraft","duration","speed_ground","no_pasg","speed_air","height","pitch")
table_00=data.frame(names,Size,Direction)
table1_1<-table_00[order(-abs(table_00$Size)), , drop = FALSE]
table1_1
## names Size Direction
## 3 speed_ground 0.86624383 positive
## 5 speed_air 0.86513096 positive
## 1 aircraft 0.23814452 positive
## 6 height 0.09941121 positive
## 7 pitch 0.08702846 positive
## 2 duration -0.05026941 negative
## 4 no_pasg -0.01775663 negative
Step 13. Regress Y (landing distance) on each of the X variables. Provide a table that ranks the factors based on its significance. The smaller the p-value, the more significant the factor. This table contains three columns: the names of variables, the size of the p-value, the direction of the regression coefficient (positive or negative). We call it Table 2.
reg1 <- lm(FAA_Normal_temp$distance~FAA_Normal_temp$duration)
Pr1_duration <- summary(reg1)$coefficients[2,4]
reg2 <- lm(FAA_Normal_temp$distance~FAA_Normal_temp$no_pasg)
Pr2_pasg <- summary(reg2)$coefficients[2,4]
reg3 <- lm(FAA_Normal_temp$distance~FAA_Normal_temp$speed_ground)
Pr3_speedground <- summary(reg3)$coefficients[2,4]
reg4 <- lm(FAA_Normal_temp$distance~FAA_Normal_temp$speed_air)
Pr4_speedair <- summary(reg4)$coefficients[2,4]
reg5 <-lm(FAA_Normal_temp$distance~FAA_Normal_temp$height)
Pr5_height <- summary(reg5)$coefficients[2,4]
reg6 <- lm(FAA_Normal_temp$distance~FAA_Normal_temp$pitch)
Pr6_pitch <- summary(reg6)$coefficients[2,4]
reg7 <- lm(FAA_Normal_temp$distance~aircraft)
Pr7_aircraft <- summary(reg7)$coefficients[2,4]
names <- c("duration","no_pasg","speed_ground","speed_air","height","pitch","aircraft")
size_p <- c(Pr1_duration,Pr2_pasg,Pr3_speedground,Pr4_speedair,Pr5_height,Pr6_pitch,Pr7_aircraft)
# In a one-predictor regression the slope has the same sign as the correlation in Table 1,
# so duration and no_pasg are negative
direction_p <- c("negative","negative","positive","positive","positive","positive","positive")
table_000=data.frame(names,size_p,direction_p)
table2<-table_000[order(table_000$size_p), , drop = FALSE]
table2
## names size_p direction_p
## 3 speed_ground 4.766371e-252 positive
## 4 speed_air 1.155925e-250 positive
## 7 aircraft 3.526194e-12 positive
## 5 height 4.123860e-03 positive
## 6 pitch 1.208124e-02 positive
## 1 duration 1.476579e-01 negative
## 2 no_pasg 6.092520e-01 negative
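Typing the direction column by hand is error-prone. A minimal sketch of building a table like Table 2 in a loop, reading both the p-value and the sign from each fitted model. The data frame `toy` and the name `table2_demo` are hypothetical, used only for illustration:

```r
# Simulated stand-in data (hypothetical values)
set.seed(4)
toy <- data.frame(speed_ground = rnorm(200, 80, 19),
                  duration     = rnorm(200, 150, 48))
toy$distance <- 40 * toy$speed_ground - 0.5 * toy$duration + rnorm(200, sd = 400)

predictors <- c("speed_ground", "duration")
rows <- lapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "distance"), data = toy)
  est <- summary(fit)$coefficients[2, ]      # slope row: estimate, SE, t, p-value
  data.frame(names = v,
             size_p = unname(est["Pr(>|t|)"]),
             direction_p = if (est["Estimate"] > 0) "positive" else "negative")
})
table2_demo <- do.call(rbind, rows)
table2_demo[order(table2_demo$size_p), ]     # most significant factor first
```

Taking the direction from the sign of the estimate keeps it consistent with the correlations in Table 1.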
Step 14. Standardize each X variable. In other words, create a new variable X’={X-mean(X)}/sd(X). Regress Y (landing distance) on each of the X’ variables. Provide a table that ranks the factors based on the size of the regression coefficient. The larger the size, the more important the factor. This table contains three columns: the names of variables, the size of the regression coefficient, the direction of the regression coefficient (positive or negative). We call it Table 3.
aircraft_cof <- (aircraft_01-mean(aircraft_01))/sd(aircraft_01)
reg0_0 <- lm(FAA_Normal_temp$distance~aircraft_cof)
Pr0_aircraft_cof <-summary(reg0_0)$coefficients[2,1]
duration1<-(FAA_Normal_temp$duration-mean(FAA_Normal_temp$duration))/sd(FAA_Normal_temp$duration)
reg1_1 <- lm(FAA_Normal_temp$distance~duration1)
Pr1_dur1 <- summary(reg1_1)$coefficients[2,1]
no_pasg1<-(FAA_Normal_temp$no_pasg-mean(FAA_Normal_temp$no_pasg))/sd(FAA_Normal_temp$no_pasg)
reg2_1 <- lm(FAA_Normal_temp$distance~no_pasg1)
Pr2_pasg1 <- summary(reg2_1)$coefficients[2,1]
speed_ground1<-(FAA_Normal_temp$speed_ground-mean(FAA_Normal_temp$speed_ground))/sd(FAA_Normal_temp$speed_ground)
reg3_1 <- lm(FAA_Normal_temp$distance~speed_ground1)
Pr3_spg1 <- summary(reg3_1)$coefficients[2,1]
speed_air1<-(FAA_Normal_temp$speed_air-mean(FAA_Normal_temp$speed_air))/sd(FAA_Normal_temp$speed_air)
reg4_1 <- lm(FAA_Normal_temp$distance~speed_air1)
Pr4_spa1 <- summary(reg4_1)$coefficients[2,1]
height1<-(FAA_Normal_temp$height-mean(FAA_Normal_temp$height))/sd(FAA_Normal_temp$height)
reg5_1 <-lm(FAA_Normal_temp$distance ~height1)
Pr5_hegt1 <- summary(reg5_1)$coefficients[2,1]
pitch1<-(FAA_Normal_temp$pitch-mean(FAA_Normal_temp$pitch))/sd(FAA_Normal_temp$pitch)
reg6_1 <- lm(FAA_Normal_temp$distance~pitch1)
Pr6_pit1 <- summary(reg6_1)$coefficients[2,1]
names_1 <- c("duration1","no_pasg1","speed_ground1","speed_air1","height1","pitch1","aircraft1")
size_1 <- c(Pr1_dur1,Pr2_pasg1,Pr3_spg1,Pr4_spa1,Pr5_hegt1,Pr6_pit1,Pr0_aircraft_cof)
direction_1 <- c("negative","negative","positive","positive","positive","positive","positive")
table_01=data.frame(names_1,size_1,direction_1)
table3<-table_01[order(-abs(table_01$size_1)), , drop = FALSE]
table3
## names_1 size_1 direction_1
## 3 speed_ground1 776.44740 positive
## 4 speed_air1 775.44988 positive
## 7 aircraft1 213.45802 positive
## 5 height1 89.10606 positive
## 6 pitch1 78.00693 positive
## 1 duration1 -45.05839 negative
## 2 no_pasg1 -15.91595 negative
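For a single standardized predictor, the regression slope equals cor(X, Y) * sd(Y). This is why Table 3 ranks the factors in the same order as Table 1: every correlation is multiplied by the same constant, sd(distance). A minimal sketch that checks this identity and shows scale() as a shortcut for the standardization formula, on simulated data (`toy` is hypothetical):

```r
# Simulated stand-in data (hypothetical values)
set.seed(5)
toy <- data.frame(height = rnorm(200, 30, 10))
toy$distance <- 14 * toy$height + rnorm(200, sd = 400)

# scale() centers and divides by sd, the same as (x - mean(x)) / sd(x)
z_manual <- (toy$height - mean(toy$height)) / sd(toy$height)
z_scale  <- as.numeric(scale(toy$height))

# Slope of distance on the standardized predictor
b_std <- unname(coef(lm(toy$distance ~ z_scale))[2])

# The two numbers printed here should agree
c(b_std = b_std, check = cor(toy$height, toy$distance) * sd(toy$distance))
```

In Table 3 the ratio of each coefficient to the matching correlation in Table 1 is about 896 for every factor, which is consistent with this identity.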
Step 15. Compare Tables 1, 2 and 3. Are the results consistent? At this point, you will meet with an FAA agent again. Please provide a single table that ranks all the factors based on their relative importance in determining the landing distance. We call it Table 0.
table1
## names Size Direction
## 3 speed_ground 0.86624383 positive
## 5 speed_air 0.86513096 positive
## 1 aircraft 0.23814452 positive
## 6 height 0.09941121 positive
## 7 pitch 0.08702846 positive
## 2 duration -0.05026941 negative
## 4 no_pasg -0.01775663 negative
table2
## names size_p direction_p
## 3 speed_ground 4.766371e-252 positive
## 4 speed_air 1.155925e-250 positive
## 7 aircraft 3.526194e-12 positive
## 5 height 4.123860e-03 positive
## 6 pitch 1.208124e-02 positive
## 1 duration 1.476579e-01 negative
## 2 no_pasg 6.092520e-01 negative
table3
## names_1 size_1 direction_1
## 3 speed_ground1 776.44740 positive
## 4 speed_air1 775.44988 positive
## 7 aircraft1 213.45802 positive
## 5 height1 89.10606 positive
## 6 pitch1 78.00693 positive
## 1 duration1 -45.05839 negative
## 2 no_pasg1 -15.91595 negative
The three tables are consistent. The ranking of the factors in determining the landing distance is as follows:
table0
## [1] "speed_ground" "speed_air" "aircraft" "height" "pitch"
## [6] "duration" "no_pasg"
Check collinearity
Step 16. Compare the regression coefficients of the three models below: Model 1: LD ~ Speed_ground Model 2: LD ~ Speed_air Model 3: LD ~ Speed_ground + Speed_air Do you observe any significance change and sign change? Check the correlation between Speed_ground and Speed_air. You may want to keep one of them in the model selection. Which one would you pick? Why?
reg3 <- lm(FAA_Normal_temp$distance~FAA_Normal_temp$speed_ground)
summary(reg3)
##
## Call:
## lm(formula = FAA_Normal_temp$distance ~ FAA_Normal_temp$speed_ground)
##
## Residuals:
## Min 1Q Median 3Q Max
## -897.09 -319.16 -72.09 210.83 1798.88
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1773.9407 67.8388 -26.15 <2e-16 ***
## FAA_Normal_temp$speed_ground 41.4422 0.8302 49.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 448.1 on 829 degrees of freedom
## Multiple R-squared: 0.7504, Adjusted R-squared: 0.7501
## F-statistic: 2492 on 1 and 829 DF, p-value: < 2.2e-16
reg4 <- lm(FAA_Normal_temp$distance~FAA_Normal_temp$speed_air)
summary(reg4)
##
## Call:
## lm(formula = FAA_Normal_temp$distance ~ FAA_Normal_temp$speed_air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -903.06 -320.50 -70.28 215.92 1817.21
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1930.1608 71.2487 -27.09 <2e-16 ***
## FAA_Normal_temp$speed_air 42.7893 0.8616 49.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 449.8 on 829 degrees of freedom
## Multiple R-squared: 0.7485, Adjusted R-squared: 0.7481
## F-statistic: 2467 on 1 and 829 DF, p-value: < 2.2e-16
reg34 <-lm(FAA_Normal_temp$distance~FAA_Normal_temp$speed_ground+FAA_Normal_temp$speed_air)
summary(reg34)
##
## Call:
## lm(formula = FAA_Normal_temp$distance ~ FAA_Normal_temp$speed_ground +
## FAA_Normal_temp$speed_air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -896.00 -317.85 -72.66 207.47 1796.14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1741.513 102.261 -17.030 <2e-16 ***
## FAA_Normal_temp$speed_ground 49.644 19.365 2.564 0.0105 *
## FAA_Normal_temp$speed_air -8.488 20.020 -0.424 0.6717
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 448.3 on 828 degrees of freedom
## Multiple R-squared: 0.7504, Adjusted R-squared: 0.7498
## F-statistic: 1245 on 2 and 828 DF, p-value: < 2.2e-16
cor(FAA_Normal_temp$speed_ground,FAA_Normal_temp$speed_air)
## [1] 0.9990797
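In Model 3 (LD ~ Speed_ground + Speed_air), the speed_air coefficient changes sign from positive (42.79 in Model 2) to negative (-8.49) and is no longer significant (p = 0.67). The speed_ground coefficient stays positive, but its standard error grows from 0.83 to 19.4. The two speeds are almost perfectly correlated (r = 0.999), so collinearity is present. I will keep Model 1 (LD ~ Speed_ground) for later analysis: its adjusted R-squared is slightly higher than that of Model 2 (0.7501 vs. 0.7481), and speed_ground has no missing values, whereas 628 speed_air values were imputed.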
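The size of the collinearity can be summarized with a variance inflation factor (VIF). With exactly two predictors, the VIF of either one is 1 / (1 - r^2), where r is their correlation. A minimal sketch using the correlation reported above:

```r
r   <- 0.9990797          # cor(speed_ground, speed_air) reported above
vif <- 1 / (1 - r^2)      # VIF of either predictor in a two-predictor model
vif
sqrt(vif)                 # factor by which the standard error is inflated
```

The result is in the hundreds, far above the value of 10 that is often quoted as a rule-of-thumb warning level. The square root of the VIF is close to the ratio of the speed_ground standard errors in Model 3 and Model 1 (19.365 / 0.8302).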
Step 17. Suppose in Table 0, the variable ranking is as follows: X1, X2, X3….. Please fit the following six models: Model 1: LD ~ X1 Model 2: LD ~ X1 + X2 Model 3: LD ~ X1 + X2 + X3 …. Calculate the R-squared for each model. Plot these R-squared values versus the number of variables p. What patterns do you observe?
## [1] "speed_ground" "speed_air" "aircraft" "height" "pitch"
## [6] "duration" "no_pasg"
speed_air was dropped in Step 16, so six variables remain: “speed_ground”, “aircraft”, “height”, “pitch”, “duration” and “no_pasg”.
model1 <- lm(FAA_Normal_temp$distance~FAA_Normal_temp$speed_ground)
model2 <- lm(FAA_Normal_temp$distance~FAA_Normal_temp$speed_ground+aircraft_01)
model3 <- lm(FAA_Normal_temp$distance~FAA_Normal_temp$speed_ground+aircraft_01+FAA_Normal_temp$height)
model4 <- lm(FAA_Normal_temp$distance~FAA_Normal_temp$speed_ground+aircraft_01+FAA_Normal_temp$height+FAA_Normal_temp$pitch)
model5 <- lm(FAA_Normal_temp$distance~FAA_Normal_temp$speed_ground+aircraft_01+FAA_Normal_temp$height+FAA_Normal_temp$pitch+FAA_Normal_temp$duration)
model6 <- lm(FAA_Normal_temp$distance~FAA_Normal_temp$speed_ground+aircraft_01+FAA_Normal_temp$height+FAA_Normal_temp$pitch+FAA_Normal_temp$duration+FAA_Normal_temp$no_pasg)
names_model<-c("model1","model2","model3","model4","model5","model6")
Rsquared <- c(summary(model1)$r.squared,summary(model2)$r.squared,summary(model3)$r.squared,summary(model4)$r.squared,summary(model5)$r.squared,summary(model6)$r.squared)
no_variable <-c(1,2,3,4,5,6)
table_temp=data.frame(names_model,Rsquared,no_variable)
table_model<-table_temp[order(-table_temp$Rsquared), , drop = FALSE]
table_model
## names_model Rsquared no_variable
## 6 model6 0.8497162 6
## 5 model5 0.8493818 5
## 4 model4 0.8493717 4
## 3 model3 0.8488989 3
## 2 model2 0.8251319 2
## 1 model1 0.7503784 1
From table_model, R-squared increases every time an additional variable is added to the model. The gain is large for the first three variables (from 0.750 to 0.849) and negligible afterwards.
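The step also asks for a plot of R-squared against the number of variables p, which is not recorded here. A minimal sketch of fitting the nested models in a loop and plotting the result, on simulated data (`toy` and the variable names x1 to x3 are hypothetical):

```r
# Simulated stand-in data (hypothetical values); x3 is pure noise
set.seed(6)
n   <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 3 * toy$x1 + 1 * toy$x2 + rnorm(n)

ranked <- c("x1", "x2", "x3")                # stands in for the Table 0 ordering
fits <- lapply(seq_along(ranked), function(p)
  lm(reformulate(ranked[1:p], response = "y"), data = toy))

# R-squared of each nested model, in order of increasing p
r2 <- sapply(fits, function(f) summary(f)$r.squared)

plot(seq_along(ranked), r2, type = "b",
     xlab = "number of variables p", ylab = "R-squared")
```

Replacing `r.squared` with `adj.r.squared` in the sapply() call gives the corresponding plot for adjusted R-squared.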
Step 18. Repeat Step 17 but use adjusted R-squared values instead.
names_model<-c("model1","model2","model3","model4","model5","model6")
adj_Rsquared <- c(summary(model1)$adj.r.squared,summary(model2)$adj.r.squared,summary(model3)$adj.r.squared,summary(model4)$adj.r.squared,summary(model5)$adj.r.squared,summary(model6)$adj.r.squared)
no_variable <-c(1,2,3,4,5,6)
table_temp_adj=data.frame(names_model,adj_Rsquared,no_variable)
table_model_adj<-table_temp_adj[order(-table_temp_adj$adj_Rsquared), , drop = FALSE]
table_model_adj
## names_model adj_Rsquared no_variable
## 4 model4 0.8486423 4
## 6 model6 0.8486219 6
## 5 model5 0.8484689 5
## 3 model3 0.8483508 3
## 2 model2 0.8247095 2
## 1 model1 0.7500773 1
Adjusted R-squared is highest with four variables (model4: speed_ground, aircraft, height and pitch), although models 3 to 6 differ by less than 0.0003.
Step 19. Repeat Step 17 but use AIC values instead.
names_model<-c("model1","model2","model3","model4","model5","model6")
no_variable <-c(1,2,3,4,5,6)
## names_model AIC_VALUE no_variable
## 4 model4 12095.05 4
## 3 model3 12095.65 3
## 5 model5 12096.99 5
## 6 model6 12097.14 6
## 2 model2 12215.05 2
## 1 model1 12508.81 1
Lower AIC is better. model4 has the lowest AIC (12095.05), so we select the four variables speed_ground, aircraft, height and pitch.
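The code that computed the AIC column is not shown above. A minimal sketch that would produce a table with the same column names, assuming AIC() from base R was used; the models `m1` to `m3` and the data frame `toy` are hypothetical stand-ins:

```r
# Simulated stand-in data and nested models (hypothetical values)
set.seed(7)
n   <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 3 * toy$x1 + toy$x2 + rnorm(n)

m1 <- lm(y ~ x1, data = toy)
m2 <- lm(y ~ x1 + x2, data = toy)
m3 <- lm(y ~ x1 + x2 + x3, data = toy)

names_model <- c("m1", "m2", "m3")
AIC_VALUE   <- c(AIC(m1), AIC(m2), AIC(m3))
no_variable <- 1:3
aic_table   <- data.frame(names_model, AIC_VALUE, no_variable)
aic_table[order(aic_table$AIC_VALUE), ]      # lowest AIC first

# AIC() and extractAIC() (used by step()) differ by a constant that depends only on n
aic_gap <- AIC(m2) - extractAIC(m2)[2]
aic_gap
```

For a linear model the gap is n * (1 + log(2 * pi)) + 2. With n = 831 this is about 2360.28, which matches the difference between the AIC of model4 here (12095.05) and the final AIC reported by step() in Step 21 (9734.77).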
Step 20. Compare the results in Steps 18-19, what variables would you select to build a predictive model for LD?
The results of Steps 18-19 are consistent. We choose model4 for prediction, with the four variables speed_ground, aircraft, height and pitch.
Variable selection based on an automated algorithm.
Step 21. Use the R function “StepAIC” to perform forward variable selection. Compare the result with that in Step 19.
modelstart <-lm(FAA_Normal_temp$distance~1)
step(modelstart,direction = "forward",scope = formula(model6))
## Start: AIC=11299.8
## FAA_Normal_temp$distance ~ 1
##
## Df Sum of Sq RSS AIC
## + FAA_Normal_temp$speed_ground 1 500382567 166457762 10148
## + aircraft_01 1 37818390 629021939 11253
## + FAA_Normal_temp$height 1 6590108 660250221 11294
## + FAA_Normal_temp$pitch 1 5050617 661789712 11296
## + FAA_Normal_temp$duration 1 1685114 665155215 11300
## <none> 666840329 11300
## + FAA_Normal_temp$no_pasg 1 210253 666630076 11302
##
## Step: AIC=10148.53
## FAA_Normal_temp$distance ~ FAA_Normal_temp$speed_ground
##
## Df Sum of Sq RSS AIC
## + aircraft_01 1 49848656 116609106 9854.8
## + FAA_Normal_temp$height 1 14916377 151541385 10072.5
## + FAA_Normal_temp$pitch 1 9765095 156692668 10100.3
## <none> 166457762 10148.5
## + FAA_Normal_temp$no_pasg 1 207528 166250234 10149.5
## + FAA_Normal_temp$duration 1 51669 166406094 10150.3
##
## Step: AIC=9854.77
## FAA_Normal_temp$distance ~ FAA_Normal_temp$speed_ground + aircraft_01
##
## Df Sum of Sq RSS AIC
## + FAA_Normal_temp$height 1 15848830 100760276 9735.4
## + FAA_Normal_temp$pitch 1 455453 116153654 9853.5
## <none> 116609106 9854.8
## + FAA_Normal_temp$no_pasg 1 87171 116521935 9856.1
## + FAA_Normal_temp$duration 1 8445 116600661 9856.7
##
## Step: AIC=9735.37
## FAA_Normal_temp$distance ~ FAA_Normal_temp$speed_ground + aircraft_01 +
## FAA_Normal_temp$height
##
## Df Sum of Sq RSS AIC
## + FAA_Normal_temp$pitch 1 315259 100445017 9734.8
## <none> 100760276 9735.4
## + FAA_Normal_temp$no_pasg 1 232003 100528273 9735.5
## + FAA_Normal_temp$duration 1 3976 100756300 9737.3
##
## Step: AIC=9734.77
## FAA_Normal_temp$distance ~ FAA_Normal_temp$speed_ground + aircraft_01 +
## FAA_Normal_temp$height + FAA_Normal_temp$pitch
##
## Df Sum of Sq RSS AIC
## <none> 100445017 9734.8
## + FAA_Normal_temp$no_pasg 1 225608 100219409 9734.9
## + FAA_Normal_temp$duration 1 6696 100438321 9736.7
##
## Call:
## lm(formula = FAA_Normal_temp$distance ~ FAA_Normal_temp$speed_ground +
## aircraft_01 + FAA_Normal_temp$height + FAA_Normal_temp$pitch)
##
## Coefficients:
## (Intercept) FAA_Normal_temp$speed_ground
## -2664.32 42.43
## aircraft_01 FAA_Normal_temp$height
## 481.27 14.09
## FAA_Normal_temp$pitch
## 39.61
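Forward selection, run here with step(), picks speed_ground, aircraft, height and pitch, with a final AIC of 9734.77. step() reports AIC on a different scale from AIC(), so this value is not directly comparable with the 12095.05 in Step 19. The selected variables are consistent with Steps 18-20.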
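The output above is long partly because every term is written as FAA_Normal_temp$... in the formula. A minimal sketch of the same forward selection with a `data` argument and an explicit scope, which gives shorter term names. It uses step() from base R on simulated data (`toy`, x1 to x3 are hypothetical); MASS::stepAIC accepts the same object, scope, direction and trace arguments:

```r
# Simulated stand-in data (hypothetical values); x3 is pure noise
set.seed(8)
n   <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 3 * toy$x1 + toy$x2 + rnorm(n)

# Start from the intercept-only model and let step() add terms one at a time
null_fit <- lm(y ~ 1, data = toy)
selected <- step(null_fit, direction = "forward",
                 scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
                 trace = 0)                  # trace = 0 hides the step-by-step log
formula(selected)
```

Setting trace = 1 restores the step-by-step table in the same layout as the output shown in this step.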