What is Landing Distance and why is it important?
Definition: Landing Distance is the horizontal distance traversed by the aeroplane from a point on the approach path at a selected height above the landing surface to the point on the landing surface at which the aeroplane comes to a complete stop.
(Source: ICAO Annex 8 Part IIIA Paragraph 2.2.3.3. and Part IIIB Sub-part B Paragraph B2.7 e)
Factors Affecting Actual Landing Distance
Variable dictionary:
Required Packages
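The package list itself is not shown in this report. Judging only from the functions called later (read_excel(), the %>% pipe with filter(), arrange() and select(), and vif()), a minimal sketch of the setup could look like this, assuming the packages are readxl, dplyr and car:

```r
# Packages assumed from the functions called later in this report:
#   readxl -> read_excel(); dplyr -> %>%, filter(), arrange(), select(); car -> vif()
pkgs <- c("readxl", "dplyr", "car")

# Check which of them are installed (nothing is installed by this snippet)
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Install first: ", paste(missing_pkgs, collapse = ", "))
}

# Attach the ones that are available
for (p in pkgs[installed]) library(p, character.only = TRUE)
```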
Read the two files ‘FAA-1.xls’ (800 flights) and ‘FAA-2.xls’ into your R system. Please search “Read Excel files from R” in Google in case you do not know how to do that.
STEP 01
# Read the sheets, one by one
FAA1 <- read_excel("FAA1.xls")
FAA2 <- read_excel("FAA2.xls")
dim(FAA1)
## [1] 800 8
dim(FAA2)
## [1] 150 7
STEP 02
Check the structure of each data set using the “str” function. For each data set, what is the sample size and how many variables? Is there any difference between the two data sets?
str(FAA1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 800 obs. of 8 variables:
## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
str(FAA2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 150 obs. of 7 variables:
## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
FAA1: 800 rows and 8 columns (sample size 800).
FAA2: 150 rows and 7 columns (sample size 150).
FAA2 does not have the duration column.
Otherwise FAA1 and FAA2 have the same columns and data types.
To combine the two data sets, both must have the same columns. We therefore add a duration column (filled with NA) to FAA2 and then use rbind, giving 800 + 150 = 950 rows (sample size 950).
Merge the 2 data sets FAA1 & 2 to create a composite dataset
STEP 03
FAA2$duration<-NA
head(FAA2)
## # A tibble: 6 x 8
## aircraft no_pasg speed_ground speed_air height pitch distance duration
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 boeing 53 108. 109. 27.4 4.04 3370. NA
## 2 boeing 69 102. 103. 27.8 4.12 2988. NA
## 3 boeing 61 71.1 NA 18.6 4.43 1145. NA
## 4 boeing 56 85.8 NA 30.7 3.88 1664. NA
## 5 boeing 70 59.9 NA 32.4 4.03 1050. NA
## 6 boeing 55 75.0 NA 41.2 4.20 1627. NA
FAA_merge<-rbind(FAA1,FAA2)
str(FAA_merge)
## Classes 'tbl_df', 'tbl' and 'data.frame': 950 obs. of 8 variables:
## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
dim(FAA_merge)
## [1] 950 8
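The column-alignment step can be sketched on toy data as follows. The data frames a and b are invented stand-ins for FAA1 and FAA2. Using NA_real_ instead of a plain NA is a choice of this sketch: it keeps the new column numeric rather than logical.

```r
# Toy stand-ins for FAA1 (has duration) and FAA2 (no duration); values are made up
a <- data.frame(aircraft = c("boeing", "airbus"), duration = c(98.5, 125.7),
                distance = c(3370, 2988), stringsAsFactors = FALSE)
b <- data.frame(aircraft = "boeing", distance = 1145, stringsAsFactors = FALSE)

# Add every column that b lacks as a numeric NA, then put columns in a's order
for (col in setdiff(names(a), names(b))) b[[col]] <- NA_real_
b <- b[, names(a)]

# Stack the two data frames; rows of b follow rows of a
combined <- rbind(a, b)
dim(combined)
```

Looping over setdiff() means the same code still works if the second file lacks more than one column.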
STEP 04 Check the structure of the combined data set. What is the sample size and how many variables? Provide summary statistics for each variable.
#FAA_merge %>% distinct()
row_dup<-duplicated(FAA_merge[,-2])
class(row_dup)
## [1] "logical"
FAA_U<-FAA_merge[!row_dup,]
dim(FAA_U)
## [1] 850 8
Duplicate check: Because FAA2 has NA in the duration column, we check for duplicate rows on all columns except duration (column 2). The 100 duplicated rows found this way are removed to avoid redundancy, leaving 850 of the 950 rows.
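A minimal sketch of the same duplicate check on invented data. Here the ignored column is selected by name rather than by position, which is an assumption of this sketch and not what the code above does.

```r
# Toy data: rows 1 and 3 describe the same landing, but duration is missing in row 3
flights <- data.frame(
  aircraft = c("boeing", "airbus", "boeing"),
  duration = c(98.5, 125.7, NA),
  distance = c(3370, 2988, 3370),
  stringsAsFactors = FALSE
)

# Compare rows on every column except duration
key_cols <- setdiff(names(flights), "duration")
is_dup <- duplicated(flights[, key_cols])

# duplicated() flags the later copy, so the first occurrence (with duration) is kept
unique_flights <- flights[!is_dup, ]
nrow(unique_flights)
```

Because duplicated() marks only the second and later copies, stacking FAA1 first means the copy that carries a duration value is the one that survives.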
dim(FAA_U)
## [1] 850 8
str(FAA_U)
## Classes 'tbl_df', 'tbl' and 'data.frame': 850 obs. of 8 variables:
## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
summary(FAA_U)
## aircraft duration no_pasg speed_ground
## Length:850 Min. : 14.76 Min. :29.0 Min. : 27.74
## Class :character 1st Qu.:119.49 1st Qu.:55.0 1st Qu.: 65.90
## Mode :character Median :153.95 Median :60.0 Median : 79.64
## Mean :154.01 Mean :60.1 Mean : 79.45
## 3rd Qu.:188.91 3rd Qu.:65.0 3rd Qu.: 92.06
## Max. :305.62 Max. :87.0 Max. :141.22
## NA's :50
## speed_air height pitch distance
## Min. : 90.00 Min. :-3.546 Min. :2.284 Min. : 34.08
## 1st Qu.: 96.25 1st Qu.:23.314 1st Qu.:3.642 1st Qu.: 883.79
## Median :101.15 Median :30.093 Median :4.008 Median :1258.09
## Mean :103.80 Mean :30.144 Mean :4.009 Mean :1526.02
## 3rd Qu.:109.40 3rd Qu.:36.993 3rd Qu.:4.377 3rd Qu.:1936.95
## Max. :141.72 Max. :59.946 Max. :5.927 Max. :6533.05
## NA's :642
Summary statistics: after removing duplicates we have 850 unique rows and 8 columns. The summary above gives the statistics for each column in the data set.
STEP 05
KEY INSIGHTS:
STEP 06
Are there abnormal values in the data set? Please refer to the variable dictionary for criteria defining “normal/abnormal” values. Remove the rows that contain any “abnormal values” and report how many rows you have removed.
FAA_U<-FAA_U[FAA_U$duration>=40,]
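# Use element-wise & here: && tests only a single value, so it does not filter row by row.
# The outputs below predate this fix (the Step 07 summary still shows a speed_ground minimum of 27.74).
FAA_U<-FAA_U[FAA_U$speed_ground>=30 & FAA_U$speed_ground<=140,]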
FAA_U<-FAA_U[FAA_U$height>=6,]
FAA_U<-FAA_U[FAA_U$distance<=6000,]
nrow(FAA_U)
## [1] 833
sum(is.na(FAA_U$duration))
## [1] 50
sum(is.na(FAA_U$speed_ground))
## [1] 50
sum(is.na(FAA_U$height))
## [1] 50
sum(is.na(FAA_U$distance))
## [1] 50
# Remove all rows that have NA in every column
FAA_U_RmvNA<-FAA_U %>% filter(!(is.na(aircraft)&is.na(duration)&is.na(no_pasg)&is.na(speed_ground)&is.na(speed_air)&is.na(height)&is.na(pitch)&is.na(distance)))
#nrow(FAA_U_RmvNA)
# Look for NA values in each column
paste("Number of NA in AIRCRAFT:" , sum(is.na(FAA_U_RmvNA$aircraft)))
## [1] "Number of NA in AIRCRAFT: 0"
paste("Number of NA in DURATION:" , sum(is.na(FAA_U_RmvNA$duration)))
## [1] "Number of NA in DURATION: 0"
paste("Number of NA in NO_PASG:" , sum(is.na(FAA_U_RmvNA$no_pasg)))
## [1] "Number of NA in NO_PASG: 0"
paste("Number of NA in SPEED GROUND:" , sum(is.na(FAA_U_RmvNA$speed_ground)))
## [1] "Number of NA in SPEED GROUND: 0"
paste("Number of NA in SPEED AIR:" , sum(is.na(FAA_U_RmvNA$speed_air)))
## [1] "Number of NA in SPEED AIR: 588"
paste("Number of NA in HEIGHT:" , sum(is.na(FAA_U_RmvNA$height)))
## [1] "Number of NA in HEIGHT: 0"
paste("Number of NA in PITCH:" , sum(is.na(FAA_U_RmvNA$pitch)))
## [1] "Number of NA in PITCH: 0"
paste("Number of NA in DISTANCE:" , sum(is.na(FAA_U_RmvNA$distance)))
## [1] "Number of NA in DISTANCE: 0"
nrow(FAA_U_RmvNA)
## [1] 783
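As the first step in data cleaning, we filter out the rows that do not qualify for analysis. There are two categories of such rows:
* Abnormal values: values outside the ranges given in the variable dictionary (as suggested by SMEs). The four filters above remove them, taking the data from 850 to 833 rows (17 rows removed).
* NA rows: rows where every cell is NA. These 50 rows appear because comparing an NA duration (the rows that came only from FAA2) returns NA, and subsetting with NA yields an all-NA row. Removing them leaves 783 rows.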
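The two pitfalls in this kind of filtering, `&&` versus `&` and NA values in a logical subset, can be illustrated on invented data. Keeping flights whose duration is unknown is an assumption of this sketch, not something the analysis above does.

```r
# Toy data; the limits mirror those used in the filters above
d <- data.frame(duration     = c(98.5, NA, 30, 120),
                speed_ground = c(107.9, 85.8, 90, 145))

# A comparison with NA yields NA, and subsetting with NA returns a row of NAs
naive <- d[d$duration >= 40, ]

# Element-wise & gives one TRUE/FALSE per row; && is meant for single values
speed_ok <- d$speed_ground >= 30 & d$speed_ground <= 140

# Assumption of this sketch: an unknown duration is treated as acceptable
duration_ok <- is.na(d$duration) | d$duration >= 40

clean <- d[speed_ok & duration_ok, ]
nrow(clean)
```

With this approach the rows that lack a duration stay in the data with their other measurements intact, instead of turning into all-NA rows.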
3.1 Summary Statistics:
STEP 07
#Sample Size:783
summary(FAA_U_RmvNA)
## aircraft duration no_pasg speed_ground
## Length:783 Min. : 41.95 Min. :29.00 Min. : 27.74
## Class :character 1st Qu.:119.67 1st Qu.:55.00 1st Qu.: 66.01
## Mode :character Median :154.28 Median :60.00 Median : 79.75
## Mean :154.83 Mean :60.07 Mean : 79.51
## 3rd Qu.:189.75 3rd Qu.:65.00 3rd Qu.: 92.13
## Max. :305.62 Max. :87.00 Max. :132.78
##
## speed_air height pitch distance
## Min. : 90.00 Min. : 6.228 Min. :2.284 Min. : 41.72
## 1st Qu.: 96.15 1st Qu.:23.562 1st Qu.:3.654 1st Qu.: 919.67
## Median :100.89 Median :30.203 Median :4.017 Median :1273.66
## Mean :103.50 Mean :30.438 Mean :4.015 Mean :1540.33
## 3rd Qu.:109.42 3rd Qu.:36.984 3rd Qu.:4.385 3rd Qu.:1960.41
## Max. :132.91 Max. :59.946 Max. :5.927 Max. :5381.96
## NA's :588
Key Insight:
Speed Air: speed_air still has a large number of NA values, 588 of 783 rows (about 75%), which can be a problem during analysis.
Distance: the distance variable spans a wide range, from 41.72 to 5381.96, and its mean (1540.33) lies well above its median (1273.66), which suggests a right-skewed distribution.
3.2 Univariate analysis:
STEP 08
Since you have a small set of variables, you may want to show histograms for all of them.
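# One histogram and one normal QQ plot per numeric column (column 1 is aircraft)
for(i in 2:ncol(FAA_U_RmvNA)){
  # Column i as a plain numeric vector, rounded to 2 decimals
  w<-as.numeric(unlist(round(FAA_U_RmvNA[,i],2)))
  # Histogram on top, QQ plot below
  par(mfrow=c(2,1))
  hist(w,main = paste("Histogram of" , colnames(FAA_U_RmvNA)[i]))
  qqnorm(w,main = paste("QQ plot of" , colnames(FAA_U_RmvNA)[i]))
  qqline(w, col='red')
}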
STEP 09
Initial analysis for identifying important factors that impact the response variable “landing distance”
STEP 10
Compute the pairwise correlation between the landing distance and each factor X. Provide a table that ranks the factors based on the size (absolute value) of the correlation. This table contains three columns: the names of variables, the size of the correlation, the direction of the correlation (positive or negative). We call it Table 1, which will be used for comparison with our analysis later
#Correlation table
FAA_U_RmvNA[,1]<-ifelse(FAA_U_RmvNA[,1]=='boeing',0,1)
#FAA_U_RmvNA$aircraft<-as.factor(FAA_U_RmvNA$aircraft)
str(FAA_U_RmvNA)
## Classes 'tbl_df', 'tbl' and 'data.frame': 783 obs. of 8 variables:
## $ aircraft : num [1:783, 1] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr "aircraft"
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
cor_vec <- vector("numeric")
abs_vec <- vector("character")
for(i in 1:8){
if(i==1){
cor_vec[i]<-cor(FAA_U_RmvNA[,i],FAA_U_RmvNA[,8])
}
if(i==2|i==3|i==4|i==6|i==7|i==8){
cor_vec[i]<-cor(FAA_U_RmvNA[,i],FAA_U_RmvNA[,8])
}
if(i==5){
f<-na.omit(FAA_U_RmvNA)
cor_vec[i] <-round(cor(f[,5],f[,8]),3)
}
abs_vec[i]<-ifelse(sign(cor_vec[i])==-1,"Negative","Positive")
}
df<-data.frame(cbind(noquote(colnames(FAA_U_RmvNA)),noquote(round(cor_vec,3)),noquote(abs_vec)))
colnames(df)[1] <- "Attribute"
colnames(df)[2] <- "Correlation"
colnames(df)[3] <- "Sign"
df
## Attribute Correlation Sign
## 1 aircraft -0.229 Negative
## 2 duration -0.052 Negative
## 3 no_pasg -0.016 Negative
## 4 speed_ground 0.862 Positive
## 5 speed_air 0.943 Positive
## 6 height 0.104 Positive
## 7 pitch 0.068 Positive
## 8 distance 1 Positive
FAA_U_RmvNA$aircraft<-as.factor(FAA_U_RmvNA$aircraft)
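Table 1 is meant to be ranked by the absolute size of the correlation. A compact way to build and rank such a table is sketched below on simulated data. The column names follow the report, but all values are invented, and the use of `use = "complete.obs"` for columns with missing values is a choice of this sketch.

```r
set.seed(1)
n <- 200
# Simulated stand-in data; values are invented
sim <- data.frame(speed_ground = rnorm(n, 80, 19), height = rnorm(n, 30, 10),
                  no_pasg = round(rnorm(n, 60, 7)))
sim$distance <- -1700 + 41 * sim$speed_ground + 14 * sim$height + rnorm(n, 0, 400)
sim$height[1:20] <- NA   # some missing values, as with speed_air

predictors <- setdiff(names(sim), "distance")
# use = "complete.obs" drops the rows where either variable is NA
r <- sapply(predictors, function(v) cor(sim[[v]], sim$distance, use = "complete.obs"))
r <- unname(r)

table1 <- data.frame(Attribute = predictors,
                     Correlation = round(r, 3),
                     Sign = ifelse(r < 0, "Negative", "Positive"),
                     stringsAsFactors = FALSE)
# Rank by absolute size of the correlation, largest first
table1 <- table1[order(abs(table1$Correlation), decreasing = TRUE), ]
table1
```

Excluding the response from the predictor list also avoids the trivial distance-with-distance row.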
STEP 11
Show X-Y scatter plots. Do you think the correlation strength observed in these plots is consistent with the values computed in Step 10?
par(mfrow=c(2,2))
# The formula is y ~ x, so distance is on the y-axis and the factor on the x-axis
plot(FAA_U_RmvNA$distance~FAA_U_RmvNA$aircraft,col="blue",xlab="aircraft",ylab="distance")
plot(FAA_U_RmvNA$distance~FAA_U_RmvNA$duration,col="blue",xlab="duration",ylab="distance")
plot(FAA_U_RmvNA$distance~FAA_U_RmvNA$no_pasg,col="blue",xlab="no_pasg",ylab="distance")
plot(FAA_U_RmvNA$distance~FAA_U_RmvNA$speed_ground,col="blue",xlab="speed_ground",ylab="distance")
par(mfrow=c(2,2))
plot(FAA_U_RmvNA$distance~FAA_U_RmvNA$speed_air,col="blue",xlab="speed_air",ylab="distance")
plot(FAA_U_RmvNA$distance~FAA_U_RmvNA$height,col="blue",xlab="height",ylab="distance")
plot(FAA_U_RmvNA$distance~FAA_U_RmvNA$pitch,col="blue",xlab="pitch",ylab="distance")
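* Key Insight: Speed ground and speed air show a strong positive relationship with landing distance, consistent with their correlations in Table 1 (0.862 and 0.943). The other variables show no clear pattern in the plots, in line with their weak correlations.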
STEP 12
Have you included the airplane make as a possible factor in Steps 10-11? You can code this character variable as 0/1.
Yes, aircraft type is included in the analysis by coding its categories as 0 and 1: boeing = 0, airbus = 1.
STEP 13
Regress Y (landing distance) on each of the X variables. Provide a table that ranks the factors based on its significance. The smaller the p-value, the more significant the factor. This table contains three columns: the names of variables, the size of the p-value, the direction of the regression coefficient (positive or negative). We call it Table 2.
Using a single factor each time:
fit1<-lm(distance~aircraft,FAA_U_RmvNA)
fit2<-lm(distance~duration,FAA_U_RmvNA)
fit3<-lm(distance~no_pasg,FAA_U_RmvNA)
fit4<-lm(distance~speed_ground,FAA_U_RmvNA)
fit5<-lm(distance~speed_air,FAA_U_RmvNA)
fit6<-lm(distance~height,FAA_U_RmvNA)
fit7<-lm(distance~pitch,FAA_U_RmvNA)
summary(fit1)
##
## Call:
## lm(formula = distance ~ aircraft, data = FAA_U_RmvNA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1293.4 -636.3 -233.0 390.5 3633.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1748.15 44.63 39.170 < 2e-16 ***
## aircraft1 -413.00 62.92 -6.564 9.52e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 880.2 on 781 degrees of freedom
## Multiple R-squared: 0.05229, Adjusted R-squared: 0.05108
## F-statistic: 43.09 on 1 and 781 DF, p-value: 9.522e-11
summary(fit2)#duration
##
## Call:
## lm(formula = distance ~ duration, data = FAA_U_RmvNA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1463.7 -614.1 -273.7 408.9 3848.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1690.9392 108.3535 15.606 <2e-16 ***
## duration -0.9727 0.6681 -1.456 0.146
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 903 on 781 degrees of freedom
## Multiple R-squared: 0.002707, Adjusted R-squared: 0.00143
## F-statistic: 2.12 on 1 and 781 DF, p-value: 0.1458
summary(fit3)
##
## Call:
## lm(formula = distance ~ no_pasg, data = FAA_U_RmvNA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1465.5 -629.0 -263.6 411.8 3865.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1657.903 259.783 6.382 3e-10 ***
## no_pasg -1.957 4.291 -0.456 0.648
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 904.1 on 781 degrees of freedom
## Multiple R-squared: 0.0002663, Adjusted R-squared: -0.001014
## F-statistic: 0.208 on 1 and 781 DF, p-value: 0.6484
summary(fit4)#speed ground
##
## Call:
## lm(formula = distance ~ speed_ground, data = FAA_U_RmvNA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -918.86 -322.87 -75.58 209.68 1900.61
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1711.1215 70.3255 -24.33 <2e-16 ***
## speed_ground 40.8941 0.8602 47.54 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 458.2 on 781 degrees of freedom
## Multiple R-squared: 0.7432, Adjusted R-squared: 0.7429
## F-statistic: 2260 on 1 and 781 DF, p-value: < 2.2e-16
summary(fit5)#speed air
##
## Call:
## lm(formula = distance ~ speed_air, data = FAA_U_RmvNA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -783.22 -189.61 2.73 215.76 623.27
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5417.607 208.860 -25.94 <2e-16 ***
## speed_air 79.244 2.009 39.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.4 on 193 degrees of freedom
## (588 observations deleted due to missingness)
## Multiple R-squared: 0.8897, Adjusted R-squared: 0.8891
## F-statistic: 1556 on 1 and 193 DF, p-value: < 2.2e-16
summary(fit6)#height
##
## Call:
## lm(formula = distance ~ height, data = FAA_U_RmvNA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1347.5 -612.0 -248.3 410.4 3921.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1245.566 105.578 11.798 < 2e-16 ***
## height 9.684 3.304 2.931 0.00348 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 899.3 on 781 degrees of freedom
## Multiple R-squared: 0.01088, Adjusted R-squared: 0.009614
## F-statistic: 8.591 on 1 and 781 DF, p-value: 0.003477
summary(fit7)
##
## Call:
## lm(formula = distance ~ pitch, data = FAA_U_RmvNA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1355.6 -646.9 -252.4 403.1 3826.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1068.2 250.2 4.269 2.2e-05 ***
## pitch 117.6 61.8 1.903 0.0574 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 902.1 on 781 degrees of freedom
## Multiple R-squared: 0.004615, Adjusted R-squared: 0.003341
## F-statistic: 3.621 on 1 and 781 DF, p-value: 0.05742
inte <- vector("numeric")
pvalue <- vector("numeric")
coeff<-vector("numeric")
icoeff<-vector("numeric")
#Intercept:
inte[1]<-summary(fit1)$coefficients[1,4]
inte[2]<-summary(fit2)$coefficients[1,4]
inte[3]<-summary(fit3)$coefficients[1,4]
inte[4]<-summary(fit4)$coefficients[1,4]
inte[5]<-summary(fit5)$coefficients[1,4]
inte[6]<-summary(fit6)$coefficients[1,4]
inte[7]<-summary(fit7)$coefficients[1,4]
pvalue[1]<-summary(fit1)$coefficients[2,4]
pvalue[2]<-summary(fit2)$coefficients[2,4]
pvalue[3]<-summary(fit3)$coefficients[2,4]
pvalue[4]<-summary(fit4)$coefficients[2,4]
pvalue[5]<-summary(fit5)$coefficients[2,4]
pvalue[6]<-summary(fit6)$coefficients[2,4]
pvalue[7]<-summary(fit7)$coefficients[2,4]
coeff[1]<-summary(fit1)$coefficients[2,1]
coeff[2]<-summary(fit2)$coefficients[2,1]
coeff[3]<-summary(fit3)$coefficients[2,1]
coeff[4]<-summary(fit4)$coefficients[2,1]
coeff[5]<-summary(fit5)$coefficients[2,1]
coeff[6]<-summary(fit6)$coefficients[2,1]
coeff[7]<-summary(fit7)$coefficients[2,1]
icoeff[1]<-summary(fit1)$coefficients[1,1]
icoeff[2]<-summary(fit2)$coefficients[1,1]
icoeff[3]<-summary(fit3)$coefficients[1,1]
icoeff[4]<-summary(fit4)$coefficients[1,1]
icoeff[5]<-summary(fit5)$coefficients[1,1]
icoeff[6]<-summary(fit6)$coefficients[1,1]
icoeff[7]<-summary(fit7)$coefficients[1,1]
length(colnames(FAA_U_RmvNA)[1:7])
## [1] 7
length(inte)
## [1] 7
length(pvalue)
## [1] 7
length(coeff)
## [1] 7
length(icoeff)
## [1] 7
sign_rc<-vector("character")
sign_rc<-ifelse(sign(coeff)==-1,"Negative","Positive")
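# Round the p-values to 5 decimals (not to whole numbers) before comparing with the display threshold
df2<-data.frame(colnames(FAA_U_RmvNA)[1:7],ifelse(round(inte,5)<.0001,'<.0001',round(inte,5)),ifelse(round(pvalue,5)<0.0001,'<.0001',round(pvalue,5)),icoeff,round(coeff,2),sign_rc)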
colnames(df2)[1] <- "Attribute"
colnames(df2)[2] <- "Beta0Pvl"
colnames(df2)[3] <- "Beta1Pvl"
colnames(df2)[4] <- "Beta0"
colnames(df2)[5] <- "Beta1"
colnames(df2)[6] <- "Sign_rc"
arrange(df2,Beta1Pvl)
## Attribute Beta0Pvl Beta1Pvl Beta0 Beta1 Sign_rc
## 1 aircraft <.0001 <.0001 1748.152 -413.00 Negative
## 2 speed_ground <.0001 <.0001 -1711.121 40.89 Positive
## 3 speed_air <.0001 <.0001 -5417.607 79.24 Positive
## 4 height <.0001 0.00348 1245.566 9.68 Positive
## 5 pitch <.0001 0.05742 1068.187 117.59 Positive
## 6 duration <.0001 0.14579 1690.939 -0.97 Negative
## 7 no_pasg <.0001 0.64844 1657.903 -1.96 Negative
Insight: Based on the p-values, aircraft, speed_ground, speed_air and height are significant predictors of landing distance at the 5% level, and pitch is borderline (p = 0.057). This adds to the correlation analysis, in which only speed_ground and speed_air stood out as strongly associated with landing distance.
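The per-variable regressions and the p-value table can also be produced in a loop. The sketch below uses simulated data with invented values. Sorting on the numeric p-value before formatting it for display is a design choice of this sketch.

```r
set.seed(2)
n <- 150
# Simulated stand-in data; only speed_ground truly drives distance here
sim <- data.frame(speed_ground = rnorm(n, 80, 19), no_pasg = rnorm(n, 60, 7))
sim$distance <- -1700 + 41 * sim$speed_ground + rnorm(n, 0, 450)

predictors <- c("speed_ground", "no_pasg")
rows <- lapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "distance"), data = sim)
  est <- summary(fit)$coefficients   # row 1: intercept, row 2: slope
  data.frame(Attribute = v,
             Beta1 = est[2, "Estimate"],
             Beta1Pvl = est[2, "Pr(>|t|)"],
             Sign = ifelse(est[2, "Estimate"] < 0, "Negative", "Positive"),
             stringsAsFactors = FALSE)
})
table2 <- do.call(rbind, rows)

# Sort on the numeric p-value first, then format it for display
table2 <- table2[order(table2$Beta1Pvl), ]
table2$Beta1Pvl <- format.pval(table2$Beta1Pvl, digits = 3, eps = 1e-4)
table2
```

Sorting before formatting avoids ordering a character column, where the result depends on how strings such as '<.0001' happen to compare with digits.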
STEP 14
Standardize each X variable. In other words, create a new variable X’= {X-mean(X)}/sd(X). The mean of X’ is 0 and its standard deviation is 1. Regress Y (landing distance) on each of the X’ variables. Provide a table that ranks the factors based on the size of the regression coefficient. The larger the size, the more important the factor. This table contains three columns: the names of variables, the size of the regression coefficient, the direction of the regression coefficient (positive or negative). We call it Table 3.
#create a new dataframe
FAA_U_RmvNA_SD<-FAA_U_RmvNA
#scale function to standardize variables
#aircraft is not scaled for now
#(FAA_U_RmvNA_SD$aircraft<-round(scale(FAA_U_RmvNA$aircraft),3))
FAA_U_RmvNA_SD$duration<-round(scale(FAA_U_RmvNA$duration),3)
FAA_U_RmvNA_SD$no_pasg<-round(scale(FAA_U_RmvNA$no_pasg),3)
FAA_U_RmvNA_SD$speed_ground<-round(scale(FAA_U_RmvNA$speed_ground),3)
FAA_U_RmvNA_SD$speed_air<-round(scale(FAA_U_RmvNA$speed_air),3)
FAA_U_RmvNA_SD$height<-round(scale(FAA_U_RmvNA$height),3)
FAA_U_RmvNA_SD$pitch<-round(scale(FAA_U_RmvNA$pitch),3)
#(FAA_U_RmvNA_SD$distance<-round(scale(FAA_U_RmvNA$distance),3))
fits1<-lm(distance~aircraft,FAA_U_RmvNA_SD)#not scaled
fits2<-lm(distance~duration,FAA_U_RmvNA_SD)
fits3<-lm(distance~no_pasg,FAA_U_RmvNA_SD)
fits4<-lm(distance~speed_ground,FAA_U_RmvNA_SD)
fits5<-lm(distance~speed_air,FAA_U_RmvNA_SD)
fits6<-lm(distance~height,FAA_U_RmvNA_SD)
fits7<-lm(distance~pitch,FAA_U_RmvNA_SD)
intes <- vector("numeric")
pvalues <- vector("numeric")
coeffs<-vector("numeric")
icoeffs<-vector("numeric")
#Intercept:
intes[1]<-summary(fits1)$coefficients[1,4]
intes[2]<-summary(fits2)$coefficients[1,4]
intes[3]<-summary(fits3)$coefficients[1,4]
intes[4]<-summary(fits4)$coefficients[1,4]
intes[5]<-summary(fits5)$coefficients[1,4]
intes[6]<-summary(fits6)$coefficients[1,4]
intes[7]<-summary(fits7)$coefficients[1,4]
pvalues[1]<-summary(fits1)$coefficients[2,4]
pvalues[2]<-summary(fits2)$coefficients[2,4]
pvalues[3]<-summary(fits3)$coefficients[2,4]
pvalues[4]<-summary(fits4)$coefficients[2,4]
pvalues[5]<-summary(fits5)$coefficients[2,4]
pvalues[6]<-summary(fits6)$coefficients[2,4]
pvalues[7]<-summary(fits7)$coefficients[2,4]
coeffs[1]<-summary(fits1)$coefficients[2,1]
coeffs[2]<-summary(fits2)$coefficients[2,1]
coeffs[3]<-summary(fits3)$coefficients[2,1]
coeffs[4]<-summary(fits4)$coefficients[2,1]
coeffs[5]<-summary(fits5)$coefficients[2,1]
coeffs[6]<-summary(fits6)$coefficients[2,1]
coeffs[7]<-summary(fits7)$coefficients[2,1]
icoeffs[1]<-summary(fits1)$coefficients[1,1]
icoeffs[2]<-summary(fits2)$coefficients[1,1]
icoeffs[3]<-summary(fits3)$coefficients[1,1]
icoeffs[4]<-summary(fits4)$coefficients[1,1]
icoeffs[5]<-summary(fits5)$coefficients[1,1]
icoeffs[6]<-summary(fits6)$coefficients[1,1]
icoeffs[7]<-summary(fits7)$coefficients[1,1]
sign_rcs<-vector("character")
sign_rcs<-ifelse(sign(coeffs)==-1,"Negative","Positive")
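# Round the p-values to 5 decimals (not to whole numbers) before comparing with the display threshold
df3<-data.frame(colnames(FAA_U_RmvNA)[1:7],ifelse(round(intes,5)<.0001,'<.0001',round(intes,5)),ifelse(round(pvalues,5)<0.0001,'<.0001',round(pvalues,5)),icoeffs,round(coeffs,2),sign_rcs)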
colnames(df3)[1] <- "Attribute"
colnames(df3)[2] <- "Beta0Pvl"
colnames(df3)[3] <- "Beta1Pvl"
colnames(df3)[4] <- "Beta0"
colnames(df3)[5] <- "Beta1"
colnames(df3)[6] <- "Sign_Beta1"
arrange(df3,desc(Beta1))
## Attribute Beta0Pvl Beta1Pvl Beta0 Beta1 Sign_Beta1
## 1 speed_air <.0001 <.0001 2784.512 782.94 Positive
## 2 speed_ground <.0001 <.0001 1540.318 779.00 Positive
## 3 height <.0001 0.00348 1540.335 94.24 Positive
## 4 pitch <.0001 0.05743 1540.333 61.38 Positive
## 5 no_pasg <.0001 0.64815 1540.333 -14.76 Negative
## 6 duration <.0001 0.14569 1540.334 -47.03 Negative
## 7 aircraft <.0001 <.0001 1748.152 -413.00 Negative
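Insight:
* In the standardized analysis, speed_air, speed_ground and height are significant, and pitch is borderline (p = 0.057).
* This is in line with our previous analysis of the unscaled variables: standardizing X rescales each slope by sd(X) but leaves its sign and significance unchanged. The small differences in p-values come from rounding the scaled values to 3 decimals.
* After scaling, every coefficient is expressed per one standard deviation of X, so the sizes are directly comparable. Ranked by absolute size: speed_air (782.94), speed_ground (779.00), aircraft (413.00, an unscaled 0/1 variable), height (94.24), pitch (61.38), duration (47.03), no_pasg (14.76).
* Because pitch is borderline, we treat it as insignificant in this analysis. A deeper analysis with more data would help settle its significance for landing distance.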
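The effect of standardization on a slope can be checked directly: the standardized slope equals the raw slope times sd(X). In the tables above, for example, the height slope goes from 9.68 to 94.24, a factor of about 9.7. A minimal sketch on simulated data (invented values):

```r
set.seed(3)
n <- 120
sim <- data.frame(height = rnorm(n, 30, 10))
sim$distance <- 1200 + 10 * sim$height + rnorm(n, 0, 300)

# X' = (X - mean(X)) / sd(X); as.numeric() drops the matrix attributes scale() adds
sim$height_std <- as.numeric(scale(sim$height))
c(mean = mean(sim$height_std), sd = sd(sim$height_std))

raw_slope <- coef(lm(distance ~ height, data = sim))["height"]
std_slope <- coef(lm(distance ~ height_std, data = sim))["height_std"]

# The standardized slope equals the raw slope times sd(X)
all.equal(unname(std_slope), unname(raw_slope) * sd(sim$height))
```

Wrapping scale() in as.numeric() keeps the column a plain vector, which avoids matrix-valued columns in the data frame.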
Analysis table
STEP 15
Compare Tables 1, 2, 3. Are the results consistent? At this point, you will meet with an FAA agent again. Please provide a single table that ranks all the factors based on their relative importance in determining the landing distance. We call it Table 0.
#Table 0
#arrange by p value for significance
df4<-arrange(df3,Beta1Pvl)
#TABLE 0 :
#select coefficients for intercept and slope
df0<-select(df4,Attribute,Beta0,Beta1)
STEP 16
Compare the regression coefficients of the three models below:
Model 1: LD ~ Speed_ground
Model 2: LD ~ Speed_air
Model 3: LD ~ Speed_ground + Speed_air
model1<-lm(FAA_U_RmvNA$distance~FAA_U_RmvNA$speed_ground)
model2<-lm(FAA_U_RmvNA$distance~FAA_U_RmvNA$speed_air)
model3<-lm(FAA_U_RmvNA$distance~FAA_U_RmvNA$speed_ground+FAA_U_RmvNA$speed_air)#speed_ground turns insignificant in this model
summary(model1)
##
## Call:
## lm(formula = FAA_U_RmvNA$distance ~ FAA_U_RmvNA$speed_ground)
##
## Residuals:
## Min 1Q Median 3Q Max
## -918.86 -322.87 -75.58 209.68 1900.61
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1711.1215 70.3255 -24.33 <2e-16 ***
## FAA_U_RmvNA$speed_ground 40.8941 0.8602 47.54 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 458.2 on 781 degrees of freedom
## Multiple R-squared: 0.7432, Adjusted R-squared: 0.7429
## F-statistic: 2260 on 1 and 781 DF, p-value: < 2.2e-16
summary(model2)
##
## Call:
## lm(formula = FAA_U_RmvNA$distance ~ FAA_U_RmvNA$speed_air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -783.22 -189.61 2.73 215.76 623.27
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5417.607 208.860 -25.94 <2e-16 ***
## FAA_U_RmvNA$speed_air 79.244 2.009 39.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.4 on 193 degrees of freedom
## (588 observations deleted due to missingness)
## Multiple R-squared: 0.8897, Adjusted R-squared: 0.8891
## F-statistic: 1556 on 1 and 193 DF, p-value: < 2.2e-16
summary(model3)
##
## Call:
## lm(formula = FAA_U_RmvNA$distance ~ FAA_U_RmvNA$speed_ground +
## FAA_U_RmvNA$speed_air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -820.6 -182.0 7.7 204.2 633.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5425.49 209.08 -25.950 < 2e-16 ***
## FAA_U_RmvNA$speed_ground -12.32 12.98 -0.949 0.344
## FAA_U_RmvNA$speed_air 91.63 13.20 6.941 5.82e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.5 on 192 degrees of freedom
## (588 observations deleted due to missingness)
## Multiple R-squared: 0.8902, Adjusted R-squared: 0.889
## F-statistic: 778.1 on 2 and 192 DF, p-value: < 2.2e-16
#collinearity check between speed_ground and speed_air
vif(model3)
## FAA_U_RmvNA$speed_ground FAA_U_RmvNA$speed_air
## 43.1606 43.1606
#keep only the rows without any NA (the 195 rows where speed_air is recorded)
faa<-na.omit(FAA_U_RmvNA)
#correlation analysis
cor(faa$speed_ground,faa$speed_air)#about 0.99
## [1] 0.9883475
cor(FAA_U_RmvNA$speed_ground,FAA_U_RmvNA$distance)#about 0.86
## [1] 0.8620846
cor(faa$distance,faa$speed_air)#about 0.94
## [1] 0.943219
Model_lm_sa <- lm(distance ~ aircraft+speed_ground+speed_air+height, data=FAA_U_RmvNA)
summary(Model_lm_sa)#speed_ground is insignificant in this model
##
## Call:
## lm(formula = distance ~ aircraft + speed_ground + speed_air +
## height, data = FAA_U_RmvNA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -288.21 -94.55 12.00 85.48 336.21
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5955.217 109.297 -54.486 <2e-16 ***
## aircraft1 -433.427 19.806 -21.884 <2e-16 ***
## speed_ground -3.791 6.327 -0.599 0.55
## speed_air 85.846 6.430 13.351 <2e-16 ***
## height 13.750 1.036 13.268 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 134.5 on 190 degrees of freedom
## (588 observations deleted due to missingness)
## Multiple R-squared: 0.9743, Adjusted R-squared: 0.9737
## F-statistic: 1800 on 4 and 190 DF, p-value: < 2.2e-16
Model_lm <- lm(distance ~ aircraft+speed_ground+height, data=FAA_U_RmvNA)
summary(Model_lm)
##
## Call:
## lm(formula = distance ~ aircraft + speed_ground + height, data = FAA_U_RmvNA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -718.73 -228.46 -93.75 124.88 1785.42
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1965.076 70.548 -27.85 <2e-16 ***
## aircraft1 -503.901 25.791 -19.54 <2e-16 ***
## speed_ground 41.941 0.678 61.86 <2e-16 ***
## height 13.938 1.325 10.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 360.2 on 779 degrees of freedom
## Multiple R-squared: 0.8417, Adjusted R-squared: 0.8411
## F-statistic: 1380 on 3 and 779 DF, p-value: < 2.2e-16
Model_lm_sg <- lm(distance ~ aircraft+speed_air+height, data=FAA_U_RmvNA)
summary(Model_lm_sg)
##
## Call:
## lm(formula = distance ~ aircraft + speed_air + height, data = FAA_U_RmvNA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -293.22 -93.83 15.35 90.05 332.84
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5954.3835 109.1050 -54.58 <2e-16 ***
## aircraft1 -433.7406 19.7657 -21.94 <2e-16 ***
## speed_air 82.0393 0.9827 83.48 <2e-16 ***
## height 13.7913 1.0324 13.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 134.3 on 191 degrees of freedom
## (588 observations deleted due to missingness)
## Multiple R-squared: 0.9742, Adjusted R-squared: 0.9738
## F-statistic: 2408 on 3 and 191 DF, p-value: < 2.2e-16
Insight:
* In the single-predictor models, the coefficient of speed_ground (40.89) is approximately half that of speed_air (79.24).
* When both predictors are used together, the coefficient of speed_ground becomes negative (-12.32) and insignificant (p = 0.344), while that of speed_air stays positive (91.63).
* vif(): both speed_ground and speed_air have a VIF of about 43, which indicates severe multicollinearity.
* cor(): the two are also very strongly correlated (about 0.99).
* In the model with predictors aircraft, speed_ground, speed_air and height, speed_ground is insignificant (p = 0.55).
* Removing speed_ground from that model leaves the adjusted R2 essentially unchanged (0.9737 to 0.9738). So we can drop speed_ground from the analysis on two grounds: 1) it is almost perfectly correlated with speed_air and adds little information beyond it; 2) its p-value is insignificant in the full model.
* Note that speed_air is missing for 588 of the 783 flights, so every model that contains speed_air is fit on only 195 observations.
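With exactly two predictors, the variance inflation factor is VIF = 1 / (1 - r^2), where r is the correlation between them. This links the reported correlation to the reported VIF. A short worked check, followed by the same identity on simulated data (invented values) without the car package:

```r
# Correlation between speed_ground and speed_air, copied from the output above
r_reported <- 0.9883475
vif_from_r <- 1 / (1 - r_reported^2)
round(vif_from_r, 2)   # close to the value returned by vif(model3)

# Same identity on simulated data: regress one predictor on the other
set.seed(4)
x1 <- rnorm(100, 100, 10)
x2 <- x1 + rnorm(100, 0, 1.5)               # x2 is nearly a copy of x1
r2_aux <- summary(lm(x1 ~ x2))$r.squared    # R^2 of the auxiliary regression
vif_manual <- 1 / (1 - r2_aux)
all.equal(vif_manual, 1 / (1 - cor(x1, x2)^2))
```

A VIF this large means the standard error of each speed coefficient is inflated by a factor of roughly sqrt(43), about 6.6, relative to the uncorrelated case.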
based on significance test
Suppose in Table 0, the variable ranking is as follows: X1, X2, X3… Please fit the following six models:
Model 1: LD ~ X1
Model 2: LD ~ X1 + X2
Model 3: LD ~ X1 + X2 + X3
Calculate the R-squared for each model. Plot these R-squared values versus the number of variables p. What patterns do you observe?
Step 18. Repeat Step 17 but use adjusted R-squared values instead.
Step 19. Repeat Step 17 but use AIC values instead.
Step 20. Compare the results in Steps 18-19, what variables would you select to build a predictive model for LD?
STEP 17-20
ModelCheck1<-lm(distance ~ aircraft, data=FAA_U_RmvNA)
ModelCheck2<-lm(distance ~ aircraft+speed_ground, data=FAA_U_RmvNA)
ModelCheck3<-lm(distance ~ aircraft+speed_ground+speed_air, data=FAA_U_RmvNA)
ModelCheck4<-lm(distance ~ aircraft+speed_ground+speed_air+height, data=FAA_U_RmvNA)
ModelCheck5<-lm(distance ~ aircraft+speed_ground+speed_air+height+pitch, data=FAA_U_RmvNA)
ModelCheck6<-lm(distance ~ aircraft+speed_ground+speed_air+height+pitch+duration, data=FAA_U_RmvNA)
ModelCheck7<-lm(distance ~ aircraft+speed_ground+speed_air+height+pitch+duration+no_pasg, data=FAA_U_RmvNA)
r2<-vector("numeric")
r2[1]<-summary(ModelCheck1)$r.squared
r2[2]<-summary(ModelCheck2)$r.squared
r2[3]<-summary(ModelCheck3)$r.squared
r2[4]<-summary(ModelCheck4)$r.squared
r2[5]<-summary(ModelCheck5)$r.squared
r2[6]<-summary(ModelCheck6)$r.squared
r2[7]<-summary(ModelCheck7)$r.squared
round(r2,3)
## [1] 0.052 0.819 0.950 0.974 0.974 0.974 0.975
nv<-c(1,2,3,4,5,6,7)
ar2<-vector("numeric")
ar2[1]<-summary(ModelCheck1)$adj.r.squared
ar2[2]<-summary(ModelCheck2)$adj.r.squared
ar2[3]<-summary(ModelCheck3)$adj.r.squared
ar2[4]<-summary(ModelCheck4)$adj.r.squared
ar2[5]<-summary(ModelCheck5)$adj.r.squared
ar2[6]<-summary(ModelCheck6)$adj.r.squared
ar2[7]<-summary(ModelCheck7)$adj.r.squared
aic<-vector("numeric")
aic[1]<-AIC(ModelCheck1)
aic[2]<-AIC(ModelCheck2)
aic[3]<-AIC(ModelCheck3)
aic[4]<-AIC(ModelCheck4)
aic[5]<-AIC(ModelCheck5)
aic[6]<-AIC(ModelCheck6)
aic[7]<-AIC(ModelCheck7)
paste(round(r2,5)*100,'%')
## [1] "5.229 %" "81.919 %" "95.046 %" "97.429 %" "97.436 %" "97.443 %"
## [7] "97.471 %"
paste(round(ar2,5)*100,'%')
## [1] "5.108 %" "81.873 %" "94.968 %" "97.375 %" "97.368 %" "97.362 %"
## [7] "97.376 %"
paste(round(aic,3))
## [1] "12843.84" "11548.698" "2597.804" "2471.938" "2473.383" "2474.836"
## [7] "2474.692"
par(mfrow=c(1,3))
# Number of predictors on the x-axis, metric on the y-axis
plot(nv,r2,col="blue",xlab="No. of predictors",ylab="R2",main="R2 versus number of predictors",pch = 19, cex = 1)
plot(nv,ar2,col="blue",xlab="No. of predictors",ylab="Adj R2",main="Adjusted R2 versus number of predictors",pch = 19, cex = 1)
plot(nv,aic,col="blue",xlab="No. of predictors",ylab="AIC",main="AIC versus number of predictors",pch = 19, cex = 1)
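The three metrics can also be collected in one pass instead of element by element. The following is a minimal sketch, assuming the built-in mtcars data as a stand-in for the FAA data and an arbitrary predictor order; the names predictors, models and metrics are illustrative and do not appear in the analysis above.

```r
# mtcars is a built-in stand-in for the FAA data; the predictor order is arbitrary
predictors <- c("wt", "hp", "qsec", "drat")

# Fit the nested models: mpg ~ wt, mpg ~ wt + hp, and so on
models <- lapply(seq_along(predictors), function(p) {
  lm(reformulate(predictors[1:p], response = "mpg"), data = mtcars)
})

# One row per model: number of predictors, R2, adjusted R2 and AIC
metrics <- data.frame(
  p = seq_along(predictors),
  r2 = sapply(models, function(m) summary(m)$r.squared),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  aic = sapply(models, AIC)
)
print(metrics)
```

Keeping the results in a data frame makes it easy to plot any metric against p and to see where adjusted R2 peaks or AIC bottoms out.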
Insight:
R-squared never decreases as predictors are added; it rises steeply up to model 4 and then flattens, which gives the curve seen in the graph.
Adjusted R-squared follows the same curve up to model 4, where all relevant and significant predictors are included. After that, adding predictors barely changes it, and it even dips slightly for models 5 and 6 (97.368% and 97.362%, versus 97.375% for model 4).
AIC: a lower AIC signifies a better model. AIC falls sharply as the first predictors are added and reaches its minimum at model 4 (2471.938); models 5 to 7 have slightly higher AIC. Note that the large jump between model 2 and model 3, in both R-squared and AIC, is not only due to the added predictor: lm() drops the rows where speed_air is missing, so models 3 to 7 are fitted on far fewer observations than models 1 and 2, and these values are only strictly comparable between models fitted on the same rows. With that caveat, this understanding is in line with our analysis.
Conclusion: After our analysis using correlation, R2, adjusted R2 and AIC on the FAA data, we conclude that the selected predictors are:
aircraft , speed_ground , speed_air , height
We should also mention the interim analysis that we did on speed ground and speed air, which found the following:
In a model with the predictors aircraft, speed_ground, speed_air and height, speed ground is insignificant. Removing speed ground from the model improves adjusted R2. So we can remove speed ground from the analysis on two grounds: 1) it is almost perfectly correlated with speed air, so it adds little information about landing distance beyond what speed air already provides; 2) its p-value is insignificant in the full model.
Using speed air as the predictor is a judgment call; because speed air has many missing values, using speed ground instead is also acceptable.
Hence, our final model is:
aircraft, speed_air, height
Automatic Variable selection
This mechanism requires removing all observations with missing values. Because the speed air column has about 75% missing values, all those rows are removed from the analysis.
Had we chosen to drop the speed air column after the collinearity analysis and used only speed ground in its place, this decrease in sample size would not occur.
Remember, we chose speed air because it was more highly correlated with landing distance.
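The effect of missing values on sample size can be made concrete with a small sketch. It uses simulated stand-in data, not the FAA data, with 75% of speed_air deliberately set to NA; the variable names mirror the analysis, but all values are assumptions for illustration.

```r
# Simulated stand-in data (NOT the FAA data) with 75% of speed_air missing
set.seed(42)
n <- 400
dat <- data.frame(speed_ground = rnorm(n, mean = 80, sd = 10))
dat$speed_air <- dat$speed_ground + rnorm(n, mean = 0, sd = 1)
dat$distance <- 40 * dat$speed_ground + rnorm(n, mean = 0, sd = 200)
dat$speed_air[sample(n, 300)] <- NA

# lm() silently drops rows with an NA in any variable used by the formula
fit_ground <- lm(distance ~ speed_ground, data = dat)
fit_both <- lm(distance ~ speed_ground + speed_air, data = dat)
print(c(ground_only = nobs(fit_ground), with_speed_air = nobs(fit_both)))

# AIC depends on the number of rows, so these two values are not comparable
print(c(AIC(fit_ground), AIC(fit_both)))

# Refit both models on the same complete rows before comparing them
complete_rows <- dat[complete.cases(dat), ]
fit_ground_cc <- lm(distance ~ speed_ground, data = complete_rows)
fit_both_cc <- lm(distance ~ speed_ground + speed_air, data = complete_rows)
print(c(AIC(fit_ground_cc), AIC(fit_both_cc)))
```

Checking nobs() on each fitted model is a quick way to confirm that two models being compared were fitted on the same observations.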
STEP 21
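Use the R function “StepAIC” to perform forward variable selection. Compare the result with that in Step 19.
Note: the code below uses base R's step() with direction='both', so a variable added at one step can be removed again at a later step.
Using both speed ground & speed air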
# Drop rows with an NA in any column so that every candidate model uses the same rows
model_data <- drop_na(FAA_U_RmvNA)
# Null Model - Regress distance on only the intercept
nullmodel=lm(distance~1, data=model_data)
# Full Model - Regress distance on all predictor variables
fullmodel=lm(distance~aircraft+speed_ground+speed_air+height+pitch+duration+no_pasg, data=model_data)
# Final Model built using stepwise variable selection
model_pred_FAA <- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel),
direction='both')
## Start: AIC=2622.4
## distance ~ 1
##
## Df Sum of Sq RSS AIC
## + speed_air 1 118926290 14749519 2194.6
## + speed_ground 1 115311069 18364740 2237.3
## + aircraft 1 3978580 129697229 2618.5
## <none> 133675809 2622.4
## + height 1 445916 133229893 2623.7
## + duration 1 367280 133308529 2623.9
## + pitch 1 154735 133521074 2624.2
## + no_pasg 1 141913 133533895 2624.2
##
## Step: AIC=2194.58
## distance ~ speed_air
##
## Df Sum of Sq RSS AIC
## + aircraft 1 8088045 6661474 2041.6
## + height 1 2623377 12126142 2158.4
## + pitch 1 847903 13901616 2185.0
## <none> 14749519 2194.6
## + no_pasg 1 142098 14607421 2194.7
## + speed_ground 1 68916 14680603 2195.7
## + duration 1 14495 14735024 2196.4
## - speed_air 1 118926290 133675809 2622.4
##
## Step: AIC=2041.58
## distance ~ speed_air + aircraft
##
## Df Sum of Sq RSS AIC
## + height 1 3217699 3443775 1914.9
## <none> 6661474 2041.6
## + duration 1 61424 6600051 2041.8
## + no_pasg 1 47410 6614064 2042.2
## + speed_ground 1 39437 6622037 2042.4
## + pitch 1 14544 6646931 2043.2
## - aircraft 1 8088045 14749519 2194.6
## - speed_air 1 123035754 129697229 2618.5
##
## Step: AIC=1914.92
## distance ~ speed_air + aircraft + height
##
## Df Sum of Sq RSS AIC
## + no_pasg 1 39886 3403889 1914.7
## <none> 3443775 1914.9
## + duration 1 12694 3431082 1916.2
## + pitch 1 8137 3435638 1916.5
## + speed_ground 1 6494 3437282 1916.5
## - height 1 3217699 6661474 2041.6
## - aircraft 1 8682367 12126142 2158.4
## - speed_air 1 125653520 129097295 2619.6
##
## Step: AIC=1914.65
## distance ~ speed_air + aircraft + height + no_pasg
##
## Df Sum of Sq RSS AIC
## <none> 3403889 1914.7
## - no_pasg 1 39886 3443775 1914.9
## + duration 1 9734 3394155 1916.1
## + pitch 1 8832 3395057 1916.1
## + speed_ground 1 5824 3398066 1916.3
## - height 1 3210175 6614064 2042.2
## - aircraft 1 8588152 11992041 2158.2
## - speed_air 1 125626785 129030674 2621.5
INSIGHT:
The automatic method gives: distance ~ speed_air + aircraft + height + no_pasg. Note that adding no_pasg lowers AIC only from 1914.9 to 1914.7.
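For comparison, a purely forward search can be sketched with the same base R step() function by setting direction = "forward". This is a minimal illustration on the built-in mtcars data, used here as a stand-in for the FAA data; the candidate predictors are an arbitrary assumption.

```r
# mtcars is a built-in stand-in for the FAA data
null_fit <- lm(mpg ~ 1, data = mtcars)
full_fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Forward selection: start from the intercept-only model and only ever add terms
forward_fit <- step(
  null_fit,
  scope = list(lower = formula(null_fit), upper = formula(full_fit)),
  direction = "forward",
  trace = 0  # suppress the step-by-step log
)

print(formula(forward_fit))
print(c(null = AIC(null_fit), selected = AIC(forward_fit)))
```

With direction = "forward" a term is never removed once it has been added, whereas direction = "both" re-evaluates removals at every step, as the "-" rows in the log above show.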
Dropping speed air
# speed_air is not a candidate here, so the full data set is used and no rows are dropped
model_data <- FAA_U_RmvNA
# Null Model - Regress distance on only the intercept
nullmodel=lm(distance~1, data=model_data)
# Full Model - Regress distance on all predictor variables except speed_air
fullmodel=lm(distance~aircraft+speed_ground+height+pitch+duration+no_pasg, data=model_data)
# Final Model built using stepwise variable selection
model_pred_FAA <- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel),
direction='both')
## Start: AIC=10659.83
## distance ~ 1
##
## Df Sum of Sq RSS AIC
## + speed_ground 1 474544306 163979281 9597.4
## + aircraft 1 33387572 605136015 10619.8
## + height 1 6947258 631576329 10653.3
## + pitch 1 2946920 635576667 10658.2
## + duration 1 1728559 636795028 10659.7
## <none> 638523587 10659.8
## + no_pasg 1 170040 638353546 10661.6
##
## Step: AIC=9597.41
## distance ~ speed_ground
##
## Df Sum of Sq RSS AIC
## + aircraft 1 48531001 115448279 9324.6
## + height 1 13348111 150631170 9532.9
## + pitch 1 8648983 155330297 9557.0
## <none> 163979281 9597.4
## + no_pasg 1 263028 163716252 9598.2
## + duration 1 36384 163942897 9599.2
## - speed_ground 1 474544306 638523587 10659.8
##
## Step: AIC=9324.64
## distance ~ speed_ground + aircraft
##
## Df Sum of Sq RSS AIC
## + height 1 14355255 101093024 9222.7
## <none> 115448279 9324.6
## + pitch 1 207422 115240857 9325.2
## + no_pasg 1 94868 115353412 9326.0
## + duration 1 16936 115431343 9326.5
## - aircraft 1 48531001 163979281 9597.4
## - speed_ground 1 489687736 605136015 10619.8
##
## Step: AIC=9222.67
## distance ~ speed_ground + aircraft + height
##
## Df Sum of Sq RSS AIC
## <none> 101093024 9222.7
## + no_pasg 1 205566 100887458 9223.1
## + pitch 1 90919 101002105 9224.0
## + duration 1 10794 101082231 9224.6
## - height 1 14355255 115448279 9324.6
## - aircraft 1 49538146 150631170 9532.9
## - speed_ground 1 496573808 597666832 10612.1
INSIGHT:
The automatic method gives: distance ~ speed_ground + aircraft + height. Adding no_pasg would raise AIC from 9222.7 to 9223.1, so it is not selected.
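The AIC values printed by step() are on a different scale from those returned by AIC() earlier. For lm fits, step() uses n * log(RSS / n) + 2 * edf, which omits an additive constant that depends only on n. The sketch below reproduces the two starting values from the logs above; the helper step_aic is hypothetical, and the sample sizes are assumptions inferred from the output (n = 195 from 190 residual degrees of freedom plus 5 coefficients, and n = 783 from 195 plus the 588 rows reported as deleted).

```r
# Hypothetical helper: criterion reported by step() for lm fits
# rss = residual sum of squares, n = number of rows, edf = number of coefficients
step_aic <- function(rss, n, edf) {
  n * log(rss / n) + 2 * edf
}

# Intercept-only model on the complete rows (assumed n = 195)
aic_complete <- step_aic(rss = 133675809, n = 195, edf = 1)

# Intercept-only model on all rows (assumed n = 783)
aic_all <- step_aic(rss = 638523587, n = 783, edf = 1)

print(round(c(aic_complete, aic_all), 2))
```

Both results should agree, up to rounding, with the "Start: AIC=" lines of the two runs. Because the criterion scales with n, AIC values from the run with speed air and the run without it cannot be compared with each other.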
The manual selection process suggested that the significant predictors are:
aircraft, speed_air, height
The automatic method with speed air gives:
distance ~ speed_air + aircraft + height + no_pasg
We observe a discrepancy for no_pasg here. Let's fit this new model and check the effect of adding the no_pasg variable.
model_corr<-lm(distance ~ speed_air + aircraft + height + no_pasg,data=FAA_U_RmvNA)
summary(model_corr)
##
## Call:
## lm(formula = distance ~ speed_air + aircraft + height + no_pasg,
## data = FAA_U_RmvNA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -290.38 -89.86 8.45 85.94 358.59
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5831.6802 136.3471 -42.771 <2e-16 ***
## speed_air 82.0317 0.9796 83.739 <2e-16 ***
## aircraft1 -432.0737 19.7342 -21.895 <2e-16 ***
## height 13.7758 1.0291 13.386 <2e-16 ***
## no_pasg -2.0410 1.3679 -1.492 0.137
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 133.8 on 190 degrees of freedom
## (588 observations deleted due to missingness)
## Multiple R-squared: 0.9745, Adjusted R-squared: 0.974
## F-statistic: 1818 on 4 and 190 DF, p-value: < 2.2e-16
Insight:
no_pasg is not significant: its p-value (0.137) is above 0.05. Adjusted R-squared also increases only marginally when it is included (0.974 versus 0.9738 without it). Hence it is better not to include this predictor in our model.
Final model: Landing Distance ~ aircraft, speed_air, height
Model_lm_sg <- lm(distance ~ aircraft+speed_air+height, data=FAA_U_RmvNA)
summary(Model_lm_sg)
##
## Call:
## lm(formula = distance ~ aircraft + speed_air + height, data = FAA_U_RmvNA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -293.22 -93.83 15.35 90.05 332.84
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5954.3835 109.1050 -54.58 <2e-16 ***
## aircraft1 -433.7406 19.7657 -21.94 <2e-16 ***
## speed_air 82.0393 0.9827 83.48 <2e-16 ***
## height 13.7913 1.0324 13.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 134.3 on 191 degrees of freedom
## (588 observations deleted due to missingness)
## Multiple R-squared: 0.9742, Adjusted R-squared: 0.9738
## F-statistic: 2408 on 3 and 191 DF, p-value: < 2.2e-16
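The decision to leave out no_pasg can also be framed as a comparison of two nested models with a partial F-test. The following is a minimal sketch on simulated stand-in data, not the FAA data; the coefficients and noise level are assumptions loosely modelled on the estimates above, and no_pasg is generated with no true effect.

```r
# Simulated stand-in data (NOT the FAA data); no_pasg has no true effect here
set.seed(7)
n <- 195
dat <- data.frame(
  speed_air = rnorm(n, mean = 104, sd = 10),
  height = rnorm(n, mean = 30, sd = 10),
  no_pasg = round(rnorm(n, mean = 60, sd = 7))
)
dat$distance <- -6000 + 82 * dat$speed_air + 14 * dat$height +
  rnorm(n, mean = 0, sd = 130)

reduced <- lm(distance ~ speed_air + height, data = dat)
full <- lm(distance ~ speed_air + height + no_pasg, data = dat)

# Partial F-test: does adding no_pasg significantly reduce the residual sum of squares?
cmp <- anova(reduced, full)
print(cmp)

# For a single added term, the F-test p-value equals the t-test p-value in the full model
p_f <- cmp[["Pr(>F)"]][2]
p_t <- coef(summary(full))["no_pasg", "Pr(>|t|)"]
print(c(p_f = p_f, p_t = p_t))
```

For one added predictor this gives the same answer as reading the p-value from summary(); anova() becomes more useful when several predictors are added or removed at once.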