Statistical Modeling Assignment 1

Group Members : Utsav Raj, Rohit Menon, Satarupa Saha, Mounish Sunkara

Data set: Landing data from 950 commercial flights

Data dictionary

• aircraft: The make of the aircraft (e.g., Boeing or Airbus).

• duration (in minutes): Flight duration between taking off and landing. The duration of a normal flight should always be greater than 40min.

• no_pasg: The number of passengers in the flight.

• speed_ground (in miles per hour): The ground speed of the aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal.

• speed_air (in miles per hour): The air speed of the aircraft when passing over the threshold of the runway. If its value is less than 30MPH or greater than 140MPH, then the landing would be considered as abnormal.

• height (in meters): The height of the aircraft when it is passing over the threshold of the runway. The landing aircraft is required to be at least 6 meters high at the threshold of the runway.

• pitch (in degrees): Pitch angle of the aircraft when it is passing over the threshold of the runway.

• distance (in feet): The landing distance of the aircraft. More specifically, it refers to the distance between the threshold of the runway and the point where the aircraft can be fully stopped. The length of the airport runway is typically less than 6000 feet.

0.1 Part I:

0.1.1 Initial exploration

Read the two files FAA1.xls (800 flights) and FAA2.xls into R

Code

library(readxl)
FAA1 <- read_excel("FAA1.xls")
FAA2 <- read_excel("FAA2.xls")

Check the structure of each data set using the str() function. For each data set, what is the sample size and number of variables? Are there any differences between the two data sets?

Code

str(FAA1)

tibble [800 × 8] (S3: tbl_df/tbl/data.frame)
 $ aircraft    : chr [1:800] "boeing" "boeing" "boeing" "boeing" ...
 $ duration    : num [1:800] 98.5 125.7 112 196.8 90.1 ...
 $ no_pasg     : num [1:800] 53 69 61 56 70 55 54 57 61 56 ...
 $ speed_ground: num [1:800] 107.9 101.7 71.1 85.8 59.9 ...
 $ speed_air   : num [1:800] 109 103 NA NA NA ...
 $ height      : num [1:800] 27.4 27.8 18.6 30.7 32.4 ...
 $ pitch       : num [1:800] 4.04 4.12 4.43 3.88 4.03 ...
 $ distance    : num [1:800] 3370 2988 1145 1664 1050 ...

Code

str(FAA2)

tibble [150 × 7] (S3: tbl_df/tbl/data.frame)
 $ aircraft    : chr [1:150] "boeing" "boeing" "boeing" "boeing" ...
 $ no_pasg     : num [1:150] 53 69 61 56 70 55 54 57 61 56 ...
 $ speed_ground: num [1:150] 107.9 101.7 71.1 85.8 59.9 ...
 $ speed_air   : num [1:150] 109 103 NA NA NA ...
 $ height      : num [1:150] 27.4 27.8 18.6 30.7 32.4 ...
 $ pitch       : num [1:150] 4.04 4.12 4.43 3.88 4.03 ...
 $ distance    : num [1:150] 3370 2988 1145 1664 1050 ...

Data set FAA1 contains 800 rows and 8 columns while FAA2 has 150 rows and 7 columns and duration column is missing from FAA2 file.

Merge the two data sets. Are there any duplicate observations? If there are duplicate observations,what actions would you take before doing any further analysis?

Code

Final_FAA<-bind_rows( FAA1, FAA2)

Final_FAA$dup_flag<-as.numeric(duplicated(Final_FAA[, c(1, 3:8)]))

Final_FAA<-Final_FAA%>%filter(dup_flag==0)

Final_FAA<- Final_FAA[, 1:8]

We found 100 duplicates after merging the files.For further analysis we have removed the duplicate records.

Check the structure of the combined data set. Provide an appropriate (and concise) summary of each variable. (Graphics would be great, too!)

Structure of the combined Data Set

Code

str(Final_FAA)

tibble [850 × 8] (S3: tbl_df/tbl/data.frame)
 $ aircraft    : chr [1:850] "boeing" "boeing" "boeing" "boeing" ...
 $ duration    : num [1:850] 98.5 125.7 112 196.8 90.1 ...
 $ no_pasg     : num [1:850] 53 69 61 56 70 55 54 57 61 56 ...
 $ speed_ground: num [1:850] 107.9 101.7 71.1 85.8 59.9 ...
 $ speed_air   : num [1:850] 109 103 NA NA NA ...
 $ height      : num [1:850] 27.4 27.8 18.6 30.7 32.4 ...
 $ pitch       : num [1:850] 4.04 4.12 4.43 3.88 4.03 ...
 $ distance    : num [1:850] 3370 2988 1145 1664 1050 ...

Summary of the combined Data Set

Code

knitr::kable(summary(Final_FAA))

aircraft	duration	no_pasg	speed_ground	speed_air	height	pitch	distance
Length:850	Min. : 14.76	Min. :29.0	Min. : 27.74	Min. : 90.00	Min. :-3.546	Min. :2.284	Min. : 34.08
Class :character	1st Qu.:119.49	1st Qu.:55.0	1st Qu.: 65.90	1st Qu.: 96.25	1st Qu.:23.314	1st Qu.:3.642	1st Qu.: 883.79
Mode :character	Median :153.95	Median :60.0	Median : 79.64	Median :101.15	Median :30.093	Median :4.008	Median :1258.09
NA	Mean :154.01	Mean :60.1	Mean : 79.45	Mean :103.80	Mean :30.144	Mean :4.009	Mean :1526.02
NA	3rd Qu.:188.91	3rd Qu.:65.0	3rd Qu.: 92.06	3rd Qu.:109.40	3rd Qu.:36.993	3rd Qu.:4.377	3rd Qu.:1936.95
NA	Max. :305.62	Max. :87.0	Max. :141.22	Max. :141.72	Max. :59.946	Max. :5.927	Max. :6533.05
NA	NA’s :50	NA	NA	NA’s :642	NA	NA	NA

Box Plot:

Code

par(mfrow=c(2,4))
for (i in 2:ncol(Final_FAA)){
  boxplot(Final_FAA[,i],main=colnames(Final_FAA[i])) 
}

At this point, you are asked by FAA agents to prepare ONE presentation slide to summarize your findings (3–7 bullet points will suffice). What observations will you bring to the attention of the agents?

2 types of aircraft : Boeing and Airbus
Airspeed is not captured for 75% of data
Duration is not captured in FAA2 file
Removed 100 duplicates after merging both the files
Data irregularity for other columns. Min Height is negative, min value of distance is 14 mins ,speed group is also abnormal for few cases

0.2 Part II: Data cleaning and further exploration

Are there any “abnormal” values in the data set? (You can refer to the data dictionary for criteria defining “normal/abnormal” values for each column.) Remove any rows that contain any “abnormal” values and report how many rows you have removed.

Code

Filter_Final_FAA <- Final_FAA %>% 
                    filter(((duration>40) | is.na(duration)) & 
                    ((speed_ground>=30 & speed_ground<=140)| is.na(speed_ground)) & 
                    ((speed_air>=30 & speed_air<=140)| is.na(speed_air))&
                    ((height>=6)| is.na(height))&
                    ((distance<6000)| is.na(distance)))

We have filtered 19 rows based based on the criteria provided in the data dictionary.

Repeat Step 4. Can you find a single visualization that helps to summarize the data?

Code

summary(Filter_Final_FAA)

   aircraft            duration         no_pasg       speed_ground   
 Length:831         Min.   : 41.95   Min.   :29.00   Min.   : 33.57  
 Class :character   1st Qu.:119.63   1st Qu.:55.00   1st Qu.: 66.20  
 Mode  :character   Median :154.28   Median :60.00   Median : 79.79  
                    Mean   :154.78   Mean   :60.06   Mean   : 79.54  
                    3rd Qu.:189.66   3rd Qu.:65.00   3rd Qu.: 91.91  
                    Max.   :305.62   Max.   :87.00   Max.   :132.78  
                    NA's   :50                                       
   speed_air          height           pitch          distance      
 Min.   : 90.00   Min.   : 6.228   Min.   :2.284   Min.   :  41.72  
 1st Qu.: 96.23   1st Qu.:23.530   1st Qu.:3.640   1st Qu.: 893.28  
 Median :101.12   Median :30.167   Median :4.001   Median :1262.15  
 Mean   :103.48   Mean   :30.458   Mean   :4.005   Mean   :1522.48  
 3rd Qu.:109.36   3rd Qu.:37.004   3rd Qu.:4.370   3rd Qu.:1936.63  
 Max.   :132.91   Max.   :59.946   Max.   :5.927   Max.   :5381.96  
 NA's   :628

Code

hist.data.frame(Filter_Final_FAA)

Prepare another presentation slide (or 3–7 bullet points) to summarize your findings obtained from the cleaned data set.

We have filtered 19 rows based based on the criteria provided in the data dictionary.
Airspeed is not captured for 75% of data
Duration is not captured in FAA2 file
Distribution of all variables other than air speed and distance seem to follow normal distribution. Air Speed and distance are right skewed.

0.3 Part III: Initial analysis for identifying important factors that impact the response variable distance

Compute the pairwise correlation between the distance and each predictor (i.e., every other column; consider re-encoding aircraft as 0/1 for boeing/airbus). Provide a table that ranks the predictors in descending order based on the strength (i.e., absolute value) of the correlations. This table should contain three columns: • the names of variables • the size of the correlation • the direction of the correlation (i.e., positive or negative).

Code

Filter_Final_FAA_encode <- Filter_Final_FAA
Filter_Final_FAA_encode$aircraft <- ifelse(Filter_Final_FAA_encode$aircraft =="boeing", 0, 1)

table1 <- data.frame(cor(Filter_Final_FAA_encode[1:7], Filter_Final_FAA_encode$distance,use="complete.obs"))

colnames(table1) <- c("Size_Of_Correlation")

table1 <- tibble::rownames_to_column(table1, "Variable") %>% 
  mutate(Correlation_Dir = ifelse(Size_Of_Correlation < 0, "-ve","+ve"))%>%
  arrange(desc(abs(Size_Of_Correlation)))

table1

      Variable Size_Of_Correlation Correlation_Dir
1    speed_air          0.94321897             +ve
2 speed_ground          0.92877195             +ve
3     aircraft         -0.17251928             -ve
4       height          0.05775639             +ve
5     duration          0.05241698             +ve
6        pitch          0.03402263             +ve
7      no_pasg         -0.03258255             -ve

Construct a scatter plot between the response and each predictor. Do you think the correlation strength observed in these plots is consistent with the results displayed in Table 1?

Code

# Air Speed
plot_1 <- ggplot(Filter_Final_FAA_encode, aes(x = distance, y = speed_air)) +
  geom_point(alpha = 0.9,color='blue') + 
  labs(y = "speed_air", title = "")

# Ground Speed
plot_2 <- ggplot(Filter_Final_FAA_encode, aes(x = distance, y = speed_ground)) +
  geom_point(alpha = 0.9,color='blue') + 
  labs(y = "speed_ground", title = "")

# Height
plot_3 <- ggplot(Filter_Final_FAA_encode, aes(x = distance, y = height)) +
  geom_point(alpha = 0.9,color='blue') + 
  labs(y = "height", title = "")

# Pitch
plot_4 <- ggplot(Filter_Final_FAA_encode, aes(x = distance, y = pitch)) +
  geom_point(alpha = 0.9,color='blue') + 
  labs(y = "pitch", title = "")

# Duration
plot_5 <- ggplot(Filter_Final_FAA_encode, aes(x = distance, y = duration)) +
  geom_point(alpha = 0.9,color='blue') + 
  labs(y = "duration", title = "")

# Number of Passengers
plot_6 <- ggplot(Filter_Final_FAA_encode, aes(x = distance, y = no_pasg)) +
  geom_point(alpha = 0.9,color='blue') + 
  labs(y = "no_pasg", title = "")


gridExtra::grid.arrange(plot_1, plot_2, plot_3, plot_4, plot_5, plot_6, ncol = 3)

Discuss any issues of including aircraft in this analysis. Does the correlation analysis make sense? How else might you analyse this variable and it’s associative importance?

We see strong correlation between distance and ground speed and air speed.

0.4 Part IV: Regression using a single factor each time

Regress distance on each of the predictors. Provide a table that ranks the factors based on its significance. The smaller the p-value, the more “important” the predictor. Organize the results in a new table, called Table 2, that’s similar to Table 1.

Code

model<-lapply(c(1:7), function(i) lm(Filter_Final_FAA_encode[["distance"]]~Filter_Final_FAA[[i]]))

pvalues<-lapply(c(1:7), function(i) summary(model[[i]])$coefficients[2,4])

pvalues<- unlist(pvalues)
#pvalues

table2<-data.frame(colnames(Filter_Final_FAA_encode[,1:7]), pvalues)

names(table2)[1] <- 'Variable'
table2<- table2%>%arrange(pvalues)

table2

      Variable       pvalues
1 speed_ground 4.766371e-252
2    speed_air  2.500461e-97
3     aircraft  3.526194e-12
4       height  4.123860e-03
5        pitch  1.208124e-02
6     duration  1.514002e-01
7      no_pasg  6.092520e-01

Standardize each predictor by subtracting its mean and dividing by its standard deviation (i.e., so each predictor is centered with unit variance). Regress distance on each of the standardized predictors. Provide a table that ranks the predictors based on the size of the regression coefficients. The larger the size, the more “important” the predictor. Store the results in a new table, called Table 3; be sure to include a third column giving the sign of the corresponding coefficient.

Code

FAA_Stand<-data.frame(scale(Filter_Final_FAA_encode[,1:7]), Filter_Final_FAA_encode$distance)
names(FAA_Stand)[8] <- 'distance'

model_stand<-lapply(c(1:7), function(i) lm(Filter_Final_FAA_encode[["distance"]]~FAA_Stand[[i]]))

coef<-lapply(c(1:7), function(i) summary(model_stand[[i]])$coefficients[2,1])

coef<- unlist(coef)

table3<-data.frame(colnames(Filter_Final_FAA_encode[,1:7]), coef)

names(table3)[1] <- 'Variable'

table3<- table3%>%arrange(desc(abs(coef)))%>% 
  mutate(coef_direction = ifelse(coef < 0, "-ve","+ve"))

table3

      Variable       coef coef_direction
1 speed_ground  776.44740            +ve
2    speed_air  774.34654            +ve
3     aircraft -213.45802            -ve
4       height   89.10606            +ve
5        pitch   78.00693            +ve
6     duration  -46.48013            -ve
7      no_pasg  -15.91595            -ve

Compare Tables 1, 2, and 3. Are the results consistent? At this point, you will meet with the FAA agents again. Please provide a single table than ranks all the predictors based on their relative importance in predicting distance. We’ll call it Table 0.

Code

table0<-merge(x=table1,y=table2,by="Variable")
table0<-merge(x=table0,y=table3,by="Variable")

table0

      Variable Size_Of_Correlation Correlation_Dir       pvalues       coef
1     aircraft         -0.17251928             -ve  3.526194e-12 -213.45802
2     duration          0.05241698             +ve  1.514002e-01  -46.48013
3       height          0.05775639             +ve  4.123860e-03   89.10606
4      no_pasg         -0.03258255             -ve  6.092520e-01  -15.91595
5        pitch          0.03402263             +ve  1.208124e-02   78.00693
6    speed_air          0.94321897             +ve  2.500461e-97  774.34654
7 speed_ground          0.92877195             +ve 4.766371e-252  776.44740
  coef_direction
1            -ve
2            -ve
3            +ve
4            -ve
5            +ve
6            +ve
7            +ve

0.5 Part V: Checking for multicollinearity

Compare the regression coefficients of the three models below: Do you observe any potential problems in the sign/significance of each term? Check the correlation between speed_ground and speed_air. If you had to choose between one of these two variables to include in the model, which would it be and why? What other methods could you use to deal with potential multicollinearity? Are there any drawbacks to these other methods?

Code

model_1 <- lm(distance ~ speed_ground, data = FAA_Stand)
model_2 <- lm(distance ~ speed_air, data = FAA_Stand)
model_3 <- lm(distance ~ speed_ground + speed_air, data = FAA_Stand)

summary(model_1)$coefficients[,1]

 (Intercept) speed_ground 
   1522.4829     776.4474

Code

summary(model_2)$coefficients[,1]

(Intercept)   speed_air 
  2774.6729    774.3465

Code

summary(model_3)$coefficients[,1]

 (Intercept) speed_ground    speed_air 
   3117.7445    -269.2959     914.8089

Code

GGally::ggpairs(data = FAA_Stand, columns =4:5 )

Positive coefficients and significant when we are doing regress individually Negative coefficients for speed ground and p value > 0.05 for model 3 High correlation b/w air speed and ground speed. ~0.98 We should select ground speed. 75% values are missing for air speed.

0.6 Part VI: Variable selection based on Table 0

Suppose in Table 0, the variable ranking is as follows: X1, X2, X3, X4, X5, and X6, where X1 and X6 are the most important and least important predictors, respectively. Please fit the following six models:

Code

var_selection <-c()

var_selection[1] <- summary(lm(distance ~ speed_ground , data = FAA_Stand))$r.squared     
var_selection[2] <- summary(lm(distance ~ speed_ground + aircraft  , data = FAA_Stand))$r.squared 
var_selection[3] <- summary(lm(distance ~ speed_ground + aircraft + height, data = FAA_Stand))$r.squared
var_selection[4] <- summary(lm(distance ~ speed_ground + aircraft + height + duration , data = FAA_Stand))$r.squared
var_selection[5] <- summary(lm(distance ~ speed_ground + aircraft + height + duration + pitch , data = FAA_Stand))$r.squared
var_selection[6] <- summary(lm(distance ~ speed_ground + aircraft + height + duration + pitch + no_pasg, data = FAA_Stand))$r.squared

parameters <-c(1:6)

var_selection_result <- data.frame(var_selection,parameters)
var_selection_result

  var_selection parameters
1     0.7503784          1
2     0.8251319          2
3     0.8488989          3
4     0.8503048          4
5     0.8504184          5
6     0.8506023          6

Code

var_selection_result %>% 
  ggplot(aes(x = parameters , y = var_selection)) + 
  geom_line() + 
  xlab("no. of variables") +
  ylab("R-square") +
  theme_classic()

After 3 variables the r sqr value is not increasing. This implies that new variable addition is not able to explain the variation.

Repeat 17), but use adjusted R2 instead.

Code

var_selection_adj <-c()

var_selection_adj[1] <- summary(lm(distance ~ speed_ground , data = FAA_Stand))$adj.r.squared     
var_selection_adj[2] <- summary(lm(distance ~ speed_ground + aircraft  , data = FAA_Stand))$adj.r.squared 
var_selection_adj[3] <- summary(lm(distance ~ speed_ground + aircraft + height, data = FAA_Stand))$adj.r.squared
var_selection_adj[4] <- summary(lm(distance ~ speed_ground + aircraft + height + duration , data = FAA_Stand))$adj.r.squared
var_selection_adj[5] <- summary(lm(distance ~ speed_ground + aircraft + height + duration + pitch , data = FAA_Stand))$adj.r.squared
var_selection_adj[6] <- summary(lm(distance ~ speed_ground + aircraft + height + duration + pitch + no_pasg, data = FAA_Stand))$adj.r.squared

parameters <-c(1:6)

var_selection_result_adj <- data.frame(var_selection_adj,parameters)
var_selection_result_adj

  var_selection_adj parameters
1         0.7500773          1
2         0.8247095          2
3         0.8483508          3
4         0.8495332          4
5         0.8494534          5
6         0.8494442          6

Code

var_selection_result_adj %>% 
  ggplot(aes(x = parameters , y = var_selection_adj)) + 
  geom_line() + 
  xlab("no. of variables") +
  ylab("Adjusted R-square") +
  theme_classic()

Repeat 17. but use AIC instead

Code

aic_method <- c()
aic_method[1] <- AIC(lm(distance ~ speed_ground , data = FAA_Stand))
aic_method[2] <- AIC(lm(distance ~ speed_ground + aircraft  , data = FAA_Stand))
aic_method[3] <- AIC(lm(distance ~ speed_ground + aircraft + height, data = FAA_Stand))
aic_method[4] <- AIC(lm(distance ~ speed_ground + aircraft + height + duration, data = FAA_Stand))
aic_method[5] <- AIC(lm(distance ~ speed_ground + aircraft + height + duration + pitch , data = FAA_Stand))
aic_method[6] <- AIC(lm(distance ~ speed_ground + aircraft + height + duration + pitch + no_pasg, data = FAA_Stand))

aic_method

[1] 12508.81 12215.05 12095.65 11377.43 11378.84 11379.88

Code

parameters <-c(1:6)

aic_method_selection <- data.frame(aic_method,parameters)
aic_method_selection

  aic_method parameters
1   12508.81          1
2   12215.05          2
3   12095.65          3
4   11377.43          4
5   11378.84          5
6   11379.88          6

Code

aic_method_selection %>% 
  ggplot(aes(x = parameters , y = aic_method)) + 
  geom_line() + 
  xlab("no. of variables") +
  ylab("AIC Value") +
  theme_classic()

AIC decreases till model with 4 variables. Then its value remains almost the same.As per AIC we should pick 4 variables.

Compare the results 17–19. What variables would you select to build a predictive model for predicting distance? Any problems with this approach?

Based on the results we should pick Speed Ground, Aircraft and height

0.7 Part VII: Variable selection using automatic search procedures

Use the R function StepAIC() from package MASS to perform forward variable selection; see ?MASS:stepAIC for details and usage examples (or ask in the class Teams chat). Compare the result with those from 19).