Questions Part I: Initial exploration 1. Read the two files FAA1.xls (800 flights) and FAA2.xls into R. 2. Check the structure of each data set using the str() function. For each data set, what is the sample size and number of variables? Are there any differences between the two data sets? 3. Merge the two data sets. Are there any duplicate observations? If there are duplicate observations, what actions would you take before doing any further analysis? 4. Check the structure of the combined data set. Provide an appropriate (and concise) summary of each variable. (Graphics would be great, too!) 5. At this point, you are asked by FAA agents to prepare ONE presentation slide to summarize your findings (3–7 bullet points will suffice). What observations will you bring to the attention of the agents?

library(readxl)
## Warning: package 'readxl' was built under R version 4.3.2
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
FAA1 <- read_excel("FAA1.xls")
FAA2 <- read_excel("FAA2.xls")
str(FAA1)
## tibble [800 × 8] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : chr [1:800] "boeing" "boeing" "boeing" "boeing" ...
##  $ duration    : num [1:800] 98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num [1:800] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:800] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:800] 109 103 NA NA NA ...
##  $ height      : num [1:800] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:800] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:800] 3370 2988 1145 1664 1050 ...
str(FAA2)
## tibble [150 × 7] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : chr [1:150] "boeing" "boeing" "boeing" "boeing" ...
##  $ no_pasg     : num [1:150] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:150] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:150] 109 103 NA NA NA ...
##  $ height      : num [1:150] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:150] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:150] 3370 2988 1145 1664 1050 ...
  1. FAA1 dataset has 800 observations and 8 variables. FAA2 dataset has 150 observations and 7 variables.

FAA1 dataset has duration variable while FAA2 dataset does not.

FAA1$duration <- NULL
FAA <- bind_rows(FAA1, FAA2)
FAA$dup_flag <- as.numeric(duplicated(FAA))
FAA <- FAA[FAA$dup_flag == 0, ]
FAA <- FAA[, 1:7]
  1. There are 100 duplicated values after merging the 2 FAA files. We have removed the duplicated values prior to doing any further analysis.

str(FAA)
## tibble [850 × 7] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : chr [1:850] "boeing" "boeing" "boeing" "boeing" ...
##  $ no_pasg     : num [1:850] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:850] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:850] 109 103 NA NA NA ...
##  $ height      : num [1:850] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:850] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:850] 3370 2988 1145 1664 1050 ...
summary(FAA)
##    aircraft            no_pasg      speed_ground      speed_air     
##  Length:850         Min.   :29.0   Min.   : 27.74   Min.   : 90.00  
##  Class :character   1st Qu.:55.0   1st Qu.: 65.90   1st Qu.: 96.25  
##  Mode  :character   Median :60.0   Median : 79.64   Median :101.15  
##                     Mean   :60.1   Mean   : 79.45   Mean   :103.80  
##                     3rd Qu.:65.0   3rd Qu.: 92.06   3rd Qu.:109.40  
##                     Max.   :87.0   Max.   :141.22   Max.   :141.72  
##                                                     NA's   :642     
##      height           pitch          distance      
##  Min.   :-3.546   Min.   :2.284   Min.   :  34.08  
##  1st Qu.:23.314   1st Qu.:3.642   1st Qu.: 883.79  
##  Median :30.093   Median :4.008   Median :1258.09  
##  Mean   :30.144   Mean   :4.009   Mean   :1526.02  
##  3rd Qu.:36.993   3rd Qu.:4.377   3rd Qu.:1936.95  
##  Max.   :59.946   Max.   :5.927   Max.   :6533.05  
## 
par(mfrow=c(2,4))
for (i in which(sapply(FAA, is.numeric))){
  boxplot(FAA[,i], main=colnames(FAA[i]), ylab="Value") 
}

  1. • There are 2 types of aircrafts: Boeing and Airbus • The min speed ground is less than 30 MPH and max speed ground is greater than 140 MPH. Both are considered abnormal. • The max speed air is greater than 140 MPH which could increase the risk of landing overrun. There are 642 missing values. • The min height value is -3.546 which is not possible. • The max distance value is greater than the typical runway length of less than 6000 feet which means the landing overrun risk is high. The min distance value of 34.08 does not seem possible either.

Part II: Data cleaning and further exploration 6. Are there any “abnormal” values in the data set? (You can refer to the data dictionary for criteria defining “normal/abnormal” values for each column.) Remove any rows that contain any “abnormal” values and report how many rows you have removed.

FAA_clean <- subset(FAA, 
                    (is.na(speed_ground) | (speed_ground >= 30 & speed_ground <= 140)) &
                    (is.na(speed_air) | (speed_air >= 30 & speed_air <= 140)) &
                    (is.na(height) | height >= 6) &
                    (is.na(distance) | distance < 6000))

rows_removed <- nrow(FAA) - nrow(FAA_clean)

We have removed 14 rows that contain the abnormal values.

  1. Repeat Step 4. Can you find a single visualization that helps to summarize the data?
str(FAA_clean)
## tibble [836 × 7] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : chr [1:836] "boeing" "boeing" "boeing" "boeing" ...
##  $ no_pasg     : num [1:836] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:836] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:836] 109 103 NA NA NA ...
##  $ height      : num [1:836] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:836] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:836] 3370 2988 1145 1664 1050 ...
summary(FAA_clean)
##    aircraft            no_pasg       speed_ground      speed_air     
##  Length:836         Min.   :29.00   Min.   : 33.57   Min.   : 90.00  
##  Class :character   1st Qu.:55.00   1st Qu.: 66.20   1st Qu.: 96.21  
##  Mode  :character   Median :60.00   Median : 79.83   Median :101.11  
##                     Mean   :60.04   Mean   : 79.59   Mean   :103.46  
##                     3rd Qu.:65.00   3rd Qu.: 92.11   3rd Qu.:109.33  
##                     Max.   :87.00   Max.   :132.78   Max.   :132.91  
##                                                      NA's   :630     
##      height           pitch          distance      
##  Min.   : 6.228   Min.   :2.284   Min.   :  41.72  
##  1st Qu.:23.578   1st Qu.:3.640   1st Qu.: 895.83  
##  Median :30.210   Median :4.002   Median :1263.54  
##  Mean   :30.510   Mean   :4.005   Mean   :1526.05  
##  3rd Qu.:37.038   3rd Qu.:4.370   3rd Qu.:1939.68  
##  Max.   :59.946   Max.   :5.927   Max.   :5381.96  
## 
par(mfrow=c(2,4))
for(i in which(sapply(FAA_clean, is.numeric))){
  hist(FAA_clean[,i][!is.na(FAA_clean[,i])], main=colnames(FAA_clean[i]), xlab="Value", col="blue")
}

  1. Prepare another presentation slide (or 3–7 bullet points) to summarize your findings obtained from the cleaned data set. • There are now 836 observations that are normal based on data dictionary requirements. • 14 rows of abnormal data were removed from the FAA_clean dataset. • There are 642 missing values in speed_air. • All variables seem to follow normal distribution except for speed_air and distance.

Part III: Initial analysis for identifying important factors that impact the re- sponse variable distance 10. Compute the pairwise correlation between the distance and each predictor (i.e., every other column; consider re-encoding aircraft as 0/1 for boeing/airbus). Provide a table that ranks the predictors in descending order based on the strength (i.e., absolute value) of the correlations. This table should contain three columns: • the names of variables • the size of the correlation • the direction of the correlation (i.e., positive or negative). Refer to this table as Table 1, which will be used for comparison with our analysis later.

attach(FAA_clean)
FAA_clean$aircraft <- as.numeric(ifelse(FAA_clean$aircraft == "boeing", 0, 1))

correlations <- cor(FAA_clean[,sapply(FAA_clean, is.numeric)], use = "complete.obs")

distance_correlations <- correlations[,"distance"]

distance_correlations <- distance_correlations[!names(distance_correlations) %in% "distance"]

correlation_df <- data.frame(
  Variable = names(distance_correlations),
  Correlation = distance_correlations,
  Direction = ifelse(distance_correlations > 0, "Positive", "Negative")
)

table1 <- correlation_df[order(abs(correlation_df$Correlation), decreasing = TRUE),]

table1
##                  Variable Correlation Direction
## speed_air       speed_air  0.94138896  Positive
## speed_ground speed_ground  0.92551427  Positive
## aircraft         aircraft -0.18678435  Negative
## height             height  0.07308959  Positive
## pitch               pitch  0.04823992  Positive
## no_pasg           no_pasg -0.03737893  Negative
  1. Construct a scatter plot between the response and each predictor. Do you think the correlation strength observed in these plots is consistent with the results displayed in Table 1?
# Aircraft
plot_1 <- ggplot(FAA_clean, aes(x = distance, y = aircraft)) +
  geom_point(alpha = 0.9,color='blue') + 
  labs(y = "aircraft", title = "")

# Air Speed
plot_2 <- ggplot(FAA_clean, aes(x = distance, y = speed_air)) +
  geom_point(alpha = 0.9,color='blue') + 
  labs(y = "speed_air", title = "")

# Ground Speed
plot_3 <- ggplot(FAA_clean, aes(x = distance, y = speed_ground)) +
  geom_point(alpha = 0.9,color='blue') + 
  labs(y = "speed_ground", title = "")

# Height
plot_4 <- ggplot(FAA_clean, aes(x = distance, y = height)) +
  geom_point(alpha = 0.9,color='blue') + 
  labs(y = "height", title = "")

# Pitch
plot_5 <- ggplot(FAA_clean, aes(x = distance, y = pitch)) +
  geom_point(alpha = 0.9,color='blue') + 
  labs(y = "pitch", title = "")

# Number of Passengers
plot_6 <- ggplot(FAA_clean, aes(x = distance, y = no_pasg)) +
  geom_point(alpha = 0.9,color='blue') + 
  labs(y = "no_pasg", title = "")

gridExtra::grid.arrange(plot_1, plot_2, plot_3, plot_4, plot_5, plot_6, ncol = 3)
## Warning: Removed 630 rows containing missing values (`geom_point()`).

We think that the correlation strength observed in these plots is consistent with the results displayed in Table 1.

  1. Discuss any issues of including aircraft in this analysis. Does the correlation analysis make sense? How else might you analyse this variable and it’s associative importance?

We think including aircraft in this analysis could lead to a few issues: - Binary encoding of aircraft would assume both Boeing and Airbus are the same which might not be true. Boeing aircrafts could be bigger or smaller and vice versa. - Correlation might not work here with a binary variable.

We think we could use interaction effects (i.e. if the effect of other variables on the distance response is different for Boeing and Airbus) and linear regression.

Part IV: Regression using a single factor each time 13. Regress distance on each of the predictors. Provide a table that ranks the factors based on its significance. The smaller the p-value, the more “important” the predictor. Organize the results in a new table, called Table 2, that’s similar to Table 1.

results <- data.frame(Variable = character(), P_Value = numeric())

predictors <- c("aircraft", "speed_air", "speed_ground", "height", "pitch", "no_pasg")

for (var in predictors) {
  model <- lm(paste("distance ~", var), data = FAA_clean)

  p_value <- summary(model)$coefficients[2, 4]
  
  results <- rbind(results, data.frame(Variable = var, P_Value = p_value))
}

table2 <- results[order(results$P_Value), ]

table2
##       Variable       P_Value
## 3 speed_ground 5.079768e-254
## 2    speed_air  3.118417e-98
## 1     aircraft  1.941505e-12
## 4       height  1.822986e-03
## 5        pitch  7.079296e-03
## 6      no_pasg  5.414166e-01
  1. Standardize each predictor by subtracting its mean and dividing by its standard deviation (i.e., so each predictor is centered with unit variance). Regress distance on each of the standardized predictors. Provide a table that ranks the predictors based on the size of the regression coefficients. The larger the size, the more “important” the predictor. Store the results in a new table, called Table 3; be sure to include a third column giving the sign of the corresponding coefficient.
results2 <- data.frame(Variable = character(), Coefficient = numeric(), Sign = character())

FAA_clean_standardized <- data.frame(scale(FAA_clean[predictors]))

for (var in predictors) {
  model <- lm(paste("distance ~", var), data = FAA_clean_standardized)
  
  coefficient <- summary(model)$coefficients[2, 1]
  
  sign <- ifelse(coefficient > 0, "Positive", "Negative")
  
  results2 <- rbind(results2, data.frame(Variable = var, Coefficient = coefficient, Sign = sign))
}

table3 <- results2[order(-abs(results2$Coefficient)), ]

table3
##       Variable Coefficient     Sign
## 3 speed_ground   778.57803 Positive
## 2    speed_air   771.26509 Positive
## 1     aircraft  -215.81909 Negative
## 4       height    96.73571 Positive
## 5        pitch    83.62405 Positive
## 6      no_pasg   -19.00133 Negative
  1. Compare Tables 1, 2, and 3. Are the results consistent? At this point, you will meet with the FAA agents again. Please provide a single table than ranks all the predictors based on their relative importance in predicting distance. We’ll call it Table 0.
table1$Rank <- rank(-abs(table1$Correlation))
table2$Rank <- rank(-log10(table2$P_Value))
table3$Rank <- rank(-abs(table3$Coefficient))

combined <- rbind(table1[, c("Variable", "Rank")],
                  table2[, c("Variable", "Rank")],
                  table3[, c("Variable", "Rank")])

average_rank <- aggregate(Rank ~ Variable, data = combined, FUN = mean)

table0 <- average_rank[order(average_rank$Rank), ]

table0
##       Variable     Rank
## 5    speed_air 2.666667
## 6 speed_ground 3.000000
## 1     aircraft 3.333333
## 2       height 3.666667
## 4        pitch 4.000000
## 3      no_pasg 4.333333

Table 1, table 2 and table all agree that speed_air and speed_ground have positive relationship with distance response and no_pasg has a negative one. Height and pitch have a slight positive relationship with distance. In all 3 tables, speed_ground and speed_air are the 2 most important predictors.

Part V: Checking for multicollinearity 16. Compare the regression coefficients of the three models below: Model 1: distance ~ speed_ground Model 2: distance ~ speed_air Model 3: distance ~ speed_ground + speed_air. Do you observe any potential problems in the sign/significance of each term? Check the correlation between speed_ground and speed_air. If you had to choose between one of these two variables to include in the 2 model, which would it be and why? What other methods could you use to deal with potential multicollinearity? Are there any drawbacks to these other methods?

model1 <- lm(distance ~ speed_ground, data = FAA_clean)
coef1 <- summary(model1)$coefficients[2, 1]

model2 <- lm(distance ~ speed_air, data = FAA_clean)
coef2 <- summary(model2)$coefficients[2, 1]

model3 <- lm(distance ~ speed_ground + speed_air, data = FAA_clean)
coef3 <- summary(model3)$coefficients[2:3, 1]

coef_table <- data.frame(
  Model = c("Model 1", "Model 2", "Model 3 (speed_ground)", "Model 3 (speed_air)"),
  Coefficient = c(coef1, coef2, coef3[1], coef3[2])
)

coef_table
##                    Model Coefficient
## 1                Model 1    41.56248
## 2                Model 2    79.57216
## 3 Model 3 (speed_ground)   -15.66459
## 4    Model 3 (speed_air)    95.30084
library(GGally)
## Warning: package 'GGally' was built under R version 4.3.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(FAA_clean[, c("speed_ground", "speed_air")])
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 630 rows containing missing values
## Warning: Removed 630 rows containing missing values (`geom_point()`).
## Warning: Removed 630 rows containing non-finite values (`stat_density()`).

The results suggest there’s multicollinearity between speed_ground and speed_air. In model 1 and model 2, both the coefficients for speed_ground and speed_air are positive. However, in model 3 when both predictors are included in the same regression, the coefficient for speed_ground turn negative.

A pairwise plot also shows that the correlation between the 2 predictors are 0.988 indicating multicollinearity issue.

If we have to choose one of the two variables to include in the model, we would pick speed_ground since there are 642 missing values in speed_air out of 836 total observations.

Methods to deal with multicollinearity are considering dropping one or some of the correlated predictors, using Principal Component Analysis (PCA) and penalized regression (ridge regression and LASSO).

There are drawbacks to these methods: - Dropping predictors: may result in loss of unique information. - PCA: creates less interpretable components. - Penalized regression: choosing the right penalty parameter could be difficult. Ridge doesn’t perform feature selection, while LASSO might ignore some correlated predictors.

Part VI: Variable selection based on Table 0 17. Suppose in Table 0, the variable ranking is as follows: X1, X2, X3, X4, X5, and X6, where X1 and X6 are the most important and least important predictors, respectively. Please fit the following six models: • Model 1: Y ~ X1 (i.e., regress distance on the most important predictor from Table 0) • Model 2: Y ~ X1 + X2 (i.e., regress distance on the top two most important predictors from Table 0) • .And so forth. . . Calculate R2 (i.e., ) R-squared for each model. Construct a plot of R2 (y-axis) vs. the number of variables p (x-axis). What pattern(s) do you observe?

predictors <- c("speed_air", "speed_ground", "aircraft", "height", "pitch", "no_pasg")

r_squared <- numeric()

for (i in 1:length(predictors)) {
  formula <- as.formula(paste("distance ~", paste(predictors[1:i], collapse = " + ")))
  
  model <- lm(formula, data = FAA_clean)
  
  r_squared[i] <- summary(model)$r.squared
}

plot(1:length(predictors), r_squared, type = "b", xlab = "Number of Variables", ylab = "R-squared")

R-squared value increases as we go from 1 to 4 variables suggesting adding more variables would lead to improvement in the model’s predictive power in explaining the variability of landing distance. However, the R-squared value starts to plateau when we have more than 4 variables indicating that adding more variables after 4 predictors would not help improve the model’s predictive capability. This pattern suggests that simpler model with fewer predictors is more favorable.

  1. Repeat 17), but use adjusted R2 instead.
predictors <- c("speed_air", "speed_ground", "aircraft", "height", "pitch", "no_pasg")

adjusted_r_squared <- numeric()

for (i in 1:length(predictors)) {
  formula <- as.formula(paste("distance ~", paste(predictors[1:i], collapse = " + ")))
  
  model <- lm(formula, data = FAA_clean)
  
  adjusted_r_squared[i] <- summary(model)$adj.r.squared
}

plot(1:length(predictors), adjusted_r_squared, type = "b", xlab = "Number of Variables", ylab = "Adjusted R-squared")

We see the similar pattern when using adjusted R-squared instead. A simpler model with fewer variables is more desirable.

  1. Repeat 17. but use AIC instead.
predictors <- c("speed_air", "speed_ground", "aircraft", "height", "pitch", "no_pasg")

aic_values <- numeric()

for (i in 1:length(predictors)) {
  formula <- as.formula(paste("distance ~", paste(predictors[1:i], collapse = " + ")))
  
  model <- lm(formula, data = FAA_clean)
  
  aic_values[i] <- AIC(model)
}

plot(1:length(predictors), aic_values, type = "b", xlab = "Number of Variables", ylab = "AIC")

The plot shows that adding more predictors to the model improves the model’s fit as evidenced by the decreasing AIC values. However, the decrease in AIC starts to plateau after 4 variables. This approach also highlights the same conclusion that a simpler model with fewer predictors is preferable.

  1. Compare the results 17–19. What variables would you select to build a predictive model for predicting distance? Any problems with this approach?

Based on the results of using R-squared, adjusted R-squared and AIC plots, it is recommended to use the variables speed_ground, speed_air, height, pitch to build the predictive model for predicting distance.

However, there are some problems with this approach. Speed_ground and speed_air are highly correlated which could lead to multicollinearity issue if including in the same model and result in unstable estimates of regression coefficients and high standard errors. Therefore, speed_air should not be used due to multicollinearity issue and having high number of missing values (642 missing values).

Part VII: Variable selection using automatic search procedures 21. Use the R function StepAIC() from package MASS to perform forward variable selection; see ?MASS:stepAIC for details and usage examples (or ask in the class Teams chat). Compare the result with those from 19).

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
initial_model <- lm(distance ~ 1, data = FAA_clean)

final_model <- stepAIC(initial_model, direction = "forward", scope = list(lower = ~1, upper = ~speed_ground + height + pitch + no_pasg + aircraft), trace = TRUE)
## Start:  AIC=11371.66
## distance ~ 1
## 
##                Df Sum of Sq       RSS   AIC
## + speed_ground  1 506163426 167807054 10211
## + aircraft      1  38892529 635077951 11324
## + height        1   7813761 666156719 11364
## + pitch         1   5839140 668131339 11366
## <none>                      673970479 11372
## + no_pasg       1    301477 673669002 11373
## 
## Step:  AIC=10211.31
## distance ~ speed_ground
## 
##            Df Sum of Sq       RSS     AIC
## + aircraft  1  50608159 117198895  9913.2
## + height    1  15498180 152308874 10132.3
## + pitch     1  10245405 157561649 10160.6
## <none>                  167807054 10211.3
## + no_pasg   1    231162 167575892 10212.2
## 
## Step:  AIC=9913.23
## distance ~ speed_ground + aircraft
## 
##           Df Sum of Sq       RSS    AIC
## + height   1  16200656 100998240 9790.9
## + pitch    1    522429 116676466 9911.5
## <none>                 117198895 9913.2
## + no_pasg  1     99811 117099084 9914.5
## 
## Step:  AIC=9790.86
## distance ~ speed_ground + aircraft + height
## 
##           Df Sum of Sq       RSS    AIC
## + pitch    1    355456 100642783 9789.9
## <none>                 100998240 9790.9
## + no_pasg  1    235620 100762620 9790.9
## 
## Step:  AIC=9789.91
## distance ~ speed_ground + aircraft + height + pitch
## 
##           Df Sum of Sq       RSS    AIC
## <none>                 100642783 9789.9
## + no_pasg  1    228062 100414722 9790.0
summary(final_model)
## 
## Call:
## lm(formula = distance ~ speed_ground + aircraft + height + pitch, 
##     data = FAA_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -718.27 -221.00  -90.67  128.19 1501.18 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2197.7391   122.6031 -17.926   <2e-16 ***
## speed_ground    42.4750     0.6444  65.918   <2e-16 ***
## aircraft      -481.2831    25.8330 -18.631   <2e-16 ***
## height          14.1590     1.2306  11.506   <2e-16 ***
## pitch           41.8922    24.4529   1.713   0.0871 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 348 on 831 degrees of freedom
## Multiple R-squared:  0.8507, Adjusted R-squared:   0.85 
## F-statistic:  1183 on 4 and 831 DF,  p-value: < 2.2e-16

The AIC value of stepAIC approach is 9789.91 which is a lot higher then the AIC values (2850 - 2900) of question 19 (basically a manual process of forward selection). This could be due to question 19 include speed_air in the process and does not penalize model complexity.

Part VIII: The mirage of variable selection Watch the following talk by Frank Harrell (aside from the insightful practical bits offered in this awesome video, future homework assignments may reference Frank’s talk as well, so don’t cheat yourself and please watch): https://www.youtube.com/watch?v=DF1WsYZ94Es Pay particular attention to his comments on variable selection starting around the 16:45 mark. In 2–3 paragraphs, discuss some of the issues regarding variable selection in statistical modeling.

Variable selection in statistical modeling is a complex process with several challenges. One of the big issues is the trade-off between parsimony and predictive discrimination. Parsimony encourages usage of simplest model possible with ease of interpretability but this could limit the model’s predictive power. The more complex the more is, the better it is at making predictions but it comes with costs of overfitting and difficulty in interpreting the results.

Risk of information loss is another issue in variable selection. Deciding which features to keep uses up some of the data’s information, potentially leading to a loss of valuable insights. Also, Frank emphasizes the fact that the probability of selecting the right variables is nearly zero.

Even some of the good variable selection methods such as the Lasso are not always able to exclude the irrelevant unimportant predictors or select the important variables.