Questions Part I: Initial exploration 1. Read the two files FAA1.xls (800 flights) and FAA2.xls into R. 2. Check the structure of each data set using the str() function. For each data set, what is the sample size and number of variables? Are there any differences between the two data sets? 3. Merge the two data sets. Are there any duplicate observations? If there are duplicate observations, what actions would you take before doing any further analysis? 4. Check the structure of the combined data set. Provide an appropriate (and concise) summary of each variable. (Graphics would be great, too!) 5. At this point, you are asked by FAA agents to prepare ONE presentation slide to summarize your findings (3–7 bullet points will suffice). What observations will you bring to the attention of the agents?
library(readxl)
## Warning: package 'readxl' was built under R version 4.3.2
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
FAA1 <- read_excel("FAA1.xls")
FAA2 <- read_excel("FAA2.xls")
str(FAA1)
## tibble [800 × 8] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:800] "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num [1:800] 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num [1:800] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:800] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:800] 109 103 NA NA NA ...
## $ height : num [1:800] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:800] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:800] 3370 2988 1145 1664 1050 ...
str(FAA2)
## tibble [150 × 7] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:150] "boeing" "boeing" "boeing" "boeing" ...
## $ no_pasg : num [1:150] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:150] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:150] 109 103 NA NA NA ...
## $ height : num [1:150] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:150] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:150] 3370 2988 1145 1664 1050 ...
FAA1 dataset has duration variable while FAA2 dataset does not.
FAA1$duration <- NULL
FAA <- bind_rows(FAA1, FAA2)
FAA$dup_flag <- as.numeric(duplicated(FAA))
FAA <- FAA[FAA$dup_flag == 0, ]
FAA <- FAA[, 1:7]
There are 100 duplicated values after merging the 2 FAA files. We have removed the duplicated values prior to doing any further analysis.
str(FAA)
## tibble [850 × 7] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:850] "boeing" "boeing" "boeing" "boeing" ...
## $ no_pasg : num [1:850] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:850] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:850] 109 103 NA NA NA ...
## $ height : num [1:850] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:850] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:850] 3370 2988 1145 1664 1050 ...
summary(FAA)
## aircraft no_pasg speed_ground speed_air
## Length:850 Min. :29.0 Min. : 27.74 Min. : 90.00
## Class :character 1st Qu.:55.0 1st Qu.: 65.90 1st Qu.: 96.25
## Mode :character Median :60.0 Median : 79.64 Median :101.15
## Mean :60.1 Mean : 79.45 Mean :103.80
## 3rd Qu.:65.0 3rd Qu.: 92.06 3rd Qu.:109.40
## Max. :87.0 Max. :141.22 Max. :141.72
## NA's :642
## height pitch distance
## Min. :-3.546 Min. :2.284 Min. : 34.08
## 1st Qu.:23.314 1st Qu.:3.642 1st Qu.: 883.79
## Median :30.093 Median :4.008 Median :1258.09
## Mean :30.144 Mean :4.009 Mean :1526.02
## 3rd Qu.:36.993 3rd Qu.:4.377 3rd Qu.:1936.95
## Max. :59.946 Max. :5.927 Max. :6533.05
##
par(mfrow=c(2,4))
for (i in which(sapply(FAA, is.numeric))){
boxplot(FAA[,i], main=colnames(FAA[i]), ylab="Value")
}
Part II: Data cleaning and further exploration 6. Are there any “abnormal” values in the data set? (You can refer to the data dictionary for criteria defining “normal/abnormal” values for each column.) Remove any rows that contain any “abnormal” values and report how many rows you have removed.
FAA_clean <- subset(FAA,
(is.na(speed_ground) | (speed_ground >= 30 & speed_ground <= 140)) &
(is.na(speed_air) | (speed_air >= 30 & speed_air <= 140)) &
(is.na(height) | height >= 6) &
(is.na(distance) | distance < 6000))
rows_removed <- nrow(FAA) - nrow(FAA_clean)
We have removed 14 rows that contain the abnormal values.
str(FAA_clean)
## tibble [836 × 7] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:836] "boeing" "boeing" "boeing" "boeing" ...
## $ no_pasg : num [1:836] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:836] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:836] 109 103 NA NA NA ...
## $ height : num [1:836] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:836] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:836] 3370 2988 1145 1664 1050 ...
summary(FAA_clean)
## aircraft no_pasg speed_ground speed_air
## Length:836 Min. :29.00 Min. : 33.57 Min. : 90.00
## Class :character 1st Qu.:55.00 1st Qu.: 66.20 1st Qu.: 96.21
## Mode :character Median :60.00 Median : 79.83 Median :101.11
## Mean :60.04 Mean : 79.59 Mean :103.46
## 3rd Qu.:65.00 3rd Qu.: 92.11 3rd Qu.:109.33
## Max. :87.00 Max. :132.78 Max. :132.91
## NA's :630
## height pitch distance
## Min. : 6.228 Min. :2.284 Min. : 41.72
## 1st Qu.:23.578 1st Qu.:3.640 1st Qu.: 895.83
## Median :30.210 Median :4.002 Median :1263.54
## Mean :30.510 Mean :4.005 Mean :1526.05
## 3rd Qu.:37.038 3rd Qu.:4.370 3rd Qu.:1939.68
## Max. :59.946 Max. :5.927 Max. :5381.96
##
par(mfrow=c(2,4))
for(i in which(sapply(FAA_clean, is.numeric))){
hist(FAA_clean[,i][!is.na(FAA_clean[,i])], main=colnames(FAA_clean[i]), xlab="Value", col="blue")
}
Part III: Initial analysis for identifying important factors that impact the re- sponse variable distance 10. Compute the pairwise correlation between the distance and each predictor (i.e., every other column; consider re-encoding aircraft as 0/1 for boeing/airbus). Provide a table that ranks the predictors in descending order based on the strength (i.e., absolute value) of the correlations. This table should contain three columns: • the names of variables • the size of the correlation • the direction of the correlation (i.e., positive or negative). Refer to this table as Table 1, which will be used for comparison with our analysis later.
attach(FAA_clean)
FAA_clean$aircraft <- as.numeric(ifelse(FAA_clean$aircraft == "boeing", 0, 1))
correlations <- cor(FAA_clean[,sapply(FAA_clean, is.numeric)], use = "complete.obs")
distance_correlations <- correlations[,"distance"]
distance_correlations <- distance_correlations[!names(distance_correlations) %in% "distance"]
correlation_df <- data.frame(
Variable = names(distance_correlations),
Correlation = distance_correlations,
Direction = ifelse(distance_correlations > 0, "Positive", "Negative")
)
table1 <- correlation_df[order(abs(correlation_df$Correlation), decreasing = TRUE),]
table1
## Variable Correlation Direction
## speed_air speed_air 0.94138896 Positive
## speed_ground speed_ground 0.92551427 Positive
## aircraft aircraft -0.18678435 Negative
## height height 0.07308959 Positive
## pitch pitch 0.04823992 Positive
## no_pasg no_pasg -0.03737893 Negative
# Aircraft
plot_1 <- ggplot(FAA_clean, aes(x = distance, y = aircraft)) +
geom_point(alpha = 0.9,color='blue') +
labs(y = "aircraft", title = "")
# Air Speed
plot_2 <- ggplot(FAA_clean, aes(x = distance, y = speed_air)) +
geom_point(alpha = 0.9,color='blue') +
labs(y = "speed_air", title = "")
# Ground Speed
plot_3 <- ggplot(FAA_clean, aes(x = distance, y = speed_ground)) +
geom_point(alpha = 0.9,color='blue') +
labs(y = "speed_ground", title = "")
# Height
plot_4 <- ggplot(FAA_clean, aes(x = distance, y = height)) +
geom_point(alpha = 0.9,color='blue') +
labs(y = "height", title = "")
# Pitch
plot_5 <- ggplot(FAA_clean, aes(x = distance, y = pitch)) +
geom_point(alpha = 0.9,color='blue') +
labs(y = "pitch", title = "")
# Number of Passengers
plot_6 <- ggplot(FAA_clean, aes(x = distance, y = no_pasg)) +
geom_point(alpha = 0.9,color='blue') +
labs(y = "no_pasg", title = "")
gridExtra::grid.arrange(plot_1, plot_2, plot_3, plot_4, plot_5, plot_6, ncol = 3)
## Warning: Removed 630 rows containing missing values (`geom_point()`).
We think that the correlation strength observed in these plots is consistent with the results displayed in Table 1.
We think including aircraft in this analysis could lead to a few issues: - Binary encoding of aircraft would assume both Boeing and Airbus are the same which might not be true. Boeing aircrafts could be bigger or smaller and vice versa. - Correlation might not work here with a binary variable.
We think we could use interaction effects (i.e. if the effect of other variables on the distance response is different for Boeing and Airbus) and linear regression.
Part IV: Regression using a single factor each time 13. Regress distance on each of the predictors. Provide a table that ranks the factors based on its significance. The smaller the p-value, the more “important” the predictor. Organize the results in a new table, called Table 2, that’s similar to Table 1.
results <- data.frame(Variable = character(), P_Value = numeric())
predictors <- c("aircraft", "speed_air", "speed_ground", "height", "pitch", "no_pasg")
for (var in predictors) {
model <- lm(paste("distance ~", var), data = FAA_clean)
p_value <- summary(model)$coefficients[2, 4]
results <- rbind(results, data.frame(Variable = var, P_Value = p_value))
}
table2 <- results[order(results$P_Value), ]
table2
## Variable P_Value
## 3 speed_ground 5.079768e-254
## 2 speed_air 3.118417e-98
## 1 aircraft 1.941505e-12
## 4 height 1.822986e-03
## 5 pitch 7.079296e-03
## 6 no_pasg 5.414166e-01
results2 <- data.frame(Variable = character(), Coefficient = numeric(), Sign = character())
FAA_clean_standardized <- data.frame(scale(FAA_clean[predictors]))
for (var in predictors) {
model <- lm(paste("distance ~", var), data = FAA_clean_standardized)
coefficient <- summary(model)$coefficients[2, 1]
sign <- ifelse(coefficient > 0, "Positive", "Negative")
results2 <- rbind(results2, data.frame(Variable = var, Coefficient = coefficient, Sign = sign))
}
table3 <- results2[order(-abs(results2$Coefficient)), ]
table3
## Variable Coefficient Sign
## 3 speed_ground 778.57803 Positive
## 2 speed_air 771.26509 Positive
## 1 aircraft -215.81909 Negative
## 4 height 96.73571 Positive
## 5 pitch 83.62405 Positive
## 6 no_pasg -19.00133 Negative
table1$Rank <- rank(-abs(table1$Correlation))
table2$Rank <- rank(-log10(table2$P_Value))
table3$Rank <- rank(-abs(table3$Coefficient))
combined <- rbind(table1[, c("Variable", "Rank")],
table2[, c("Variable", "Rank")],
table3[, c("Variable", "Rank")])
average_rank <- aggregate(Rank ~ Variable, data = combined, FUN = mean)
table0 <- average_rank[order(average_rank$Rank), ]
table0
## Variable Rank
## 5 speed_air 2.666667
## 6 speed_ground 3.000000
## 1 aircraft 3.333333
## 2 height 3.666667
## 4 pitch 4.000000
## 3 no_pasg 4.333333
Table 1, table 2 and table all agree that speed_air and speed_ground have positive relationship with distance response and no_pasg has a negative one. Height and pitch have a slight positive relationship with distance. In all 3 tables, speed_ground and speed_air are the 2 most important predictors.
Part V: Checking for multicollinearity 16. Compare the regression coefficients of the three models below: Model 1: distance ~ speed_ground Model 2: distance ~ speed_air Model 3: distance ~ speed_ground + speed_air. Do you observe any potential problems in the sign/significance of each term? Check the correlation between speed_ground and speed_air. If you had to choose between one of these two variables to include in the 2 model, which would it be and why? What other methods could you use to deal with potential multicollinearity? Are there any drawbacks to these other methods?
model1 <- lm(distance ~ speed_ground, data = FAA_clean)
coef1 <- summary(model1)$coefficients[2, 1]
model2 <- lm(distance ~ speed_air, data = FAA_clean)
coef2 <- summary(model2)$coefficients[2, 1]
model3 <- lm(distance ~ speed_ground + speed_air, data = FAA_clean)
coef3 <- summary(model3)$coefficients[2:3, 1]
coef_table <- data.frame(
Model = c("Model 1", "Model 2", "Model 3 (speed_ground)", "Model 3 (speed_air)"),
Coefficient = c(coef1, coef2, coef3[1], coef3[2])
)
coef_table
## Model Coefficient
## 1 Model 1 41.56248
## 2 Model 2 79.57216
## 3 Model 3 (speed_ground) -15.66459
## 4 Model 3 (speed_air) 95.30084
library(GGally)
## Warning: package 'GGally' was built under R version 4.3.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(FAA_clean[, c("speed_ground", "speed_air")])
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 630 rows containing missing values
## Warning: Removed 630 rows containing missing values (`geom_point()`).
## Warning: Removed 630 rows containing non-finite values (`stat_density()`).
The results suggest there’s multicollinearity between speed_ground and speed_air. In model 1 and model 2, both the coefficients for speed_ground and speed_air are positive. However, in model 3 when both predictors are included in the same regression, the coefficient for speed_ground turn negative.
A pairwise plot also shows that the correlation between the 2 predictors are 0.988 indicating multicollinearity issue.
If we have to choose one of the two variables to include in the model, we would pick speed_ground since there are 642 missing values in speed_air out of 836 total observations.
Methods to deal with multicollinearity are considering dropping one or some of the correlated predictors, using Principal Component Analysis (PCA) and penalized regression (ridge regression and LASSO).
There are drawbacks to these methods: - Dropping predictors: may result in loss of unique information. - PCA: creates less interpretable components. - Penalized regression: choosing the right penalty parameter could be difficult. Ridge doesn’t perform feature selection, while LASSO might ignore some correlated predictors.
Part VI: Variable selection based on Table 0 17. Suppose in Table 0, the variable ranking is as follows: X1, X2, X3, X4, X5, and X6, where X1 and X6 are the most important and least important predictors, respectively. Please fit the following six models: • Model 1: Y ~ X1 (i.e., regress distance on the most important predictor from Table 0) • Model 2: Y ~ X1 + X2 (i.e., regress distance on the top two most important predictors from Table 0) • .And so forth. . . Calculate R2 (i.e., ) R-squared for each model. Construct a plot of R2 (y-axis) vs. the number of variables p (x-axis). What pattern(s) do you observe?
predictors <- c("speed_air", "speed_ground", "aircraft", "height", "pitch", "no_pasg")
r_squared <- numeric()
for (i in 1:length(predictors)) {
formula <- as.formula(paste("distance ~", paste(predictors[1:i], collapse = " + ")))
model <- lm(formula, data = FAA_clean)
r_squared[i] <- summary(model)$r.squared
}
plot(1:length(predictors), r_squared, type = "b", xlab = "Number of Variables", ylab = "R-squared")
R-squared value increases as we go from 1 to 4 variables suggesting
adding more variables would lead to improvement in the model’s
predictive power in explaining the variability of landing distance.
However, the R-squared value starts to plateau when we have more than 4
variables indicating that adding more variables after 4 predictors would
not help improve the model’s predictive capability. This pattern
suggests that simpler model with fewer predictors is more favorable.
predictors <- c("speed_air", "speed_ground", "aircraft", "height", "pitch", "no_pasg")
adjusted_r_squared <- numeric()
for (i in 1:length(predictors)) {
formula <- as.formula(paste("distance ~", paste(predictors[1:i], collapse = " + ")))
model <- lm(formula, data = FAA_clean)
adjusted_r_squared[i] <- summary(model)$adj.r.squared
}
plot(1:length(predictors), adjusted_r_squared, type = "b", xlab = "Number of Variables", ylab = "Adjusted R-squared")
We see the similar pattern when using adjusted R-squared instead. A
simpler model with fewer variables is more desirable.
predictors <- c("speed_air", "speed_ground", "aircraft", "height", "pitch", "no_pasg")
aic_values <- numeric()
for (i in 1:length(predictors)) {
formula <- as.formula(paste("distance ~", paste(predictors[1:i], collapse = " + ")))
model <- lm(formula, data = FAA_clean)
aic_values[i] <- AIC(model)
}
plot(1:length(predictors), aic_values, type = "b", xlab = "Number of Variables", ylab = "AIC")
The plot shows that adding more predictors to the model improves the model’s fit as evidenced by the decreasing AIC values. However, the decrease in AIC starts to plateau after 4 variables. This approach also highlights the same conclusion that a simpler model with fewer predictors is preferable.
Based on the results of using R-squared, adjusted R-squared and AIC plots, it is recommended to use the variables speed_ground, speed_air, height, pitch to build the predictive model for predicting distance.
However, there are some problems with this approach. Speed_ground and speed_air are highly correlated which could lead to multicollinearity issue if including in the same model and result in unstable estimates of regression coefficients and high standard errors. Therefore, speed_air should not be used due to multicollinearity issue and having high number of missing values (642 missing values).
Part VII: Variable selection using automatic search procedures 21. Use the R function StepAIC() from package MASS to perform forward variable selection; see ?MASS:stepAIC for details and usage examples (or ask in the class Teams chat). Compare the result with those from 19).
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
initial_model <- lm(distance ~ 1, data = FAA_clean)
final_model <- stepAIC(initial_model, direction = "forward", scope = list(lower = ~1, upper = ~speed_ground + height + pitch + no_pasg + aircraft), trace = TRUE)
## Start: AIC=11371.66
## distance ~ 1
##
## Df Sum of Sq RSS AIC
## + speed_ground 1 506163426 167807054 10211
## + aircraft 1 38892529 635077951 11324
## + height 1 7813761 666156719 11364
## + pitch 1 5839140 668131339 11366
## <none> 673970479 11372
## + no_pasg 1 301477 673669002 11373
##
## Step: AIC=10211.31
## distance ~ speed_ground
##
## Df Sum of Sq RSS AIC
## + aircraft 1 50608159 117198895 9913.2
## + height 1 15498180 152308874 10132.3
## + pitch 1 10245405 157561649 10160.6
## <none> 167807054 10211.3
## + no_pasg 1 231162 167575892 10212.2
##
## Step: AIC=9913.23
## distance ~ speed_ground + aircraft
##
## Df Sum of Sq RSS AIC
## + height 1 16200656 100998240 9790.9
## + pitch 1 522429 116676466 9911.5
## <none> 117198895 9913.2
## + no_pasg 1 99811 117099084 9914.5
##
## Step: AIC=9790.86
## distance ~ speed_ground + aircraft + height
##
## Df Sum of Sq RSS AIC
## + pitch 1 355456 100642783 9789.9
## <none> 100998240 9790.9
## + no_pasg 1 235620 100762620 9790.9
##
## Step: AIC=9789.91
## distance ~ speed_ground + aircraft + height + pitch
##
## Df Sum of Sq RSS AIC
## <none> 100642783 9789.9
## + no_pasg 1 228062 100414722 9790.0
summary(final_model)
##
## Call:
## lm(formula = distance ~ speed_ground + aircraft + height + pitch,
## data = FAA_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -718.27 -221.00 -90.67 128.19 1501.18
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2197.7391 122.6031 -17.926 <2e-16 ***
## speed_ground 42.4750 0.6444 65.918 <2e-16 ***
## aircraft -481.2831 25.8330 -18.631 <2e-16 ***
## height 14.1590 1.2306 11.506 <2e-16 ***
## pitch 41.8922 24.4529 1.713 0.0871 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 348 on 831 degrees of freedom
## Multiple R-squared: 0.8507, Adjusted R-squared: 0.85
## F-statistic: 1183 on 4 and 831 DF, p-value: < 2.2e-16
The AIC value of stepAIC approach is 9789.91 which is a lot higher then the AIC values (2850 - 2900) of question 19 (basically a manual process of forward selection). This could be due to question 19 include speed_air in the process and does not penalize model complexity.
Part VIII: The mirage of variable selection Watch the following talk by Frank Harrell (aside from the insightful practical bits offered in this awesome video, future homework assignments may reference Frank’s talk as well, so don’t cheat yourself and please watch): https://www.youtube.com/watch?v=DF1WsYZ94Es Pay particular attention to his comments on variable selection starting around the 16:45 mark. In 2–3 paragraphs, discuss some of the issues regarding variable selection in statistical modeling.
Variable selection in statistical modeling is a complex process with several challenges. One of the big issues is the trade-off between parsimony and predictive discrimination. Parsimony encourages usage of simplest model possible with ease of interpretability but this could limit the model’s predictive power. The more complex the more is, the better it is at making predictions but it comes with costs of overfitting and difficulty in interpreting the results.
Risk of information loss is another issue in variable selection. Deciding which features to keep uses up some of the data’s information, potentially leading to a loss of valuable insights. Also, Frank emphasizes the fact that the probability of selecting the right variables is nearly zero.
Even some of the good variable selection methods such as the Lasso are not always able to exclude the irrelevant unimportant predictors or select the important variables.