FAA1 <- read.csv("FAA1.csv", header = TRUE)
FAA2 <- read.csv("FAA2.csv", header = TRUE)
FAA2 <- FAA2[-c(151:200), ]
I used the read.csv function to load both csv files. I have previously saved the xls file downloaded from Canvas into csv files so it’s easier to load. Also, there were 50 rows of data with NA values, I have removed them.
str(FAA1)
'data.frame': 800 obs. of 8 variables:
$ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
$ duration : num 98.5 125.7 112 196.8 90.1 ...
$ no_pasg : int 53 69 61 56 70 55 54 57 61 56 ...
$ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
$ speed_air : num 109 103 NA NA NA ...
$ height : num 27.4 27.8 18.6 30.7 32.4 ...
$ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
$ distance : num 3370 2988 1145 1664 1050 ...
str(FAA2)
'data.frame': 150 obs. of 7 variables:
$ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
$ no_pasg : int 53 69 61 56 70 55 54 57 61 56 ...
$ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
$ speed_air : num 109 103 NA NA NA ...
$ height : num 27.4 27.8 18.6 30.7 32.4 ...
$ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
$ distance : num 3370 2988 1145 1664 1050 ...
Sample size of FAA1 is 800, FAA2 is 150. FAA1 has 8 different variables and FAA2 only has 7, FAA2 does not have the duration variable.
#outerjoin both FAA1 and FAA2
merged_data_outer <- merge(FAA1, FAA2, by = c("aircraft","no_pasg", "speed_ground","speed_air","height", "pitch", "distance"), all = TRUE)
#count the number of duplicates
num_duplicates <- sum(duplicated(merged_data_outer))
print(num_duplicates)
[1] 0
FAA <- merged_data_outer
attach(FAA)
I chose to use outer join so I can keep all unique data from both data sets. The outer join got rid of 100 rows of duplicate data, with duration values being NA. I used “duplicated” function to get a count of the duplicated data to verify there is no longer duplicate values, and it was zero.
str(FAA)
'data.frame': 850 obs. of 8 variables:
$ aircraft : chr "airbus" "airbus" "airbus" "airbus" ...
$ no_pasg : int 36 38 40 41 43 44 45 45 45 45 ...
$ speed_ground: num 47.5 85.2 80.6 97.6 82.5 ...
$ speed_air : num NA NA NA 97 NA ...
$ height : num 14 37 28.6 38.4 30.1 ...
$ pitch : num 4.3 4.12 3.62 3.53 4.09 ...
$ distance : num 251 1257 1021 2168 1321 ...
$ duration : num 172 188 93.5 123.3 109.2 ...
summary(FAA)
aircraft no_pasg speed_ground speed_air height pitch
Length:850 Min. :29.0 Min. : 27.74 Min. : 90.00 Min. :-3.546 Min. :2.284
Class :character 1st Qu.:55.0 1st Qu.: 65.90 1st Qu.: 96.25 1st Qu.:23.314 1st Qu.:3.642
Mode :character Median :60.0 Median : 79.64 Median :101.15 Median :30.093 Median :4.008
Mean :60.1 Mean : 79.45 Mean :103.80 Mean :30.144 Mean :4.009
3rd Qu.:65.0 3rd Qu.: 92.06 3rd Qu.:109.40 3rd Qu.:36.993 3rd Qu.:4.377
Max. :87.0 Max. :141.22 Max. :141.72 Max. :59.946 Max. :5.927
NA's :642
distance duration
Min. : 34.08 Min. : 14.76
1st Qu.: 883.79 1st Qu.:119.49
Median :1258.09 Median :153.95
Mean :1526.02 Mean :154.01
3rd Qu.:1936.95 3rd Qu.:188.91
Max. :6533.05 Max. :305.62
NA's :50
The new data set FAA has a sample size of 850 and 8 variables.
The total observation of the data has a sample size of 850. There are 8 variables in the dataset, and 7 are predictor variables, and one is the response variable.
There is a binary variable - Aircraft, has two values, Boeing and Airbus.
There are 50 missing values in the variable “duration”.
The average landing distance is 1526, however, there are extreme values such as a minimum of 34 and a maximum of 6533. These are not normal which may need further analysis.
There is a negative amount in the height variable, which is not normal.
#remove duration not greater than 40
FAA_new <- subset(FAA, duration > 40)
#remove speed_ground that's less than 30 or greater than 140
FAA_new <- subset(FAA_new, speed_ground >=30 & speed_ground <=140)
#remove speed_air that's less than 30 or greater than 140
FAA_new <- subset(FAA_new, speed_air >=30 & speed_ground <=140)
#remove height that's less than 6 meters
FAA_new <- subset(FAA_new, height >=6)
#remove distance that's over 6000 feet
FAA_new <- subset(FAA_new, distance < 6000)
Yes, there were abnormal values in the data set. Based on the variable dictionary, I have removed abnormal values, and now I have the sample size of 195.
str(FAA_new)
'data.frame': 195 obs. of 8 variables:
$ aircraft : chr "airbus" "airbus" "airbus" "airbus" ...
$ no_pasg : int 41 44 47 48 48 48 49 49 49 51 ...
$ speed_ground: num 97.6 99.6 92.9 101.8 109.3 ...
$ speed_air : num 97 99.2 95.8 103.6 109.6 ...
$ height : num 38.4 35.2 23.8 23 33.1 ...
$ pitch : num 3.53 3.84 3.91 4.94 4.04 ...
$ distance : num 2168 2116 1955 2525 3177 ...
$ duration : num 123 139 169 157 200 ...
summary(FAA_new)
aircraft no_pasg speed_ground speed_air height pitch
Length:195 Min. :41.00 Min. : 88.69 Min. : 90.00 Min. : 9.697 Min. :2.702
Class :character 1st Qu.:56.00 1st Qu.: 95.28 1st Qu.: 96.15 1st Qu.:23.365 1st Qu.:3.636
Mode :character Median :60.00 Median :100.75 Median :100.89 Median :29.837 Median :4.070
Mean :59.83 Mean :103.43 Mean :103.50 Mean :30.359 Mean :4.043
3rd Qu.:65.00 3rd Qu.:109.57 3rd Qu.:109.42 3rd Qu.:36.590 3rd Qu.:4.442
Max. :80.00 Max. :132.78 Max. :132.91 Max. :58.228 Max. :5.311
distance duration
Min. :1741 Min. : 45.5
1st Qu.:2161 1st Qu.:115.9
Median :2526 Median :149.3
Mean :2784 Mean :150.9
3rd Qu.:3186 3rd Qu.:185.4
Max. :5382 Max. :287.0
# Load the tidyverse package
library(tidyverse)
── Attaching core tidyverse packages ─────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2 ── Conflicts ───────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
selected_vars <- c("speed_ground", "speed_air", "height","pitch","duration","distance","no_pasg")
#aircraft was not selected as it was a categorical variable.
# Use walk to plot histograms for the selected variables
walk(selected_vars, ~ {
hist(FAA_new[[.x]], main = paste("Histogram of", .x),
xlab = .x, col = "blue", border = "white", prob = TRUE)
lines(density(FAA_new[[.x]], na.rm = TRUE), col = "red", lwd = 2)
})
We can see that speed_ground, speed_air, and distance are right skewed, and the rest of the variables shows normal distribution.
The new set of data has a sample size of 195, there are 7 numeric variables and one binary variable.
The speed_ground and speed_air value is very similar and both histogram shows skewed to the right. The mean speed is around 100 MPH.
Number of Passengers shows the average number is around 60 passengers per fight.
There are outliers in height, where the average is about 30, we observed that there are height as small as 9.6 and as large as 58.
The average of a flight being in the air is about 150.9.
# Load the knitr package for creating a formatted table.
library(knitr)
# Convert 'aircraft' to numeric: 1 for 'airbus', 0 for 'boeing'
FAA_new$aircraft_numeric <- ifelse(FAA_new$aircraft == "airbus", 1, 0)
# View the updated dataframe to confirm the changes
head(FAA_new)
factors <- c("speed_ground", "speed_air", "height", "pitch", "duration", "no_pasg", "aircraft_numeric")
# Initialize an empty data frame to store the results
correlations <- data.frame(
Variable = character(),
SizeOfCorrelation = numeric(),
DirectionOfCorrelation = character(),
stringsAsFactors = FALSE
)
# Calculate the pairwise correlations
for (var in factors) {
corr <- cor(FAA_new$distance, FAA_new[[var]], use = "complete.obs")
correlations <- rbind(correlations, data.frame(
Variable = var,
SizeOfCorrelation = abs(corr),
DirectionOfCorrelation = ifelse(corr > 0, "Positive", "Negative")
))
}
# Sort the data frame by the size of the correlation
correlations <- correlations[order(-correlations$SizeOfCorrelation), ]
# Print the table
cat("Table 1")
Table 1
print(correlations)
library(ggplot2)
selected_vars <- c("speed_ground", "speed_air", "height", "pitch", "duration", "no_pasg")
#Loop through each variable and create scatter plots
for (var in selected_vars) {
p <- ggplot(FAA_new, aes_string(x = var, y = "distance", color = "aircraft")) +
geom_point() +
ggtitle(paste("Scatter Plot of", var, "vs Landing Distance by Aircraft Type")) +
xlab(var) +
ylab("Distance") +
theme_minimal() +
scale_color_manual(values = c("blue", "red"))
print(p)
}
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
Please use tidy evaluation idioms with `aes()`.
See also `vignette("ggplot2-in-packages")` for more information.
Yes, the correlation strength observed in these plots is consistent with the values computed. Based on the plots, we can clearly see that the speed_air and speed_ground has strong linear relationship with the distance, when speed_air and speed_ground value increases, distance also increases. The rest of the variables are scattered in the plots.
I have included aircraft make in step 10. I wanted to include this variable to see if it has any correlation with the distance variable. Since it’s a categorical variable, I needed to transform the variable from a character variable to a numeric variable. In step 11, I decided to show the scatter plot grouped by the aircraft variable to provide better visualization.
# Initialize an empty data frame to store the results
regression_results <- data.frame(
Variable = character(),
P_Value = numeric(),
DirectionOfCoefficient = character(),
stringsAsFactors = FALSE
)
# Loop through each factor to perform regression, factors were created in previous steps
for (var in factors) {
# Define the regression formula
formula <- as.formula(paste("distance ~", var))
# Fit the linear model
model <- lm(formula, data = FAA_new)
# Get the p-value and coefficient for the factor
p_value <- summary(model)$coefficients[2, 4]
coefficient <- summary(model)$coefficients[2, 1]
# Determine the direction of the coefficient
direction <- ifelse(coefficient > 0, "Positive", "Negative")
# Append the results to the data frame
regression_results <- rbind(regression_results, data.frame(
Variable = var,
P_Value = p_value,
DirectionOfCoefficient = direction
))
}
# Sort the data frame by P-Value
regression_results <- regression_results[order(regression_results$P_Value), ]
# Print the final table
cat("Table 2")
Table 2
print(regression_results)
# Standardize each factor
for (var in factors) {
FAA_new[[paste(var, "standardized", sep = "_")]] <-
(FAA_new[[var]] - mean(FAA_new[[var]], na.rm = TRUE)) / sd(FAA_new[[var]], na.rm = TRUE)
}
# Initialize an empty data frame to store results
reg_results <- data.frame(
Variable = character(),
SizeOfCoefficient = numeric(),
Direction = character(),
stringsAsFactors = FALSE
)
# Loop through each standardized variable to perform regression
for (var in factors) {
# Define the regression formula using the standardized variable
formula <- as.formula(paste("distance ~", paste(var, "standardized", sep = "_")))
# Fit the linear model
model_standardized <- lm(formula, data = FAA_new)
# Get the coefficient for the standardized variable
coefficient2 <- summary(model_standardized)$coefficients[2, 1]
# Determine the direction of the coefficient
direction2 <- ifelse(coefficient2 > 0, "Positive", "Negative")
# Append the results to the data frame
reg_results <- rbind(reg_results, data.frame(
Variable = var,
SizeOfCoefficient = abs(coefficient2),
Direction = direction2
))
}
# Sort the data frame by Coefficient_Size in descending order
reg_results <- reg_results[order(reg_results$SizeOfCoefficient, decreasing = TRUE), ]
# Print the final table with a name
cat("Table 3")
Table 3
print(reg_results)
# Table 1: correlations (columns: Variable, SizeOfCorrelation, DirectionOfCorrelation)
# Table 2: regression_results (columns: Variable, P_Value, DirectionofCoefficient)
# Table 3: reg_results (columns: Variable, SizeOfCoefficient, Direction)
# Combine the tables into a single data frame
combined_results <- merge(correlations, regression_results, by = "Variable", suffixes = c("_Correlation", "_Significance"))
combined_results <- merge(combined_results, reg_results, by = "Variable")
# Add ranks for each metric
combined_results$Rank_Correlation <- rank(-abs(combined_results$SizeOfCorrelation), ties.method = "min")
combined_results$Rank_Significance <- rank(combined_results$P_Value, ties.method = "min")
combined_results$Rank_Coefficient <- rank(-combined_results$SizeOfCoefficient, ties.method = "min")
# Calculate the overall rank by averaging the ranks
combined_results$Overall_Rank <- rowMeans(combined_results[, c("Rank_Correlation", "Rank_Significance", "Rank_Coefficient")])
# Sort by the overall rank
combined_results <- combined_results[order(combined_results$Overall_Rank), ]
# Select the relevant columns for Table 0
table_0 <- combined_results[, c("Variable", "SizeOfCorrelation", "P_Value", "SizeOfCoefficient", "Overall_Rank")]
colnames(table_0) <- c("Variable", "Correlation", "P_Value", "Coefficient_Size", "Overall_Rank")
# Print Table 0
cat("Table 0")
Table 0
print(table_0)
# Fit Model 1: LD ~ Speed_ground
model1 <- lm(distance ~ speed_ground, data = FAA_new)
summary(model1)
Call:
lm(formula = distance ~ speed_ground, data = FAA_new)
Residuals:
Min 1Q Median 3Q Max
-714.00 -196.07 29.33 238.20 741.40
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5148.798 228.961 -22.49 <2e-16 ***
speed_ground 76.702 2.203 34.81 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 308.5 on 193 degrees of freedom
Multiple R-squared: 0.8626, Adjusted R-squared: 0.8619
F-statistic: 1212 on 1 and 193 DF, p-value: < 2.2e-16
# Fit Model 2: LD ~ Speed_air
model2 <- lm(distance ~ speed_air, data = FAA_new)
summary(model2)
Call:
lm(formula = distance ~ speed_air, data = FAA_new)
Residuals:
Min 1Q Median 3Q Max
-783.22 -189.61 2.73 215.76 623.27
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5417.607 208.860 -25.94 <2e-16 ***
speed_air 79.244 2.009 39.45 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 276.4 on 193 degrees of freedom
Multiple R-squared: 0.8897, Adjusted R-squared: 0.8891
F-statistic: 1556 on 1 and 193 DF, p-value: < 2.2e-16
# Fit Model 3: LD ~ Speed_ground + Speed_air
model3 <- lm(distance ~ speed_ground + speed_air, data = FAA_new)
summary(model3)
Call:
lm(formula = distance ~ speed_ground + speed_air, data = FAA_new)
Residuals:
Min 1Q Median 3Q Max
-820.6 -182.0 7.7 204.2 633.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5425.49 209.08 -25.950 < 2e-16 ***
speed_ground -12.32 12.98 -0.949 0.344
speed_air 91.63 13.20 6.941 5.82e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 276.5 on 192 degrees of freedom
Multiple R-squared: 0.8902, Adjusted R-squared: 0.889
F-statistic: 778.1 on 2 and 192 DF, p-value: < 2.2e-16
# Extract coefficients for each model
coeff_model1 <- summary(model1)$coefficients
coeff_model2 <- summary(model2)$coefficients
coeff_model3 <- summary(model3)$coefficients
# Display the coefficients
cat("Model 1 Coefficients:\n")
Model 1 Coefficients:
print(coeff_model1)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5148.79756 228.961387 -22.48762 8.139210e-56
speed_ground 76.70237 2.203368 34.81142 3.988133e-85
cat("\nModel 2 Coefficients:\n")
Model 2 Coefficients:
print(coeff_model2)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5417.60675 208.860347 -25.93890 8.071172e-65
speed_air 79.24368 2.008797 39.44833 2.550801e-94
cat("\nModel 3 Coefficients:\n")
Model 3 Coefficients:
print(coeff_model3)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5425.48760 209.07862 -25.9495097 1.093361e-64
speed_ground -12.31906 12.97593 -0.9493783 3.436217e-01
speed_air 91.62992 13.20051 6.9413917 5.819901e-11
# Check the correlation between Speed_ground and Speed_air
correlation_speed <- cor(FAA_new$speed_ground, FAA_new$speed_air, use = "complete.obs")
cat("\nCorrelation between Speed_ground and Speed_air:", correlation_speed, "\n")
Correlation between Speed_ground and Speed_air: 0.9883475
The correlation between speed_ground and speed_air is 0.9883475, which is a very high value and close to 1. It indicates a very strong positive linear relationship between the two variables. As speed_ground increases, speed_air also increases almost proportionally. Such a high correlation suggests multicollinearity when both variables are used together, so we need to eliminate one. Both variables have very small P-value which suggests its significant influence. However, the value of adjusted R-squared for speed_air is higher, which indicates that speed_air explains a greater proportion of the variance in distance variable.
var_ranked <- c("speed_air", "speed_ground", "aircraft_numeric","height", "duration", "pitch", "no_pasg")
# Initialize a vector to store R-squared values
r_squared_values <- numeric(length(var_ranked))
# Fit models and calculate R-squared values
for (i in seq_along(var_ranked)) {
# Create the formula dynamically
formula <- as.formula(paste("distance ~", paste(var_ranked[1:i], collapse = " + ")))
# Fit the model
model_rank <- lm(formula, data = FAA_new)
# Extract the R-squared value
r_squared_values[i] <- summary(model_rank)$r.squared
}
# Create a data frame for plotting
r_squared_rank <- data.frame(
Num_Variables = seq_along(var_ranked),
R_Squared = r_squared_values
)
# Plot the R-squared values versus the number of variables
ggplot(r_squared_rank, aes(x = Num_Variables, y = R_Squared)) +
geom_line() +
geom_point() +
labs(title = "R-squared vs Number of Variables",
x = "Number of Variables",
y = "R-squared") +
theme_minimal()
# Print the R-squared values for reference
print(r_squared_rank)
It looks like the \(R^2\) value increases when the number of variables increase as well, however, the value is pretty stable when the variable reach four.
adj_r_squared_values <- numeric(length(var_ranked))
# Fit models and calculate Adjusted R-squared values
for (i in seq_along(var_ranked)) {
# Create the formula dynamically
formula_adj <- as.formula(paste("distance ~", paste(var_ranked[1:i], collapse = " + ")))
# Fit the model
model_adj <- lm(formula_adj, data = FAA_new)
# Extract the R-squared value
adj_r_squared_values[i] <- summary(model_adj)$adj.r.squared
}
# Create a data frame for plotting
adj_r_squared <- data.frame(
Num_Variables = seq_along(var_ranked),
Adj_R_Squared = adj_r_squared_values
)
# Plot the R-squared values versus the number of variables
ggplot(adj_r_squared, aes(x = Num_Variables, y = Adj_R_Squared)) +
geom_line() +
geom_point() +
labs(title = "Adjusted R-squared vs Number of Variables",
x = "Number of Variables",
y = "Adjusted R-squared") +
theme_minimal()
# Print the R-squared values for reference
print(adj_r_squared)
The value of adjusted R squared looks very similar compared to R squared values. The value increases when the number of variables increase, and became stable once it reaches four variables.
aic_values <- numeric(length(var_ranked))
# Fit models and calculate AIC values
for (i in seq_along(var_ranked)) {
# Create the formula dynamically
formula_aic <- as.formula(paste("distance ~", paste(var_ranked[1:i], collapse = " + ")))
# Fit the model
model_aic <- lm(formula_aic, data = FAA_new)
# Extract the AIC value
aic_values[i] <- AIC(model_aic)
}
# Create a data frame for plotting
aic_plot <- data.frame(
Num_Variables = seq_along(var_ranked),
AIC = aic_values
)
# Plot the R-squared values versus the number of variables
ggplot(aic_plot, aes(x = Num_Variables, y = AIC)) +
geom_line() +
geom_point() +
labs(title = "AIC vs Number of Variables",
x = "Number of Variables",
y = "AIC") +
theme_minimal()
# Print the R-squared values for reference
print(aic_plot)
It appears that the AIC decrease as more variables are added to the model. However, it stops decreasing when variable reaches four, and the lowest AIC value is when the variable number is four. It suggests the optimal model size could be four.
Based on the results in steps 18, 19, it indicates that four variable is the optimal choice, therefore the variables will be speed_air, speed_ground, aircraft, height. Although, from previous steps, we learned there is multicollinearity between speed_ground and speed_air, so we could eliminate one. Also, aircraft is a categorical variable, instead of including it, maybe have the other variable results grouped by aircraft could also be sufficient.
# Load the necessary library for stepAIC
library(MASS)
Attaching package: ‘MASS’
The following object is masked from ‘package:dplyr’:
select
# Create a full model using all predictors
full_model <- lm(distance ~ ., data = FAA_new[, c("distance", var_ranked)])
# Perform forward selection using stepAIC
step_model <- stepAIC(full_model, direction = "forward", trace = FALSE)
# Summary of the stepwise-selected model
summary(step_model)
Call:
lm(formula = distance ~ speed_air + speed_ground + aircraft_numeric +
height + duration + pitch + no_pasg, data = FAA_new[, c("distance",
var_ranked)])
Residuals:
Min 1Q Median 3Q Max
-287.36 -85.69 11.25 83.50 359.85
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5791.6573 163.2346 -35.481 <2e-16 ***
speed_air 85.5469 6.5221 13.117 <2e-16 ***
speed_ground -3.5464 6.4160 -0.553 0.581
aircraft_numeric -437.9428 21.2621 -20.597 <2e-16 ***
height 13.6756 1.0386 13.168 <2e-16 ***
duration 0.1276 0.2039 0.626 0.532
pitch -13.4897 18.6077 -0.725 0.469
no_pasg -1.9812 1.3780 -1.438 0.152
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 134.5 on 187 degrees of freedom
Multiple R-squared: 0.9747, Adjusted R-squared: 0.9738
F-statistic: 1030 on 7 and 187 DF, p-value: < 2.2e-16
# Print the selected variables
selected_variables <- names(coef(step_model))
print(selected_variables)
[1] "(Intercept)" "speed_air" "speed_ground" "aircraft_numeric" "height"
[6] "duration" "pitch" "no_pasg"
# Compare the AIC value of the StepAIC selected model with manual AIC values
step_model_aic <- AIC(step_model)
manual_aic <- aic_values
# Print the AIC of the StepAIC selected model
print(paste("StepAIC Selected Model AIC:", step_model_aic))
[1] "StepAIC Selected Model AIC: 2474.69216451066"
# Compare with manual AIC values
print(manual_aic)
[1] 2749.962 2751.048 2597.804 2471.938 2473.334 2474.836 2474.692
From the result of stepAIC function, we can see that the AIC values are very much the same. StepAIC selected 7 variables, however, when looking at the summary of AIC model, there is only three showed significant impact, which is the similar results as the AIC calculation in Step 19 (speed_ground has been eliminated).