Set Working Directory
setwd("~/OneNote Notebooks/7042-Stat Modeling/Assignment 1")
Open Libraries
library(readxl) # To read in datasets
library(tidyverse) # For Data Manipulation
library(knitr) # To create tables
Read the two datasets into the system and name them. We name them for ease of use when calling in later coding exercises and analysis.
# Read in Dataset 1
FAA1 <- read_xls("FAA1-1.xls")
# Read in Dataset 2
FAA2 <- read_xls("FAA2-1.xls")
A good first step to begin exploration of the datasets is to check the structure of the datasets.
As seen below, FAA1-1 has 800 observations and 8 variables.
# Observe the structure of Dataset 1
str(FAA1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 800 obs. of 8 variables:
## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
As seen below, FAA2-1 has 150 observations and 7 variables.
# Observe the structure of Dataset 2
str(FAA2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 150 obs. of 7 variables:
## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
The differences between the two datasets (as observed from the R code and output) are listed below:
FAA1 has an additional variable called duration
which FAA2 does not have. (All other variables are identical)
FAA2 has much fewer observations than the FAA1 dataset.
In this step we will merge the two datasets. I used a left_join
to automatically filter out any duplicates that exist between the two datasets. Note that we performed a left_join
because the FAA1 dataset has an extra variable and more datapoints.
# Merge Datasets 1 and 2 via left_join. Join on a combination of columns since a unique ID does not exist.
FAA_merge <- left_join(x = FAA1, y = FAA2)
## Joining, by = c("aircraft", "no_pasg", "speed_ground", "speed_air", "height", "pitch", "distance")
# Oberve an overview of the merged dataset
glimpse(FAA_merge)
## Observations: 800
## Variables: 8
## $ aircraft <chr> "boeing", "boeing", "boeing", "boeing", "boeing",...
## $ duration <dbl> 98.47909, 125.73330, 112.01700, 196.82569, 90.095...
## $ no_pasg <dbl> 53, 69, 61, 56, 70, 55, 54, 57, 61, 56, 61, 54, 5...
## $ speed_ground <dbl> 107.91568, 101.65559, 71.05196, 85.81333, 59.8885...
## $ speed_air <dbl> 109.32838, 102.85141, NA, NA, NA, NA, NA, NA, NA,...
## $ height <dbl> 27.41892, 27.80472, 18.58939, 30.74460, 32.39769,...
## $ pitch <dbl> 4.043515, 4.117432, 4.434043, 3.884236, 4.026096,...
## $ distance <dbl> 3369.8364, 2987.8039, 1144.9224, 1664.2182, 1050....
Since the new dataset has 800 observations, we can conclude that the two datasets had 100 duplicate observations. Fifty unique observations from the FAA2 were joined with the FAA1 dataset.
As performed above, we can check the structure of the combined dataset.
# Observe the structure of the Merged Dataset
str(FAA_merge)
## Classes 'tbl_df', 'tbl' and 'data.frame': 800 obs. of 8 variables:
## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
The sample size of the dataset is 800 and the dataset has 8 variables, 7 of which are numeric. Below are the summary statistics for each variable.
# View the summary statistics of the Merged Dataset
summary(FAA_merge)
## aircraft duration no_pasg speed_ground
## Length:800 Min. : 14.76 Min. :29.00 Min. : 27.74
## Class :character 1st Qu.:119.49 1st Qu.:55.00 1st Qu.: 65.87
## Mode :character Median :153.95 Median :60.00 Median : 79.64
## Mean :154.01 Mean :60.13 Mean : 79.54
## 3rd Qu.:188.91 3rd Qu.:65.00 3rd Qu.: 92.33
## Max. :305.62 Max. :87.00 Max. :141.22
##
## speed_air height pitch distance
## Min. : 90.00 Min. :-3.546 Min. :2.284 Min. : 34.08
## 1st Qu.: 96.16 1st Qu.:23.338 1st Qu.:3.658 1st Qu.: 900.95
## Median :100.99 Median :30.147 Median :4.020 Median :1267.44
## Mean :103.83 Mean :30.122 Mean :4.018 Mean :1544.52
## 3rd Qu.:109.48 3rd Qu.:36.981 3rd Qu.:4.388 3rd Qu.:1960.44
## Max. :141.72 Max. :59.946 Max. :5.927 Max. :6533.05
## NA's :600
If we were to identify a few findings at this point in our analysis, below would be the points presented:
The landing distance of the aircraft is typically less than 6,000 feet. We see in this dataset, however, that there exist observations where distance
is greater than 6,000 feet. This means that some planes most likely overran their runway.
The landing aircraft is required to be at least 6 meters high at the threshold of the runway. We see from the summary statistics that our data input may not be entirely accurate, since our minimum value of height
is a negative number. This value should never be negative.
speed_air
contains 600 missing values and should be taken into consideration when performing analysis on this dataset.
It was noted in the Variable Dictionary that the duration of a normal flight should always be greater than 40 minutes; however, there exists an observation whose duration was 14.76 minutes (shown in the summary statistics).
It was noted in the Variable Dictionary that the ground speed and the air speed of an aircraft when passing over the threshold of the runway should be between 30 and 140 MPH to be considered normal. From the summary statistics, we see that there exist abnormal flights in this dataset. That is, there exist observations where speed_ground
or speed_air
is less than 30 or greater than 140 MPH.
As noted in step 5, there are multiple abnormal values in the dataset. A value is defined as “abnormal” in the data dictionary provided.
We see these abnormal flights primarily in the height
variable, where we have a negative value in an observation. We can see how many instances exist:
# Filter for observations with height value less than zero
FAA_merge %>%
filter(height < 0) %>%
nrow()
## [1] 5
There were 5 observations where height
was negative. Height should never be negative, due to physics.
If the duration of the flight is less than or equal to 40 minutes, the flight is considered abnormal. Let us filter for these observations to count how many exist.
# Filter for observations with flight duration value less than 40 minutes
FAA_merge %>%
filter(duration <= 40) %>%
nrow()
## [1] 5
There were 5 observations where duration
was less than or equal to 40 minutes.
Abnormal flights are also defined as those with speed_air
and/or speed_ground
less than 30 MPH or greater than 140 MPH.
# Filter for observations with air or ground speed values less than 30 or greater than 140 MPH.
FAA_merge %>%
filter(speed_air < 30 | speed_air > 140 |
speed_ground < 30 | speed_ground > 140) %>%
nrow()
## [1] 3
There were 3 observations where speed_air
or speed_ground
was less than 30 MPH or greater than 140 MPH.
Let us filter out these observations identified above and create a new working dataset.
# Filter dataset for only observations deemed normal
FAA_merge <- FAA_merge %>%
filter(height >= 0) %>%
filter(duration > 40) %>%
filter(speed_air >= 30| is.na(speed_air)) %>%
filter(speed_air <= 140 | is.na(speed_air)) %>%
filter(speed_ground >= 30) %>%
filter(speed_ground <= 140)
800 - nrow(FAA_merge)
## [1] 13
Thirteen total observations have been deemed abnormal and have consequently been removed.
Note: We have retained observations with missing values.
In step 7, we repeat step 4 but with the dataset that now only contains normal flights. We therefore check the structure of the combined dataset as well as provide summary statistics.
# Observe the structure of the Merged and Filtered Dataset
str(FAA_merge)
## Classes 'tbl_df', 'tbl' and 'data.frame': 787 obs. of 8 variables:
## $ aircraft : chr "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num 109 103 NA NA NA ...
## $ height : num 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num 3370 2988 1145 1664 1050 ...
# View the summary statistics of the Merged and Filtered Dataset
summary(FAA_merge)
## aircraft duration no_pasg speed_ground
## Length:787 Min. : 41.95 Min. :29.00 Min. : 33.57
## Class :character 1st Qu.:119.68 1st Qu.:55.00 1st Qu.: 66.01
## Mode :character Median :154.24 Median :60.00 Median : 79.73
## Mean :154.80 Mean :60.13 Mean : 79.61
## 3rd Qu.:189.62 3rd Qu.:65.00 3rd Qu.: 92.13
## Max. :305.62 Max. :87.00 Max. :136.66
##
## speed_air height pitch distance
## Min. : 90.00 Min. : 0.08611 Min. :2.284 Min. : 41.72
## 1st Qu.: 96.16 1st Qu.:23.46558 1st Qu.:3.654 1st Qu.: 902.82
## Median :100.99 Median :30.16708 Median :4.014 Median :1267.76
## Mean :103.67 Mean :30.29378 Mean :4.015 Mean :1541.02
## 3rd Qu.:109.48 3rd Qu.:36.98364 3rd Qu.:4.385 3rd Qu.:1960.41
## Max. :136.42 Max. :59.94596 Max. :5.927 Max. :6309.95
## NA's :591
There are 787 observations and 8 variables in this dataset. The variable speed_air
has 591 missing values.
Let us create histograms for our 7 numeric variables. These histograms will provide insight regarding the distribution and variability of the observations within each variable. See Step 9 for conclusions regarding these histograms.
par(mfrow = c(4,2))
hist(FAA_merge$duration, main = "duration", xlab = NA)
hist(FAA_merge$no_pasg, main = "no_pasg", xlab = NA)
hist(FAA_merge$speed_ground, main = "speed_ground", xlab = NA)
hist(FAA_merge$speed_air, main = "speed_air", xlab = NA)
hist(FAA_merge$height, main = "height", xlab = NA)
hist(FAA_merge$pitch, main = "pitch", xlab = NA)
hist(FAA_merge$distance, main = "distance", xlab = NA)
Another “presentation slide” one might want to present to the FAA agents would include the following points:
Given the fact that our dataset no longer includes flights deemed “abnormal,” we can say that a normal flight’s ground speed on average is around 80 MPH, and the distribution of the variation in ground speed is symmetrical. This is different than the distribution of air speed, which is around 100 MPH but is strongly skewed to the right (meaning there are likely outliers).
The landing distance is highly skewed to the right as well, however very few landings go over 6000 feet.
Apart from speed_air
and distance
, most variables are evenly distributed. Not only do the histograms show this, but the mean and median (listed in the summary statistics) for each variable are similar.
There exist 206 observations with complete data. The entire dataset contains 787 unique observations.
These are basic observations. Further analysis can break down these statistics by carrier or aircraft size.
In this step we compute the pairwise correlation between the landing distance and each factor X. Since we are not concerned with the correlation between the various X factors, we will manually create this correlation table.
The table below (called Table 1) ranks the factors based on correlation strength (indicated by the absolute value) from strongest to weakest.
# Create a data frame by inputting data and correcting values
FAA_cor <- as.data.frame(cor(y = FAA_merge[,8], x = FAA_merge[,c(2:7)], use = "na.or.complete"))
FAA_cor <- FAA_cor %>%
mutate(X_Variable = rownames(FAA_cor),
Cor_Direction = "positive",
Distance = round(abs(distance),3)
)
FAA_cor[2,3] = "negative"
colnames(FAA_cor)[1] <- "Cor_with_Distance"
# Reorder the columns of the table
FAA_cor <- FAA_cor[,c(2,1,3)]
# Sort the table from strongest to weakest correlation
FAA_cor <- FAA_cor[order(FAA_cor$Cor_with_Distance, decreasing = T),]
#Print the table
kable(FAA_cor, row.names = F, caption = "Table 1", align = 'c', digits = 3)
X_Variable | Cor_with_Distance | Cor_Direction |
---|---|---|
speed_air | 0.945 | positive |
speed_ground | 0.932 | positive |
height | 0.086 | positive |
pitch | 0.037 | positive |
duration | 0.037 | positive |
no_pasg | -0.019 | negative |
The variables speed_air
and speed_ground
have the strongest correlation to the response variable distance
. Both correlations are positive, meaning that an increase in either air or ground speed correlates to a longer landing distance. The number of passagengers on the plane negatively correlates to the landing distance. That is, smaller planes are correlated to smaller landing distances.
We can also display scatterplots of the variables on our dependent variable. These scatterplots are great visual representations for our correlation table. It is clear that Ground Speed (speed_ground
) and Air Speed (speed_air
) are highly correlated with our response variable, distance
. We see this because the datapoints have small variability (less scattered) around a general upward trend. The other variables have no apparent pattern. It is interesting to note that speed_ground
has a slight curve to the pattern shown in the scatterplot, which potentially indicates a nonlinear relationship between the X variable and distance
.
par(mfrow = c(3,2))
plot(x = FAA_merge$duration, y = FAA_merge$distance, main = "Distance by Duration")
plot(x = FAA_merge$no_pasg, y = FAA_merge$distance, main = "Distance by Passanger Amount")
plot(x = FAA_merge$speed_ground, y = FAA_merge$distance, main = "Distance by Ground Speed")
plot(x = FAA_merge$speed_air, y = FAA_merge$distance, main = "Distance by Air Speed")
plot(x = FAA_merge$height, y = FAA_merge$distance, main = "Distance by Height")
plot(x = FAA_merge$pitch, y = FAA_merge$distance, main = "Distance by Pitch")
Above, our analysis only includes numeric variables. We can also include the aircraft
variable by coding the character variable as 0/1 binary. If aircraft
is equal to 1, it is a boeing aircraft. Otherwise, its make is airbus. After encoding this variable as a binary variable, we can repeat Steps 10 and 11 and include the variable in the table and scatterplots.
# Make `aircraft` binary
FAA_merge$aircraft [FAA_merge$aircraft == "boeing"] <- 1
FAA_merge$aircraft [FAA_merge$aircraft == "airbus"] <- 0
FAA_merge$aircraft <- as.integer(FAA_merge$aircraft)
# Repeat steps: Make and format a Data Table
FAA_cor.12 <- as.data.frame(cor(y = FAA_merge[,8], x = FAA_merge[,c(1:7)], use = "na.or.complete"))
FAA_cor.12 <- FAA_cor.12 %>%
mutate(X_Variable = rownames(FAA_cor.12),
Cor_Direction = "positive",
Distance = round(abs(distance),3)
)
FAA_cor.12[3,3] = "negative"
colnames(FAA_cor.12)[1] <- "Cor_with_Distance"
# Order the columns and Sort the rows
FAA_cor.12 <- FAA_cor.12[,c(2,1,3)]
FAA_cor.12 <- FAA_cor.12[order(FAA_cor.12$Cor_with_Distance, decreasing = T),]
# Print the Table
kable(FAA_cor.12, row.names = F, caption = "Table 1.1", align = 'c')
X_Variable | Cor_with_Distance | Cor_Direction |
---|---|---|
speed_air | 0.9452968 | positive |
speed_ground | 0.9317169 | positive |
aircraft | 0.1815418 | positive |
height | 0.0856803 | positive |
pitch | 0.0372342 | positive |
duration | 0.0367128 | positive |
no_pasg | -0.0187942 | negative |
We can illustrate again these correlations in Table 1.1 through various paired scatterplots. The aircraft correlation plot is consistent with the positive correlation given in the table. Note that all other scatterplots are the same as shown in step 11.
par(mfrow = c(4,2))
plot(x = FAA_merge$aircraft, y = FAA_merge$distance, main = "Distance by Pitch")
plot(x = FAA_merge$duration, y = FAA_merge$distance, main = "Distance by Duration")
plot(x = FAA_merge$no_pasg, y = FAA_merge$distance, main = "Distance by Passanger Amount")
plot(x = FAA_merge$speed_ground, y = FAA_merge$distance, main = "Distance by Ground Speed")
plot(x = FAA_merge$speed_air, y = FAA_merge$distance, main = "Distance by Air Speed")
plot(x = FAA_merge$height, y = FAA_merge$distance, main = "Distance by Height")
plot(x = FAA_merge$pitch, y = FAA_merge$distance, main = "Distance by Pitch")
In step 13, we regress Y on each of the X variables. This means we will have 7 separate linear models (one for each X variable in the dataset). We can summarize these linear models by creating a table (Table 2 below) that ranks these X factors by its significance regarding the response variable. In the table you will find the name of the X Variable, its significance regarding the response variable, and the direction of the factor’s coefficient (indicating a positive or negative relationship).
# Run Linear Models
lm1.13 <- lm(distance ~ aircraft, data = FAA_merge)
lm2.13 <- lm(distance ~ duration, data = FAA_merge)
lm3.13 <- lm(distance ~ no_pasg, data = FAA_merge)
lm4.13 <- lm(distance ~ speed_ground, data = FAA_merge)
lm5.13 <- lm(distance ~ speed_air, data = FAA_merge)
lm6.13 <- lm(distance ~ height, data = FAA_merge)
lm7.13 <- lm(distance ~ pitch, data = FAA_merge)
# Create the table that ranks the factors based on significance
a.13 <- colnames(FAA_merge)[1:7]
b.13 <- c(as.numeric(anova(lm1.13)$'Pr(>F)'[1]),
as.numeric(anova(lm2.13)$'Pr(>F)'[1]),
as.numeric(anova(lm3.13)$'Pr(>F)'[1]),
as.numeric(anova(lm4.13)$'Pr(>F)'[1]),
as.numeric(anova(lm5.13)$'Pr(>F)'[1]),
as.numeric(anova(lm6.13)$'Pr(>F)'[1]),
as.numeric(anova(lm7.13)$'Pr(>F)'[1]))
c.13 <- c(lm1.13$coefficients[2],
lm2.13$coefficients[2],
lm3.13$coefficients[2],
lm4.13$coefficients[2],
lm5.13$coefficients[2],
lm6.13$coefficients[2],
lm7.13$coefficients[2])
d.13 <- c('positive','negative','negative','positive','positive','positive','positive')
pvalues.13 <- data.frame(a.13,b.13,d.13)
colnames(pvalues.13) <- c("X_Variable", "P_Value", "Dir_of_Reg_Coeff")
# Print the table
kable(pvalues.13[order(pvalues.13$P_Value, decreasing = F),],row.names = F, caption = "Table 2", align = 'c')
X_Variable | P_Value | Dir_of_Reg_Coeff |
---|---|---|
speed_ground | 0.0000000 | positive |
speed_air | 0.0000000 | positive |
aircraft | 0.0000000 | positive |
height | 0.0003499 | positive |
pitch | 0.0607952 | positive |
duration | 0.1107031 | negative |
no_pasg | 0.5842035 | negative |
We can also standardize each X variable by creating a new variable. This will help us better compare the factors and their impact on the response variable. Standardizing these X variables means we interpret them differently. This output will tell us that with every 1 standard deviation increase in the X Variable, the Y Variable will increase by (1)*(the X’s Regression Coefficient) standard deviations.
# Standardize the X Variables
FAA_merge$aircraft_prime <- (FAA_merge$aircraft - mean(FAA_merge$aircraft)) / sd(FAA_merge$aircraft)
FAA_merge$duration_prime <- (FAA_merge$duration - mean(FAA_merge$duration)) / sd(FAA_merge$duration)
FAA_merge$no_pasg_prime <- (FAA_merge$no_pasg - mean(FAA_merge$no_pasg)) / sd(FAA_merge$no_pasg)
FAA_merge$speed_ground_prime <- (FAA_merge$speed_ground - mean(FAA_merge$speed_ground)) / sd(FAA_merge$speed_ground)
FAA_merge$speed_air_prime <- (FAA_merge$speed_air - mean(FAA_merge$speed_air, na.rm=T)) / sd(FAA_merge$speed_air, na.rm=T)
FAA_merge$height_prime <- (FAA_merge$height - mean(FAA_merge$height)) / sd(FAA_merge$height)
FAA_merge$pitch_prime <- (FAA_merge$pitch - mean(FAA_merge$pitch)) / sd(FAA_merge$pitch)
We now repeat step 13 and Regress Y on each of the X’ (X-Prime) Variables. The resulting summary table will tell us how many standard deviations the Y variable will increase (or decrease, if negative) with one standard deviation increase in the X variable.
# Regress Y on X'
lm1.14 <- lm(distance ~ aircraft_prime, data = FAA_merge)
lm2.14 <- lm(distance ~ duration_prime, data = FAA_merge)
lm3.14 <- lm(distance ~ no_pasg_prime, data = FAA_merge)
lm4.14 <- lm(distance ~ speed_ground_prime, data = FAA_merge)
lm5.14 <- lm(distance ~ speed_air_prime, data = FAA_merge)
lm6.14 <- lm(distance ~ height_prime, data = FAA_merge)
lm7.14 <- lm(distance ~ pitch_prime, data = FAA_merge)
# Create the data frame for our output table
a.14 <- colnames(FAA_merge)[9:15]
c.14 <- c(abs(lm1.14$coefficients[2]),
abs(lm2.14$coefficients[2]),
abs(lm3.14$coefficients[2]),
abs(lm4.14$coefficients[2]),
abs(lm5.14$coefficients[2]),
abs(lm6.14$coefficients[2]),
abs(lm7.14$coefficients[2]))
d.14 <- c('positive','negative','negative','positive','positive','positive','positive')
primevalues.14 <- data.frame(a.14,c.14,d.14)
colnames(primevalues.14) <- c("X_Variable", "Size_of_Reg_Coeff", "Dir_of_Reg_Coeff")
# Print the table
kable(primevalues.14[order(primevalues.14$Size_of_Reg_Coeff, decreasing = T),],row.names = F, caption = "Table 3", align = 'c')
X_Variable | Size_of_Reg_Coeff | Dir_of_Reg_Coeff |
---|---|---|
speed_air_prime | 818.06654 | positive |
speed_ground_prime | 799.07204 | positive |
aircraft_prime | 208.81380 | positive |
height_prime | 117.01884 | positive |
pitch_prime | 61.54969 | positive |
duration_prime | 52.37631 | negative |
no_pasg_prime | 17.98305 | negative |
Correlation strength is semi-consistent with regressor significance (specified by its P-value). This makes sense, because we would expect highly correlated regressors with the response variable to also have higher modeling power (note we are not saying that correlation implies causation). We can see in Table 0 the factors based on their relative importance in determining the landing distance.
# Generalize Standardized Table 3
a.14 <- colnames(FAA_merge)[1:7]
primevalues.14 <- data.frame(a.14,c.14,d.14)
colnames(primevalues.14) <- c("X_Variable", "Stdized_Coeff", "Dir_of_Stdized_Coeff")
# Join the Statistics from all 3 tables together
join1 <- full_join(FAA_cor.12, pvalues.13, by= 'X_Variable')
## Warning: Column `X_Variable` joining character vector and factor, coercing
## into character vector
join2 <- full_join(join1, primevalues.14, by= 'X_Variable')
## Warning: Column `X_Variable` joining character vector and factor, coercing
## into character vector
join2 <- join2[order(join2$Stdized_Coeff, decreasing = T),]
join2 <- join2[c(1,3:7,2),]
# Print the Table
kable(join2,row.names = F, caption = "Table 0", align = 'c')
X_Variable | Cor_with_Distance | Cor_Direction | P_Value | Dir_of_Reg_Coeff | Stdized_Coeff | Dir_of_Stdized_Coeff |
---|---|---|---|---|---|---|
speed_air | 0.9452968 | positive | 0.0000000 | positive | 818.06654 | positive |
aircraft | 0.1815418 | positive | 0.0000000 | positive | 208.81380 | positive |
height | 0.0856803 | positive | 0.0003499 | positive | 117.01884 | positive |
pitch | 0.0372342 | positive | 0.0607952 | positive | 61.54969 | positive |
duration | 0.0367128 | positive | 0.1107031 | negative | 52.37631 | negative |
no_pasg | -0.0187942 | negative | 0.5842035 | negative | 17.98305 | negative |
speed_ground | 0.9317169 | positive | 0.0000000 | positive | 799.07204 | positive |
Note that we put speed_ground
last because in the next step (Step 16), we determine to drop this variable from the model.
We are asked to consider the regression coefficients of the three models:
# Create the linear models
lm1 <- lm(distance ~ speed_ground, data = FAA_merge)
lm2 <- lm(distance ~ speed_air, data = FAA_merge)
lm3 <- lm(distance ~ speed_ground + speed_air, data = FAA_merge)
# Print the Coefficients of each model
par(mfrow=c(1,3))
kable(lm1$coefficients, caption = "Model 1", col.names = c("Coefficient"), digits = 1, format.args = list(big.mark = ','), align = "c")
Coefficient | |
---|---|
(Intercept) | -1,810.1 |
speed_ground | 42.1 |
kable(lm2$coefficients, caption = "Model 2", col.names = c("Coefficient"), digits = 1, format.args = list(big.mark = ','), align = "c")
Coefficient | |
---|---|
(Intercept) | -5,568.5 |
speed_air | 80.7 |
kable(lm3$coefficients, caption = "Model 3", col.names = c("Coefficient"), digits = 1, format.args = list(big.mark = ','), align = "c")
Coefficient | |
---|---|
(Intercept) | -5,576.4 |
speed_ground | -12.1 |
speed_air | 92.9 |
When we compare these three models, we see that speed_ground
not only changes drastically in size, but also sign. The variable speed_air
also changes quite a bit, but not nearly as much as speed_ground
. Adding or removing covariates from a regression model should not drastically impact the value of the regressor coefficient. If it does, it indicates that multicollinearity exists.
Let’s look at the collinearity between the two covariates used.
cor(FAA_merge[,4], FAA_merge[,5], use = "na.or.complete")
## speed_air
## speed_ground 0.988969
These two variables are highly correlated, with an R value of 0.989. Therefore only one of the two variables should be in the final model to avoid multicollinearity. We can drop the variable with the higer variance inflation factor.
vif(lm3)
## speed_ground speed_air
## 45.57813 45.57813
Since each have the same VIF, we can then drop the X variable that is less correlated with the response variable.
cor(FAA_merge[,c(4:5,8)], use = "na.or.complete")
## speed_ground speed_air distance
## speed_ground 1.0000000 0.9889690 0.9317169
## speed_air 0.9889690 1.0000000 0.9452968
## distance 0.9317169 0.9452968 1.0000000
Since speed_ground
is less correlated than speed_air
with distance
(albeit, the difference is small), we can choose to drop speed_ground
from the model.
We will now fit multiple models, each model having one more covariate than the preceding model.
lm1.17 <- lm(distance ~ speed_air, data = FAA_merge)
lm2.17 <- lm(distance ~ speed_air + aircraft, data = FAA_merge)
lm3.17 <- lm(distance ~ speed_air + aircraft + height, data = FAA_merge)
lm4.17 <- lm(distance ~ speed_air + aircraft + height + pitch, data = FAA_merge)
lm5.17 <- lm(distance ~ speed_air + aircraft + height + pitch + duration, data = FAA_merge)
lm6.17 <- lm(distance ~ speed_air + aircraft + height + pitch + duration + no_pasg, data = FAA_merge)
lm7.17 <- lm(distance ~ speed_air + aircraft + height + pitch + duration + no_pasg + speed_ground, data = FAA_merge)
We next can see (below) a common pattern where adding more covariates to the multiple linear regression model results in a higher R-squared.
# Calculate the R-Squared Values
rsqd <- c(summary(lm1.17)$r.squared,
summary(lm2.17)$r.squared,
summary(lm3.17)$r.squared,
summary(lm4.17)$r.squared,
summary(lm5.17)$r.squared,
summary(lm6.17)$r.squared,
summary(lm7.17)$r.squared
)
cov <- c(1:7)
# Plot
plot(x = cov, y = rsqd, main = "R-Squared as a Function of Covariate Quantity",
xlab = "Number of Covariates", ylab = "R-Squared", type = "b")
# Table
kable(cbind(cov, rsqd), digits = 4, align = "c")
cov | rsqd |
---|---|
1 | 0.8936 |
2 | 0.9513 |
3 | 0.9751 |
4 | 0.9751 |
5 | 0.9752 |
6 | 0.9754 |
7 | 0.9754 |
Let us plot the number of covariates against another statistic, Adjusted R-Squared. This statistic differs from regular R-Squared in that it penalizes models that have a larger number of covariates.
# Calculate the Adjusted R-Squared Values
adj.rsqd <- c(summary(lm1.17)$adj.r.squared,
summary(lm2.17)$adj.r.squared,
summary(lm3.17)$adj.r.squared,
summary(lm4.17)$adj.r.squared,
summary(lm5.17)$adj.r.squared,
summary(lm6.17)$adj.r.squared,
summary(lm7.17)$adj.r.squared
)
# Plot
plot(x = cov, y = adj.rsqd, main = "Adjusted R-Squared as a Function of Covariate
Quantity", xlab = "Number of Covariates", ylab = "Adjusted R-Squared", type = "b")
# Table
kable(cbind(cov, adj.rsqd), digits = 4, align = "c")
cov | adj.rsqd |
---|---|
1 | 0.8930 |
2 | 0.9507 |
3 | 0.9747 |
4 | 0.9746 |
5 | 0.9745 |
6 | 0.9746 |
7 | 0.9745 |
Here we actually see adjusted R-Squared fluctuate with ever additional covariate added, rather than before with the regular R-Squared when the trend was strictly increasing. We included a table of values to better show this.
Finally, let us plot the number of covariates against yet another statistic, AIC. The smaller the AIC, the better the fit the model is to the data.
# Calculate the AIC Values
aic.fig <- c(AIC(lm1.17),
AIC(lm2.17),
AIC(lm3.17),
AIC(lm4.17),
AIC(lm5.17),
AIC(lm6.17),
AIC(lm7.17)
)
# Plot
plot(x = cov, y = aic.fig, xlab = "Number of Covariates", ylab = "AIC", type = "b")
# Table
kable(cbind(cov, aic.fig), digits = 4, align = "c", format.args = list(big.mark = ","))
cov | aic.fig |
---|---|
1 | 2,773.274 |
2 | 2,622.261 |
3 | 2,492.778 |
4 | 2,494.348 |
5 | 2,495.985 |
6 | 2,496.322 |
7 | 2,498.048 |
All three statistics (R-Squared, Adjusted R-Squared, and AIC) point to Model 3 as the best predictive model for LD. Adjusted R-Squared is the highest and AIC is the lowest of all the models reviewed.
The three variables I would select to build a predictive model for LD are speed_air
, aircraft
, and height
.
summary(lm3.17)
##
## Call:
## lm(formula = distance ~ speed_air + aircraft + height, data = FAA_merge)
##
## Residuals:
## Min 1Q Median 3Q Max
## -304.14 -93.22 14.88 91.95 425.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6484.5380 110.1352 -58.88 <2e-16 ***
## speed_air 82.8340 0.9769 84.79 <2e-16 ***
## aircraft 439.0176 20.2013 21.73 <2e-16 ***
## height 14.2244 1.0500 13.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 137.7 on 192 degrees of freedom
## (591 observations deleted due to missingness)
## Multiple R-squared: 0.9751, Adjusted R-squared: 0.9747
## F-statistic: 2504 on 3 and 192 DF, p-value: < 2.2e-16
We can use the R function “StepAIC” to perform forward variable selection and automatically select the best model.
# Load package for stepAIC function
library(MASS)
# Function
stepAIC(lm7.17)
## Start: AIC=1939.82
## distance ~ speed_air + aircraft + height + pitch + duration +
## no_pasg + speed_ground
##
## Df Sum of Sq RSS AIC
## - duration 1 3427 3593003 1938.0
## - speed_ground 1 5011 3594587 1938.1
## - pitch 1 9376 3598952 1938.3
## - no_pasg 1 30384 3619959 1939.5
## <none> 3589575 1939.8
## - speed_air 1 3161032 6750608 2061.6
## - height 1 3410044 6999620 2068.7
## - aircraft 1 7904682 11494257 2165.9
##
## Step: AIC=1938.01
## distance ~ speed_air + aircraft + height + pitch + no_pasg +
## speed_ground
##
## Df Sum of Sq RSS AIC
## - speed_ground 1 6293 3599296 1936.3
## - pitch 1 10041 3603043 1936.6
## - no_pasg 1 32056 3625059 1937.8
## <none> 3593003 1938.0
## - speed_air 1 3250783 6843786 2062.3
## - height 1 3434796 7027798 2067.5
## - aircraft 1 7901833 11494836 2163.9
##
## Step: AIC=1936.35
## distance ~ speed_air + aircraft + height + pitch + no_pasg
##
## Df Sum of Sq RSS AIC
## - pitch 1 8583 3607879 1934.8
## - no_pasg 1 32652 3631948 1936.1
## <none> 3599296 1936.3
## - height 1 3469592 7068888 2066.7
## - aircraft 1 7897070 11496366 2162.0
## - speed_air 1 136189067 139788363 2651.6
##
## Step: AIC=1934.82
## distance ~ speed_air + aircraft + height + no_pasg
##
## Df Sum of Sq RSS AIC
## - no_pasg 1 32041 3639920 1934.5
## <none> 3607879 1934.8
## - height 1 3476420 7084299 2065.1
## - aircraft 1 8871841 12479720 2176.1
## - speed_air 1 136312487 139920365 2649.8
##
## Step: AIC=1934.55
## distance ~ speed_air + aircraft + height
##
## Df Sum of Sq RSS AIC
## <none> 3639920 1934.5
## - height 1 3479278 7119197 2064.0
## - aircraft 1 8953580 12593499 2175.8
## - speed_air 1 136291471 139931391 2647.8
##
## Call:
## lm(formula = distance ~ speed_air + aircraft + height, data = FAA_merge)
##
## Coefficients:
## (Intercept) speed_air aircraft height
## -6484.54 82.83 439.02 14.22
The stepAIC function provides the same model!
Step: AIC=1934.55
distance ~ speed_air + aircraft + height