Step 0

Set Working Directory

setwd("~/OneNote Notebooks/7042-Stat Modeling/Assignment 1")

Open Libraries

library(readxl)    # To read in datasets
library(tidyverse) # For Data Manipulation
library(knitr)     # To create tables

Initial Exploration of the Data

Step 1

Read the two datasets into the system and name them. We name them for ease of use when calling in later coding exercises and analysis.

# Read in Dataset 1
FAA1 <- read_xls("FAA1-1.xls")
# Read in Dataset 2
FAA2 <- read_xls("FAA2-1.xls")

Step 2

A good first step to begin exploration of the datasets is to check the structure of the datasets.

As seen below, FAA1-1 has 800 observations and 8 variables.

# Observe the structure of Dataset 1
str(FAA1)

## Classes 'tbl_df', 'tbl' and 'data.frame':    800 obs. of  8 variables:
##  $ aircraft    : chr  "boeing" "boeing" "boeing" "boeing" ...
##  $ duration    : num  98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...

As seen below, FAA2-1 has 150 observations and 7 variables.

# Observe the structure of Dataset 2
str(FAA2)

## Classes 'tbl_df', 'tbl' and 'data.frame':    150 obs. of  7 variables:
##  $ aircraft    : chr  "boeing" "boeing" "boeing" "boeing" ...
##  $ no_pasg     : num  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...

The differences between the two datasets (as observed from the R code and output) are listed below:

FAA1 has an additional variable called duration which FAA2 does not have. (All other variables are identical)
FAA2 has much fewer observations than the FAA1 dataset.

Step 3

In this step we will merge the two datasets. I used a left_join to automatically filter out any duplicates that exist between the two datasets. Note that we performed a left_join because the FAA1 dataset has an extra variable and more datapoints.

# Merge Datasets 1 and 2 via left_join.  Join on a combination of columns since a unique ID does not exist.
FAA_merge <- left_join(x = FAA1, y = FAA2)

## Joining, by = c("aircraft", "no_pasg", "speed_ground", "speed_air", "height", "pitch", "distance")

# Oberve an overview of the merged dataset
glimpse(FAA_merge)

## Observations: 800
## Variables: 8
## $ aircraft     <chr> "boeing", "boeing", "boeing", "boeing", "boeing",...
## $ duration     <dbl> 98.47909, 125.73330, 112.01700, 196.82569, 90.095...
## $ no_pasg      <dbl> 53, 69, 61, 56, 70, 55, 54, 57, 61, 56, 61, 54, 5...
## $ speed_ground <dbl> 107.91568, 101.65559, 71.05196, 85.81333, 59.8885...
## $ speed_air    <dbl> 109.32838, 102.85141, NA, NA, NA, NA, NA, NA, NA,...
## $ height       <dbl> 27.41892, 27.80472, 18.58939, 30.74460, 32.39769,...
## $ pitch        <dbl> 4.043515, 4.117432, 4.434043, 3.884236, 4.026096,...
## $ distance     <dbl> 3369.8364, 2987.8039, 1144.9224, 1664.2182, 1050....

Since the new dataset has 800 observations, we can conclude that the two datasets had 100 duplicate observations. Fifty unique observations from the FAA2 were joined with the FAA1 dataset.

Step 4

As performed above, we can check the structure of the combined dataset.

# Observe the structure of the Merged Dataset
str(FAA_merge)

## Classes 'tbl_df', 'tbl' and 'data.frame':    800 obs. of  8 variables:
##  $ aircraft    : chr  "boeing" "boeing" "boeing" "boeing" ...
##  $ duration    : num  98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...

The sample size of the dataset is 800 and the dataset has 8 variables, 7 of which are numeric. Below are the summary statistics for each variable.

# View the summary statistics of the Merged Dataset
summary(FAA_merge)

##    aircraft            duration         no_pasg       speed_ground   
##  Length:800         Min.   : 14.76   Min.   :29.00   Min.   : 27.74  
##  Class :character   1st Qu.:119.49   1st Qu.:55.00   1st Qu.: 65.87  
##  Mode  :character   Median :153.95   Median :60.00   Median : 79.64  
##                     Mean   :154.01   Mean   :60.13   Mean   : 79.54  
##                     3rd Qu.:188.91   3rd Qu.:65.00   3rd Qu.: 92.33  
##                     Max.   :305.62   Max.   :87.00   Max.   :141.22  
##                                                                      
##    speed_air          height           pitch          distance      
##  Min.   : 90.00   Min.   :-3.546   Min.   :2.284   Min.   :  34.08  
##  1st Qu.: 96.16   1st Qu.:23.338   1st Qu.:3.658   1st Qu.: 900.95  
##  Median :100.99   Median :30.147   Median :4.020   Median :1267.44  
##  Mean   :103.83   Mean   :30.122   Mean   :4.018   Mean   :1544.52  
##  3rd Qu.:109.48   3rd Qu.:36.981   3rd Qu.:4.388   3rd Qu.:1960.44  
##  Max.   :141.72   Max.   :59.946   Max.   :5.927   Max.   :6533.05  
##  NA's   :600

Step 5

If we were to identify a few findings at this point in our analysis, below would be the points presented:

The landing distance of the aircraft is typically less than 6,000 feet. We see in this dataset, however, that there exist observations where distance is greater than 6,000 feet. This means that some planes most likely overran their runway.
The landing aircraft is required to be at least 6 meters high at the threshold of the runway. We see from the summary statistics that our data input may not be entirely accurate, since our minimum value of height is a negative number. This value should never be negative.
speed_air contains 600 missing values and should be taken into consideration when performing analysis on this dataset.
It was noted in the Variable Dictionary that the duration of a normal flight should always be greater than 40 minutes; however, there exists an observation whose duration was 14.76 minutes (shown in the summary statistics).
It was noted in the Variable Dictionary that the ground speed and the air speed of an aircraft when passing over the threshold of the runway should be between 30 and 140 MPH to be considered normal. From the summary statistics, we see that there exist abnormal flights in this dataset. That is, there exist observations where speed_ground or speed_air is less than 30 or greater than 140 MPH.

Data Cleaning and Further Exploration

Step 6

As noted in step 5, there are multiple abnormal values in the dataset. A value is defined as “abnormal” in the data dictionary provided.

Height

We see these abnormal flights primarily in the height variable, where we have a negative value in an observation. We can see how many instances exist:

# Filter for observations with height value less than zero
FAA_merge %>% 
  filter(height < 0) %>% 
  nrow()

## [1] 5

There were 5 observations where height was negative. Height should never be negative, due to physics.

Duration

If the duration of the flight is less than or equal to 40 minutes, the flight is considered abnormal. Let us filter for these observations to count how many exist.

# Filter for observations with flight duration value less than 40 minutes
FAA_merge %>% 
  filter(duration <= 40) %>% 
  nrow()

## [1] 5

There were 5 observations where duration was less than or equal to 40 minutes.

Ground Speed and Air Speed

Abnormal flights are also defined as those with speed_air and/or speed_ground less than 30 MPH or greater than 140 MPH.

# Filter for observations with air or ground speed values less than 30 or greater than 140 MPH.
FAA_merge %>% 
  filter(speed_air < 30 | speed_air > 140 |
           speed_ground < 30 | speed_ground > 140) %>% 
  nrow()

## [1] 3

There were 3 observations where speed_air or speed_ground was less than 30 MPH or greater than 140 MPH.

Filtering out identified abnormal observations

Let us filter out these observations identified above and create a new working dataset.

# Filter dataset for only observations deemed normal
FAA_merge <- FAA_merge %>% 
  filter(height >= 0) %>% 
  filter(duration > 40) %>% 
  filter(speed_air >= 30| is.na(speed_air)) %>% 
  filter(speed_air <= 140 | is.na(speed_air)) %>% 
  filter(speed_ground >= 30) %>% 
  filter(speed_ground <= 140) 
 
800 - nrow(FAA_merge)

## [1] 13

Thirteen total observations have been deemed abnormal and have consequently been removed.

Note: We have retained observations with missing values.

Step 7

In step 7, we repeat step 4 but with the dataset that now only contains normal flights. We therefore check the structure of the combined dataset as well as provide summary statistics.

# Observe the structure of the Merged and Filtered Dataset
str(FAA_merge)

## Classes 'tbl_df', 'tbl' and 'data.frame':    787 obs. of  8 variables:
##  $ aircraft    : chr  "boeing" "boeing" "boeing" "boeing" ...
##  $ duration    : num  98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...

# View the summary statistics of the Merged and Filtered Dataset
summary(FAA_merge)

##    aircraft            duration         no_pasg       speed_ground   
##  Length:787         Min.   : 41.95   Min.   :29.00   Min.   : 33.57  
##  Class :character   1st Qu.:119.68   1st Qu.:55.00   1st Qu.: 66.01  
##  Mode  :character   Median :154.24   Median :60.00   Median : 79.73  
##                     Mean   :154.80   Mean   :60.13   Mean   : 79.61  
##                     3rd Qu.:189.62   3rd Qu.:65.00   3rd Qu.: 92.13  
##                     Max.   :305.62   Max.   :87.00   Max.   :136.66  
##                                                                      
##    speed_air          height             pitch          distance      
##  Min.   : 90.00   Min.   : 0.08611   Min.   :2.284   Min.   :  41.72  
##  1st Qu.: 96.16   1st Qu.:23.46558   1st Qu.:3.654   1st Qu.: 902.82  
##  Median :100.99   Median :30.16708   Median :4.014   Median :1267.76  
##  Mean   :103.67   Mean   :30.29378   Mean   :4.015   Mean   :1541.02  
##  3rd Qu.:109.48   3rd Qu.:36.98364   3rd Qu.:4.385   3rd Qu.:1960.41  
##  Max.   :136.42   Max.   :59.94596   Max.   :5.927   Max.   :6309.95  
##  NA's   :591

There are 787 observations and 8 variables in this dataset. The variable speed_air has 591 missing values.

Step 8

Let us create histograms for our 7 numeric variables. These histograms will provide insight regarding the distribution and variability of the observations within each variable. See Step 9 for conclusions regarding these histograms.

par(mfrow = c(4,2))
hist(FAA_merge$duration, main = "duration", xlab = NA)
hist(FAA_merge$no_pasg, main = "no_pasg", xlab = NA)
hist(FAA_merge$speed_ground, main = "speed_ground", xlab = NA)
hist(FAA_merge$speed_air, main = "speed_air", xlab = NA)
hist(FAA_merge$height, main = "height", xlab = NA)
hist(FAA_merge$pitch, main = "pitch", xlab = NA)
hist(FAA_merge$distance, main = "distance", xlab = NA)

Step 9

Another “presentation slide” one might want to present to the FAA agents would include the following points:

Given the fact that our dataset no longer includes flights deemed “abnormal,” we can say that a normal flight’s ground speed on average is around 80 MPH, and the distribution of the variation in ground speed is symmetrical. This is different than the distribution of air speed, which is around 100 MPH but is strongly skewed to the right (meaning there are likely outliers).
The landing distance is highly skewed to the right as well, however very few landings go over 6000 feet.
Apart from speed_air and distance, most variables are evenly distributed. Not only do the histograms show this, but the mean and median (listed in the summary statistics) for each variable are similar.
There exist 206 observations with complete data. The entire dataset contains 787 unique observations.

These are basic observations. Further analysis can break down these statistics by carrier or aircraft size.

Initial Analysis for ID’ing Important Factors that Impact Response Variables

Step 10

In this step we compute the pairwise correlation between the landing distance and each factor X. Since we are not concerned with the correlation between the various X factors, we will manually create this correlation table.

The table below (called Table 1) ranks the factors based on correlation strength (indicated by the absolute value) from strongest to weakest.

# Create a data frame by inputting data and correcting values
FAA_cor <- as.data.frame(cor(y = FAA_merge[,8], x = FAA_merge[,c(2:7)], use = "na.or.complete")) 
FAA_cor <- FAA_cor %>% 
  mutate(X_Variable = rownames(FAA_cor),
         Cor_Direction = "positive",
         Distance = round(abs(distance),3)
  )
FAA_cor[2,3] = "negative"
colnames(FAA_cor)[1] <- "Cor_with_Distance"

# Reorder the columns of the table
FAA_cor <- FAA_cor[,c(2,1,3)]

# Sort the table from strongest to weakest correlation
FAA_cor <- FAA_cor[order(FAA_cor$Cor_with_Distance, decreasing = T),]

#Print the table
kable(FAA_cor, row.names = F, caption = "Table 1", align = 'c', digits = 3)

Table 1
X_Variable	Cor_with_Distance	Cor_Direction
speed_air	0.945	positive
speed_ground	0.932	positive
height	0.086	positive
pitch	0.037	positive
duration	0.037	positive
no_pasg	-0.019	negative

The variables speed_air and speed_ground have the strongest correlation to the response variable distance. Both correlations are positive, meaning that an increase in either air or ground speed correlates to a longer landing distance. The number of passagengers on the plane negatively correlates to the landing distance. That is, smaller planes are correlated to smaller landing distances.

Step 11

We can also display scatterplots of the variables on our dependent variable. These scatterplots are great visual representations for our correlation table. It is clear that Ground Speed (speed_ground) and Air Speed (speed_air) are highly correlated with our response variable, distance. We see this because the datapoints have small variability (less scattered) around a general upward trend. The other variables have no apparent pattern. It is interesting to note that speed_ground has a slight curve to the pattern shown in the scatterplot, which potentially indicates a nonlinear relationship between the X variable and distance.

par(mfrow = c(3,2))
plot(x = FAA_merge$duration, y = FAA_merge$distance, main = "Distance by Duration")
plot(x = FAA_merge$no_pasg, y = FAA_merge$distance, main = "Distance by Passanger Amount")
plot(x = FAA_merge$speed_ground, y = FAA_merge$distance, main = "Distance by Ground Speed")
plot(x = FAA_merge$speed_air, y = FAA_merge$distance, main = "Distance by Air Speed")
plot(x = FAA_merge$height, y = FAA_merge$distance, main = "Distance by Height")
plot(x = FAA_merge$pitch, y = FAA_merge$distance, main = "Distance by Pitch")

Step 12

Above, our analysis only includes numeric variables. We can also include the aircraft variable by coding the character variable as 0/1 binary. If aircraft is equal to 1, it is a boeing aircraft. Otherwise, its make is airbus. After encoding this variable as a binary variable, we can repeat Steps 10 and 11 and include the variable in the table and scatterplots.

# Make `aircraft` binary
FAA_merge$aircraft [FAA_merge$aircraft == "boeing"] <- 1
FAA_merge$aircraft [FAA_merge$aircraft == "airbus"] <- 0
FAA_merge$aircraft <- as.integer(FAA_merge$aircraft)

# Repeat steps:  Make and format a Data Table
FAA_cor.12 <- as.data.frame(cor(y = FAA_merge[,8], x = FAA_merge[,c(1:7)], use = "na.or.complete")) 
FAA_cor.12 <- FAA_cor.12 %>% 
  mutate(X_Variable = rownames(FAA_cor.12),
         Cor_Direction = "positive",
         Distance = round(abs(distance),3)
  )
FAA_cor.12[3,3] = "negative"
colnames(FAA_cor.12)[1] <- "Cor_with_Distance"

# Order the columns and Sort the rows
FAA_cor.12 <- FAA_cor.12[,c(2,1,3)]
FAA_cor.12 <- FAA_cor.12[order(FAA_cor.12$Cor_with_Distance, decreasing = T),]

# Print the Table
kable(FAA_cor.12, row.names = F, caption = "Table 1.1", align = 'c')

Table 1.1
X_Variable	Cor_with_Distance	Cor_Direction
speed_air	0.9452968	positive
speed_ground	0.9317169	positive
aircraft	0.1815418	positive
height	0.0856803	positive
pitch	0.0372342	positive
duration	0.0367128	positive
no_pasg	-0.0187942	negative

We can illustrate again these correlations in Table 1.1 through various paired scatterplots. The aircraft correlation plot is consistent with the positive correlation given in the table. Note that all other scatterplots are the same as shown in step 11.

par(mfrow = c(4,2))
plot(x = FAA_merge$aircraft, y = FAA_merge$distance, main = "Distance by Pitch")
plot(x = FAA_merge$duration, y = FAA_merge$distance, main = "Distance by Duration")
plot(x = FAA_merge$no_pasg, y = FAA_merge$distance, main = "Distance by Passanger Amount")
plot(x = FAA_merge$speed_ground, y = FAA_merge$distance, main = "Distance by Ground Speed")
plot(x = FAA_merge$speed_air, y = FAA_merge$distance, main = "Distance by Air Speed")
plot(x = FAA_merge$height, y = FAA_merge$distance, main = "Distance by Height")
plot(x = FAA_merge$pitch, y = FAA_merge$distance, main = "Distance by Pitch")

Regression Using a Single Factor

Step 13

In step 13, we regress Y on each of the X variables. This means we will have 7 separate linear models (one for each X variable in the dataset). We can summarize these linear models by creating a table (Table 2 below) that ranks these X factors by its significance regarding the response variable. In the table you will find the name of the X Variable, its significance regarding the response variable, and the direction of the factor’s coefficient (indicating a positive or negative relationship).

# Run Linear Models
lm1.13 <- lm(distance ~ aircraft, data = FAA_merge)
lm2.13 <- lm(distance ~ duration, data = FAA_merge)
lm3.13 <- lm(distance ~ no_pasg, data = FAA_merge)
lm4.13 <- lm(distance ~ speed_ground, data = FAA_merge)
lm5.13 <- lm(distance ~ speed_air, data = FAA_merge)
lm6.13 <- lm(distance ~ height, data = FAA_merge)
lm7.13 <- lm(distance ~ pitch, data = FAA_merge)

# Create the table that ranks the factors based on significance
a.13 <- colnames(FAA_merge)[1:7]
b.13 <- c(as.numeric(anova(lm1.13)$'Pr(>F)'[1]),
           as.numeric(anova(lm2.13)$'Pr(>F)'[1]),
           as.numeric(anova(lm3.13)$'Pr(>F)'[1]),
           as.numeric(anova(lm4.13)$'Pr(>F)'[1]),
           as.numeric(anova(lm5.13)$'Pr(>F)'[1]),
           as.numeric(anova(lm6.13)$'Pr(>F)'[1]),
           as.numeric(anova(lm7.13)$'Pr(>F)'[1]))
c.13 <- c(lm1.13$coefficients[2],
       lm2.13$coefficients[2],
       lm3.13$coefficients[2],
       lm4.13$coefficients[2],
       lm5.13$coefficients[2],
       lm6.13$coefficients[2],
       lm7.13$coefficients[2])
d.13 <- c('positive','negative','negative','positive','positive','positive','positive')
pvalues.13 <- data.frame(a.13,b.13,d.13)
colnames(pvalues.13) <- c("X_Variable", "P_Value", "Dir_of_Reg_Coeff")

# Print the table
kable(pvalues.13[order(pvalues.13$P_Value, decreasing = F),],row.names = F, caption = "Table 2", align = 'c')

Table 2
X_Variable	P_Value	Dir_of_Reg_Coeff
speed_ground	0.0000000	positive
speed_air	0.0000000	positive
aircraft	0.0000000	positive
height	0.0003499	positive
pitch	0.0607952	positive
duration	0.1107031	negative
no_pasg	0.5842035	negative

Step 14

We can also standardize each X variable by creating a new variable. This will help us better compare the factors and their impact on the response variable. Standardizing these X variables means we interpret them differently. This output will tell us that with every 1 standard deviation increase in the X Variable, the Y Variable will increase by (1)*(the X’s Regression Coefficient) standard deviations.

# Standardize the X Variables
FAA_merge$aircraft_prime <- (FAA_merge$aircraft - mean(FAA_merge$aircraft)) / sd(FAA_merge$aircraft)
FAA_merge$duration_prime <- (FAA_merge$duration - mean(FAA_merge$duration)) / sd(FAA_merge$duration)
FAA_merge$no_pasg_prime <- (FAA_merge$no_pasg - mean(FAA_merge$no_pasg)) / sd(FAA_merge$no_pasg)
FAA_merge$speed_ground_prime <- (FAA_merge$speed_ground - mean(FAA_merge$speed_ground)) / sd(FAA_merge$speed_ground)
FAA_merge$speed_air_prime <- (FAA_merge$speed_air - mean(FAA_merge$speed_air, na.rm=T)) / sd(FAA_merge$speed_air, na.rm=T)
FAA_merge$height_prime <- (FAA_merge$height - mean(FAA_merge$height)) / sd(FAA_merge$height)
FAA_merge$pitch_prime <- (FAA_merge$pitch - mean(FAA_merge$pitch)) / sd(FAA_merge$pitch)

We now repeat step 13 and Regress Y on each of the X’ (X-Prime) Variables. The resulting summary table will tell us how many standard deviations the Y variable will increase (or decrease, if negative) with one standard deviation increase in the X variable.

# Regress Y on X'
lm1.14 <- lm(distance ~ aircraft_prime, data = FAA_merge)
lm2.14 <- lm(distance ~ duration_prime, data = FAA_merge)
lm3.14 <- lm(distance ~ no_pasg_prime, data = FAA_merge)
lm4.14 <- lm(distance ~ speed_ground_prime, data = FAA_merge)
lm5.14 <- lm(distance ~ speed_air_prime, data = FAA_merge)
lm6.14 <- lm(distance ~ height_prime, data = FAA_merge)
lm7.14 <- lm(distance ~ pitch_prime, data = FAA_merge)

# Create the data frame for our output table
a.14 <- colnames(FAA_merge)[9:15]
c.14 <- c(abs(lm1.14$coefficients[2]),
       abs(lm2.14$coefficients[2]),
       abs(lm3.14$coefficients[2]),
       abs(lm4.14$coefficients[2]),
       abs(lm5.14$coefficients[2]),
       abs(lm6.14$coefficients[2]),
       abs(lm7.14$coefficients[2]))
d.14 <- c('positive','negative','negative','positive','positive','positive','positive')
primevalues.14 <- data.frame(a.14,c.14,d.14)
colnames(primevalues.14) <- c("X_Variable", "Size_of_Reg_Coeff", "Dir_of_Reg_Coeff")

# Print the table
kable(primevalues.14[order(primevalues.14$Size_of_Reg_Coeff, decreasing = T),],row.names = F, caption = "Table 3", align = 'c')

Table 3
X_Variable	Size_of_Reg_Coeff	Dir_of_Reg_Coeff
speed_air_prime	818.06654	positive
speed_ground_prime	799.07204	positive
aircraft_prime	208.81380	positive
height_prime	117.01884	positive
pitch_prime	61.54969	positive
duration_prime	52.37631	negative
no_pasg_prime	17.98305	negative

Step 15

Correlation strength is semi-consistent with regressor significance (specified by its P-value). This makes sense, because we would expect highly correlated regressors with the response variable to also have higher modeling power (note we are not saying that correlation implies causation). We can see in Table 0 the factors based on their relative importance in determining the landing distance.

# Generalize Standardized Table 3
a.14 <- colnames(FAA_merge)[1:7]
primevalues.14 <- data.frame(a.14,c.14,d.14)
colnames(primevalues.14) <- c("X_Variable", "Stdized_Coeff", "Dir_of_Stdized_Coeff")

# Join the Statistics from all 3 tables together
join1 <- full_join(FAA_cor.12, pvalues.13, by= 'X_Variable')

## Warning: Column `X_Variable` joining character vector and factor, coercing
## into character vector

join2 <- full_join(join1, primevalues.14, by= 'X_Variable')

## Warning: Column `X_Variable` joining character vector and factor, coercing
## into character vector

join2 <- join2[order(join2$Stdized_Coeff, decreasing = T),]
join2 <- join2[c(1,3:7,2),]

# Print the Table
kable(join2,row.names = F, caption = "Table 0", align = 'c')

Table 0
X_Variable	Cor_with_Distance	Cor_Direction	P_Value	Dir_of_Reg_Coeff	Stdized_Coeff	Dir_of_Stdized_Coeff
speed_air	0.9452968	positive	0.0000000	positive	818.06654	positive
aircraft	0.1815418	positive	0.0000000	positive	208.81380	positive
height	0.0856803	positive	0.0003499	positive	117.01884	positive
pitch	0.0372342	positive	0.0607952	positive	61.54969	positive
duration	0.0367128	positive	0.1107031	negative	52.37631	negative
no_pasg	-0.0187942	negative	0.5842035	negative	17.98305	negative
speed_ground	0.9317169	positive	0.0000000	positive	799.07204	positive

Note that we put speed_ground last because in the next step (Step 16), we determine to drop this variable from the model.

Check Collinearity

Step 16

We are asked to consider the regression coefficients of the three models:

# Create the linear models
lm1 <- lm(distance ~ speed_ground, data = FAA_merge)
lm2 <- lm(distance ~ speed_air, data = FAA_merge)
lm3 <- lm(distance ~ speed_ground + speed_air, data = FAA_merge)

# Print the Coefficients of each model
par(mfrow=c(1,3))
kable(lm1$coefficients, caption = "Model 1", col.names = c("Coefficient"), digits = 1, format.args = list(big.mark = ','), align = "c")

Model 1
	Coefficient
(Intercept)	-1,810.1
speed_ground	42.1

kable(lm2$coefficients, caption = "Model 2", col.names = c("Coefficient"), digits = 1, format.args = list(big.mark = ','), align = "c")

Model 2
	Coefficient
(Intercept)	-5,568.5
speed_air	80.7

kable(lm3$coefficients, caption = "Model 3", col.names = c("Coefficient"), digits = 1, format.args = list(big.mark = ','), align = "c")

Model 3
	Coefficient
(Intercept)	-5,576.4
speed_ground	-12.1
speed_air	92.9

When we compare these three models, we see that speed_ground not only changes drastically in size, but also sign. The variable speed_air also changes quite a bit, but not nearly as much as speed_ground. Adding or removing covariates from a regression model should not drastically impact the value of the regressor coefficient. If it does, it indicates that multicollinearity exists.

Let’s look at the collinearity between the two covariates used.

cor(FAA_merge[,4], FAA_merge[,5], use = "na.or.complete")

##              speed_air
## speed_ground  0.988969

These two variables are highly correlated, with an R value of 0.989. Therefore only one of the two variables should be in the final model to avoid multicollinearity. We can drop the variable with the higer variance inflation factor.

vif(lm3)

## speed_ground    speed_air 
##     45.57813     45.57813

Since each have the same VIF, we can then drop the X variable that is less correlated with the response variable.

cor(FAA_merge[,c(4:5,8)], use = "na.or.complete")

##              speed_ground speed_air  distance
## speed_ground    1.0000000 0.9889690 0.9317169
## speed_air       0.9889690 1.0000000 0.9452968
## distance        0.9317169 0.9452968 1.0000000

Since speed_ground is less correlated than speed_air with distance (albeit, the difference is small), we can choose to drop speed_ground from the model.

Variable Selection based on our ranking in Table 0.

Step 17

We will now fit multiple models, each model having one more covariate than the preceding model.

lm1.17 <- lm(distance ~ speed_air, data = FAA_merge)
lm2.17 <- lm(distance ~ speed_air + aircraft, data = FAA_merge)
lm3.17 <- lm(distance ~ speed_air + aircraft + height, data = FAA_merge)
lm4.17 <- lm(distance ~ speed_air + aircraft + height + pitch, data = FAA_merge)
lm5.17 <- lm(distance ~ speed_air + aircraft + height + pitch + duration, data = FAA_merge)
lm6.17 <- lm(distance ~ speed_air + aircraft + height + pitch + duration + no_pasg, data = FAA_merge)
lm7.17 <- lm(distance ~ speed_air + aircraft + height + pitch + duration + no_pasg + speed_ground, data = FAA_merge)

We next can see (below) a common pattern where adding more covariates to the multiple linear regression model results in a higher R-squared.

# Calculate the R-Squared Values
rsqd <- c(summary(lm1.17)$r.squared,
       summary(lm2.17)$r.squared,
       summary(lm3.17)$r.squared,
       summary(lm4.17)$r.squared,
       summary(lm5.17)$r.squared,
       summary(lm6.17)$r.squared,
       summary(lm7.17)$r.squared
       )
cov <- c(1:7)

# Plot
plot(x = cov, y = rsqd, main = "R-Squared as a Function of Covariate Quantity", 
     xlab = "Number of Covariates", ylab = "R-Squared", type = "b")

# Table
kable(cbind(cov, rsqd), digits = 4, align = "c")

cov	rsqd
1	0.8936
2	0.9513
3	0.9751
4	0.9751
5	0.9752
6	0.9754
7	0.9754

Step 18

Let us plot the number of covariates against another statistic, Adjusted R-Squared. This statistic differs from regular R-Squared in that it penalizes models that have a larger number of covariates.

# Calculate the Adjusted R-Squared Values
adj.rsqd <- c(summary(lm1.17)$adj.r.squared,
          summary(lm2.17)$adj.r.squared,
          summary(lm3.17)$adj.r.squared,
          summary(lm4.17)$adj.r.squared,
          summary(lm5.17)$adj.r.squared,
          summary(lm6.17)$adj.r.squared,
          summary(lm7.17)$adj.r.squared
          )

# Plot
plot(x = cov, y = adj.rsqd, main = "Adjusted R-Squared as a Function of Covariate
     Quantity", xlab = "Number of Covariates", ylab = "Adjusted R-Squared", type = "b")

# Table
kable(cbind(cov, adj.rsqd), digits = 4, align = "c")

cov	adj.rsqd
1	0.8930
2	0.9507
3	0.9747
4	0.9746
5	0.9745
6	0.9746
7	0.9745

Here we actually see adjusted R-Squared fluctuate with ever additional covariate added, rather than before with the regular R-Squared when the trend was strictly increasing. We included a table of values to better show this.

Step 19

Finally, let us plot the number of covariates against yet another statistic, AIC. The smaller the AIC, the better the fit the model is to the data.

# Calculate the AIC Values
aic.fig <- c(AIC(lm1.17),
              AIC(lm2.17),
              AIC(lm3.17),
              AIC(lm4.17),
              AIC(lm5.17),
              AIC(lm6.17),
              AIC(lm7.17)
        )

# Plot
plot(x = cov, y = aic.fig, xlab = "Number of Covariates", ylab = "AIC", type = "b")

# Table
kable(cbind(cov, aic.fig), digits = 4, align = "c", format.args = list(big.mark = ","))

cov	aic.fig
1	2,773.274
2	2,622.261
3	2,492.778
4	2,494.348
5	2,495.985
6	2,496.322
7	2,498.048

Step 20

All three statistics (R-Squared, Adjusted R-Squared, and AIC) point to Model 3 as the best predictive model for LD. Adjusted R-Squared is the highest and AIC is the lowest of all the models reviewed.

The three variables I would select to build a predictive model for LD are speed_air, aircraft, and height.

summary(lm3.17)

## 
## Call:
## lm(formula = distance ~ speed_air + aircraft + height, data = FAA_merge)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -304.14  -93.22   14.88   91.95  425.03 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6484.5380   110.1352  -58.88   <2e-16 ***
## speed_air      82.8340     0.9769   84.79   <2e-16 ***
## aircraft      439.0176    20.2013   21.73   <2e-16 ***
## height         14.2244     1.0500   13.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 137.7 on 192 degrees of freedom
##   (591 observations deleted due to missingness)
## Multiple R-squared:  0.9751, Adjusted R-squared:  0.9747 
## F-statistic:  2504 on 3 and 192 DF,  p-value: < 2.2e-16

Variable Selection based on Automatic Algorithm

Step 21

We can use the R function “StepAIC” to perform forward variable selection and automatically select the best model.

# Load package for stepAIC function
library(MASS)

# Function
stepAIC(lm7.17)

## Start:  AIC=1939.82
## distance ~ speed_air + aircraft + height + pitch + duration + 
##     no_pasg + speed_ground
## 
##                Df Sum of Sq      RSS    AIC
## - duration      1      3427  3593003 1938.0
## - speed_ground  1      5011  3594587 1938.1
## - pitch         1      9376  3598952 1938.3
## - no_pasg       1     30384  3619959 1939.5
## <none>                       3589575 1939.8
## - speed_air     1   3161032  6750608 2061.6
## - height        1   3410044  6999620 2068.7
## - aircraft      1   7904682 11494257 2165.9
## 
## Step:  AIC=1938.01
## distance ~ speed_air + aircraft + height + pitch + no_pasg + 
##     speed_ground
## 
##                Df Sum of Sq      RSS    AIC
## - speed_ground  1      6293  3599296 1936.3
## - pitch         1     10041  3603043 1936.6
## - no_pasg       1     32056  3625059 1937.8
## <none>                       3593003 1938.0
## - speed_air     1   3250783  6843786 2062.3
## - height        1   3434796  7027798 2067.5
## - aircraft      1   7901833 11494836 2163.9
## 
## Step:  AIC=1936.35
## distance ~ speed_air + aircraft + height + pitch + no_pasg
## 
##             Df Sum of Sq       RSS    AIC
## - pitch      1      8583   3607879 1934.8
## - no_pasg    1     32652   3631948 1936.1
## <none>                     3599296 1936.3
## - height     1   3469592   7068888 2066.7
## - aircraft   1   7897070  11496366 2162.0
## - speed_air  1 136189067 139788363 2651.6
## 
## Step:  AIC=1934.82
## distance ~ speed_air + aircraft + height + no_pasg
## 
##             Df Sum of Sq       RSS    AIC
## - no_pasg    1     32041   3639920 1934.5
## <none>                     3607879 1934.8
## - height     1   3476420   7084299 2065.1
## - aircraft   1   8871841  12479720 2176.1
## - speed_air  1 136312487 139920365 2649.8
## 
## Step:  AIC=1934.55
## distance ~ speed_air + aircraft + height
## 
##             Df Sum of Sq       RSS    AIC
## <none>                     3639920 1934.5
## - height     1   3479278   7119197 2064.0
## - aircraft   1   8953580  12593499 2175.8
## - speed_air  1 136291471 139931391 2647.8

## 
## Call:
## lm(formula = distance ~ speed_air + aircraft + height, data = FAA_merge)
## 
## Coefficients:
## (Intercept)    speed_air     aircraft       height  
##    -6484.54        82.83       439.02        14.22

The stepAIC function provides the same model!

Step: AIC=1934.55
distance ~ speed_air + aircraft + height

Stat Model - HW1

Sam Rickett

1/20/2020