Statistical Modeling- Assignment 1-Group 3

R Markdown

PART A: Data Loading, Manipulation and Exploration Load required libraries

options(warnings=-1)
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.1.1

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1

## Warning: package 'ggplot2' was built under R version 4.1.2

## Warning: package 'tibble' was built under R version 4.1.1

## Warning: package 'tidyr' was built under R version 4.1.1

## Warning: package 'readr' was built under R version 4.1.1

## Warning: package 'dplyr' was built under R version 4.1.1

## Warning: package 'stringr' was built under R version 4.1.1

## Warning: package 'forcats' was built under R version 4.1.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(readr)
library(readxl)

## Warning: package 'readxl' was built under R version 4.1.1

library(MASS) #for stepAIC

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

*Reading the Data**

faa1<- read.csv('https://raw.githubusercontent.com/njuteresa2019/MS-BA-projects/main/FAA1.csv',header = TRUE, stringsAsFactors=TRUE)

faa2<- read.csv('https://raw.githubusercontent.com/njuteresa2019/MS-BA-projects/main/FAA2.csv', header = TRUE, stringsAsFactors=TRUE)

Step 2. Check the structure of each data set using the “str” function. For each data set, what is the sample size and how many variables? Is there any difference between the two data sets?

FAA1 has 800 observations on 8 variables. FAA2 has 150 observations on 7 variables.

On careful inspection, we observe that the data columns are identical in both the datasets, only difference being FAA has “duration” column which FAA2 doesn’t.

## Step 2
str(faa1)

## 'data.frame':    800 obs. of  8 variables:
##  $ ï..aircraft : Factor w/ 2 levels "airbus","boeing": 2 2 2 2 2 2 2 2 2 2 ...
##  $ duration    : num  98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : int  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...

dim(faa1)

## [1] 800   8

str(faa2)

## 'data.frame':    150 obs. of  7 variables:
##  $ ï..aircraft : Factor w/ 2 levels "airbus","boeing": 2 2 2 2 2 2 2 2 2 2 ...
##  $ no_pasg     : int  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...

dim(faa2)

## [1] 150   7

Step 3. Merge the two data sets. Are there any duplications? Search “check duplicates in r” if you do not know how to check duplications. If the answer is “Yes”, what action you would take?

We performed a full join of faa1, and faa2 on columns : “ï..aircraft” “no_pasg” “speed_ground” “speed_air” “height” “pitch”
“distance”

Full join ensures that the data does not have duplicates. To validate it we check if the #rows of faa and distinct faa are the same and it is.

We may use the dplyr package to do the same analysis. We find a unique dataset with variable n: #occurences of a row with same values across variables. We observe that unique data frame has only 1 as the n value, implying no duplicates.

We checked for duplicates using the duplicate function in dplyr too.

faa<-full_join(faa1,faa2)

## Joining, by = c("ï..aircraft", "no_pasg", "speed_ground", "speed_air", "height", "pitch", "distance")

colnames(faa)

## [1] "ï..aircraft"  "duration"     "no_pasg"      "speed_ground" "speed_air"   
## [6] "height"       "pitch"        "distance"

###check if faa has duplicates###
ifelse(dim(distinct(faa))[1]<dim(faa)[1],1,0)

## [1] 0

unique<-faa%>%group_by_all%>%summarize(n())

## `summarise()` has grouped output by 'ï..aircraft', 'duration', 'no_pasg', 'speed_ground', 'speed_air', 'height', 'pitch'. You can override using the `.groups` argument.

faa %>%  
  duplicated() %>% 
  which()

## integer(0)

Step 4. Check the structure of the combined data set. What is the sample size and how many variables? Provide summary statistics for each variable.

From the summary stats, we observe some issues in our data. I have summarized the 5 most important ones in a descending order of importance. 1. The height column has -ve numbers, which might be a data entry issue and should be rectified. 2. Duration column having NAs- Inputation of NAs with mean/median can be done as the proportion is not much. 3. Speed_air column having NAs and a pretty high proportion. Column needs to be looked into. Studying if the NAs have a pattern or are at random, and studying these inputations can be done. 4. The column distance needs to be studied too as the max value seems unusually high. 5. The column speed ground needs to be studied too as the max value seems unusually high.

dim(faa)

## [1] 850   8

str(faa)

## 'data.frame':    850 obs. of  8 variables:
##  $ ï..aircraft : Factor w/ 2 levels "airbus","boeing": 2 2 2 2 2 2 2 2 2 2 ...
##  $ duration    : num  98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : int  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...

summary(faa)

##  ï..aircraft     duration         no_pasg      speed_ground      speed_air     
##  airbus:450   Min.   : 14.76   Min.   :29.0   Min.   : 27.74   Min.   : 90.00  
##  boeing:400   1st Qu.:119.49   1st Qu.:55.0   1st Qu.: 65.90   1st Qu.: 96.25  
##               Median :153.95   Median :60.0   Median : 79.64   Median :101.15  
##               Mean   :154.01   Mean   :60.1   Mean   : 79.45   Mean   :103.80  
##               3rd Qu.:188.91   3rd Qu.:65.0   3rd Qu.: 92.06   3rd Qu.:109.40  
##               Max.   :305.62   Max.   :87.0   Max.   :141.22   Max.   :141.72  
##               NA's   :50                                       NA's   :642     
##      height           pitch          distance      
##  Min.   :-3.546   Min.   :2.284   Min.   :  34.08  
##  1st Qu.:23.314   1st Qu.:3.642   1st Qu.: 883.79  
##  Median :30.093   Median :4.008   Median :1258.09  
##  Mean   :30.144   Mean   :4.009   Mean   :1526.02  
##  3rd Qu.:36.993   3rd Qu.:4.377   3rd Qu.:1936.95  
##  Max.   :59.946   Max.   :5.927   Max.   :6533.05  
##

Data Cleaning and further exploration

Step 6. Are there abnormal values in the data set? Please refer to the variable dictionary for criteria defining “normal/abnormal” values. Remove the rows that contain any “abnormal values” and report how many rows you have removed.

Criteria for “normal/ abnormal values” includes the duration of the flight being greater than 40 feet, the ground and air speed speed of an aircraft being less than 30mph or higher than 140 mph, the aircraft being at least 6 meters high, and the length of the airport runway is typically less than 6000 feet.

We were able to remove any flight duration that were under 40 minutes and had Nas (as the prportion of NAs were less) and any aircraft height that was less than 0 meters because that is not possible.

Our group decided not to remove any abnormal values when looking at the ground/air speed of the aircraft and the length of the airport runway. We decided to include these because the values could be telling of real events, and in that situation it would be nice to know that these instances occurred.

We decided to include abnormal values that could still technically happen in order to create an accurate model. While they are considered “abnormal”, you could still have high or low ground/air speeds, a longer runway than 6000 feet, and a height less than 6 meters, and these could be telling of plane accidents.

faa <- faa[-which(faa$duration < 40), ]
faa <- faa[-which(faa$height < 0), ]

completeFun <- function(data, desiredCols) {
  completeVec <- complete.cases(data[, desiredCols])
  return(data[completeVec, ])
}

faa<-completeFun(faa, c("duration"))

Step 7. Repeat Step 4.

We now have 790 observations on 8 variables in our data frame. Summary statistics of each variable have been provided.

dim(faa)

## [1] 790   8

str(faa)

## 'data.frame':    790 obs. of  8 variables:
##  $ ï..aircraft : Factor w/ 2 levels "airbus","boeing": 2 2 2 2 2 2 2 2 2 2 ...
##  $ duration    : num  98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : int  53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num  107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num  109 103 NA NA NA ...
##  $ height      : num  27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num  4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num  3370 2988 1145 1664 1050 ...

summary(faa)

##  ï..aircraft     duration         no_pasg       speed_ground   
##  airbus:395   Min.   : 41.95   Min.   :29.00   Min.   : 27.74  
##  boeing:395   1st Qu.:119.76   1st Qu.:55.00   1st Qu.: 65.90  
##               Median :154.26   Median :60.00   Median : 79.70  
##               Mean   :154.89   Mean   :60.11   Mean   : 79.56  
##               3rd Qu.:189.64   3rd Qu.:65.00   3rd Qu.: 92.13  
##               Max.   :305.62   Max.   :87.00   Max.   :141.22  
##                                                                
##    speed_air          height             pitch          distance      
##  Min.   : 90.00   Min.   : 0.08611   Min.   :2.284   Min.   :  41.72  
##  1st Qu.: 96.17   1st Qu.:23.46517   1st Qu.:3.656   1st Qu.: 904.55  
##  Median :101.10   Median :30.15607   Median :4.017   Median :1269.03  
##  Mean   :103.87   Mean   :30.26902   Mean   :4.017   Mean   :1546.48  
##  3rd Qu.:109.56   3rd Qu.:36.97715   3rd Qu.:4.388   3rd Qu.:1960.42  
##  Max.   :141.72   Max.   :59.94596   Max.   :5.927   Max.   :6533.05  
##  NA's   :593

colnames(faa)[1]='aircraft'
faa %>% dplyr::select(aircraft) %>% table()

## .
## airbus boeing 
##    395    395

Step 8. Since you have a small set of variables, you may want to show histograms for all of them.

data<-faa
for (i in 2:8)
{
u<-na.omit(data[,i]) 
plt <-hist(u, xlab=names(data)[i],col=c("pink"),main=NULL,freq = FALSE)
lines(density(u), lwd=5, col='blue')
print(plt)
}

## $breaks
##  [1]  40  60  80 100 120 140 160 180 200 220 240 260 280 300 320
## 
## $counts
##  [1]  14  32  68  85 109 128 109 107  75  28  18  10   5   2
## 
## $density
##  [1] 0.0008860759 0.0020253165 0.0043037975 0.0053797468 0.0068987342
##  [6] 0.0081012658 0.0068987342 0.0067721519 0.0047468354 0.0017721519
## [11] 0.0011392405 0.0006329114 0.0003164557 0.0001265823
## 
## $mids
##  [1]  50  70  90 110 130 150 170 190 210 230 250 270 290 310
## 
## $xname
## [1] "u"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

## $breaks
##  [1] 25 30 35 40 45 50 55 60 65 70 75 80 85 90
## 
## $counts
##  [1]   1   0   3  15  65 128 191 203 123  42  16   2   1
## 
## $density
##  [1] 0.0002531646 0.0000000000 0.0007594937 0.0037974684 0.0164556962
##  [6] 0.0324050633 0.0483544304 0.0513924051 0.0311392405 0.0106329114
## [11] 0.0040506329 0.0005063291 0.0002531646
## 
## $mids
##  [1] 27.5 32.5 37.5 42.5 47.5 52.5 57.5 62.5 67.5 72.5 77.5 82.5 87.5
## 
## $xname
## [1] "u"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

## $breaks
##  [1]  20  30  40  50  60  70  80  90 100 110 120 130 140 150
## 
## $counts
##  [1]   2   9  38  76 126 151 156 125  58  31  14   3   1
## 
## $density
##  [1] 0.0002531646 0.0011392405 0.0048101266 0.0096202532 0.0159493671
##  [6] 0.0191139241 0.0197468354 0.0158227848 0.0073417722 0.0039240506
## [11] 0.0017721519 0.0003797468 0.0001265823
## 
## $mids
##  [1]  25  35  45  55  65  75  85  95 105 115 125 135 145
## 
## $xname
## [1] "u"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

## $breaks
##  [1]  90  95 100 105 110 115 120 125 130 135 140 145
## 
## $counts
##  [1] 40 49 32 30 20  7  8  7  2  1  1
## 
## $density
##  [1] 0.040609137 0.049746193 0.032487310 0.030456853 0.020304569 0.007106599
##  [7] 0.008121827 0.007106599 0.002030457 0.001015228 0.001015228
## 
## $mids
##  [1]  92.5  97.5 102.5 107.5 112.5 117.5 122.5 127.5 132.5 137.5 142.5
## 
## $xname
## [1] "u"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

## $breaks
##  [1]  0  5 10 15 20 25 30 35 40 45 50 55 60
## 
## $counts
##  [1]   5   8  37  80 103 158 147 122  71  42  12   5
## 
## $density
##  [1] 0.001265823 0.002025316 0.009367089 0.020253165 0.026075949 0.040000000
##  [7] 0.037215190 0.030886076 0.017974684 0.010632911 0.003037975 0.001265823
## 
## $mids
##  [1]  2.5  7.5 12.5 17.5 22.5 27.5 32.5 37.5 42.5 47.5 52.5 57.5
## 
## $xname
## [1] "u"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

## $breaks
## [1] 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0
## 
## $counts
## [1]   1  16 113 255 265 118  20   2
## 
## $density
## [1] 0.002531646 0.040506329 0.286075949 0.645569620 0.670886076 0.298734177
## [7] 0.050632911 0.005063291
## 
## $mids
## [1] 2.25 2.75 3.25 3.75 4.25 4.75 5.25 5.75
## 
## $xname
## [1] "u"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

## $breaks
##  [1]    0  500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000
## 
## $counts
##  [1]  42 208 220 134  84  40  27  12  11   5   5   0   1   1
## 
## $density
##  [1] 1.063291e-04 5.265823e-04 5.569620e-04 3.392405e-04 2.126582e-04
##  [6] 1.012658e-04 6.835443e-05 3.037975e-05 2.784810e-05 1.265823e-05
## [11] 1.265823e-05 0.000000e+00 2.531646e-06 2.531646e-06
## 
## $mids
##  [1]  250  750 1250 1750 2250 2750 3250 3750 4250 4750 5250 5750 6250 6750
## 
## $xname
## [1] "u"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

Step 9. Prepare another presentation slide to summarize your findings drawn from the cleaned data set, using no more than five “bullet statements”.

— Summary Slide: Just by looking at the histograms- features pitch, height, speed at ground, number of passengers, duration seems to be symmetric. Pitch symmetric about 4units Height symmetric about 35units Speed at Ground symmetric about 90units Number of passengers symmetric about 65 Duration symmetric about 160units

Variables distance and speed_air appears to be right skewed/positive skewed. Potential reasons for this will be covered in the next bullet points.

—The distance variable appears to be positive skewed with a long tail towards the right- We first would find it important to note that there are some flights that have a landing distance greater than 6000. We again included this because although the runway is typically less than 6000 feet, it could be greater and abnormal values could lead us to believe an accident happened, which would be important to note.

–We would also want to discuss in a slide that there are air/ground speed values that can be considered abnormal. We decided to keep these values in the data for the same reasons as landing distance above, and would want to uncover why these values exist and if it is worth removing them from the dataset.

—Data Issues- There appears to be some problem with the air speed values. The air speed column seeem to have a lot of missing data. We have omitted those and tried to study the distribution.

–While we removed the negative values because it is not possible, there are flights that have an aircraft height of less than 6 meters. We would discuss that we did not remove these values as well because we think they could be indicative of plane accidents or some problem, but would want to dive deeper into what these values mean.

–After cleaning the data, we have equal observations in the 2 aircrafts. We have 395 rows of Boeing and 395 rows of Airbus ensuring an equal representation of both the aircrafts.

Initial analysis for identifying important factors that impact the response variable “landing distance”

Step 10. Compute the pairwise correlation between the landing distance and each factor X. Provide a table that ranks the factors based on the size (absolute value) of the correlation. This table contains three columns: the names of variables, the size of the correlation, the direction of the correlation (positive or negative). We call it Table 1, which will be used for comparison with our analysis later.

Distance and speed_air has a correlation of (0.945) while distance and speed_ground(0.932).This directs that flights with higher speed in both air and land have higher landing distance.

All the other remaining features do not seem to be having any strong correlation with the distance variable.

###correlation values in a table
faa1<-na.omit(faa)
table <- data.frame(cor(faa1[2:8], faa1$distance))
colnames(table) <- c("coefficient")
table <- tibble::rownames_to_column(table, "features") %>% 
  arrange(desc(abs(coefficient))) %>% 
  mutate(direction = ifelse(coefficient < 0, "negative","positive"))
Table1<-table
Table1

Step 11. Show X-Y scatter plots. Do you think the correlation strength observed in these plots is consistent with the values computed in Step 10?

The results from Table 1 as well as the scatter plot resonates with each other and directs towards the variable distance being strongly positively correlated with air and ground speeds and almost uncorrelated with the other variables.

GGally::ggpairs(
  data = faa[-1]
)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 593 rows containing missing values

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 593 rows containing missing values

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 593 rows containing missing values

## Warning: Removed 593 rows containing missing values (geom_point).

## Warning: Removed 593 rows containing missing values (geom_point).

## Warning: Removed 593 rows containing missing values (geom_point).

## Warning: Removed 593 rows containing non-finite values (stat_density).

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 593 rows containing missing values

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 593 rows containing missing values

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 593 rows containing missing values

## Warning: Removed 593 rows containing missing values (geom_point).

## Warning: Removed 593 rows containing missing values (geom_point).

## Warning: Removed 593 rows containing missing values (geom_point).

Step 12. Have you included the airplane make as a possible factor in Steps 10-11? You can code this character variable as 0/1.

Using the aircraft type as a factor, we still pretty much observe the same results as before. The correlation seems to be uniform across the different aircraft classes.

GGally::ggpairs(
  data = faa,  diag = list(continuous =
  "densityDiag", discrete = "barDiag", na = "naDiag"),
   aes(colour = aircraft, alpha = 0.4))

## Warning: Removed 593 rows containing non-finite values (stat_boxplot).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 593 rows containing missing values

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 593 rows containing missing values

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 593 rows containing missing values

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 593 rows containing non-finite values (stat_bin).

## Warning: Removed 593 rows containing missing values (geom_point).

## Warning: Removed 593 rows containing missing values (geom_point).

## Warning: Removed 593 rows containing missing values (geom_point).

## Warning: Removed 593 rows containing non-finite values (stat_density).

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 593 rows containing missing values

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 593 rows containing missing values

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 593 rows containing missing values

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 593 rows containing missing values (geom_point).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 593 rows containing missing values (geom_point).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 593 rows containing missing values (geom_point).

###Regression using a single factor each time Step 13. Regress Y (landing distance) on each of the X variables. Provide a table that ranks the factors based on its significance. The smaller the p-value, the more significant the factor. This table contains three columns: the names of variables, the size of the p-value, the direction of the regression coefficient (positive or negative). We call it Table 2.

mdl_duration <- lm (faa$distance ~ faa$duration)
mdl_speedgrnd <- lm (faa$distance ~ faa$speed_ground)
mdl_height <- lm (faa$distance ~ faa$height)
mdl_pitch <- lm (faa$distance ~ faa$pitch)
mdl_nopasg <- lm (faa$distance ~ faa$no_pasg)
mdl_speedair <- lm (faa$distance ~ faa$speed_air)
mdl_aircraft <- lm (faa$distance ~ faa$aircraft)

mod <- list(mdl_duration, mdl_speedgrnd, mdl_height, 
            mdl_pitch, mdl_nopasg, mdl_speedair, mdl_aircraft)
r2 <- lapply(mod, FUN = function(lin.mod) summary(lin.mod)$coefficients[8])
r3 <- lapply(mod, FUN = function(lin.mod) summary(lin.mod)$coefficients[2])
r4 <- list('duration','speed_ground','height','pitch','no_pasengers','speed_air','aircraft_type')

r <- do.call(rbind, Map(data.frame, features = r4, p_value = r2, Estimate = r3 ))

r %>% 
  mutate(direction = ifelse(Estimate >= 0,'Positive','Negative')) %>% 
  dplyr::select(features,p_value,direction) %>% 
  arrange(p_value) -> table_2

table_2

Step 14. Standardize each X variable. In other words, create a new variable X’= {X-mean(X)}/sd(X). The mean of X’ is 0 and its standard deviation is 1. Regress Y (landing distance) on each of the X’ variables. Provide a table that ranks the factors based on the size of the regression coefficient. The larger the size, the more important the factor. This table contains three columns: the names of variables, the size of the regression coefficient, the direction of the regression coefficient (positive or negative). We call it Table 3.

scaledfaa<-faa %>% mutate_at(c(3:8), ~(scale(.) %>% as.vector))
mdl_duration1 <- lm (scaledfaa$distance ~ faa$duration)
mdl_speedgrnd1 <- lm (scaledfaa$distance ~ faa$speed_ground)
mdl_height1 <- lm (scaledfaa$distance ~ faa$height)
mdl_pitch1 <- lm (scaledfaa$distance ~ faa$pitch)
mdl_nopasg1 <- lm (scaledfaa$distance ~ faa$no_pasg)
mdl_speedair1 <- lm (scaledfaa$distance ~ faa$speed_air)
mdl_aircraft1 <- lm (scaledfaa$distance ~ faa$aircraft)

models_norm1 <- list(mdl_duration1, mdl_speedgrnd1, mdl_height1, 
            mdl_pitch1, mdl_nopasg1, mdl_speedair1, mdl_aircraft1)

coeff_norm1 <- lapply(models_norm1, FUN = function(lin.mod) summary(lin.mod)$coefficients[2])
variable_name <- list('duration','speed_ground','height','pitch','no_pasengers','speed_air','aircraft_type')
tab <- do.call(rbind, Map(data.frame, features = variable_name, Estimate = coeff_norm1 ))

tab %>% 
  mutate(direction = ifelse(Estimate >= 0,'Positive','Negative')) %>%
  arrange(desc(Estimate)) -> table_3

table_3

Step 15. Compare Tables 1,2,3. Are the results consistent? At this point, you will meet with a FAA agent again.

Please provide a single table than ranks all the factors based on their relative importance in determining the landing distance. We call it Table 0.

names(Table1)<-c('features','correlation_coefficient','direction')
names(table_2)<-c('features','p-value','direction')
names(table_3)<-c('features','Regression Coefficient','direction')

merge_1 <- merge(table_2, table_3, by = 'features')
merge_1

merge_2 <- merge(merge_1, Table1, by = 'features')
merge_2

table0<-merge_2
table0 %>% arrange(desc('Regression Coefficient'))

Looking at Table0- a consolidated table with all the covariates, with the correlation value, p-value(univariate models), and the regression coefficient on the standardized dataset, we observe that speed_air, speed_ground, and pitch are the top 3 highly correlated variables.

Check collinearity

Step 16. Compare the regression coefficients of the three models below: Model 1: LD ~ Speed_ground Model 2: LD ~ Speed_air Model 3: LD ~ Speed_ground + Speed_air

Do you observe any significance change and sign change? Check the correlation between Speed_ground and Speed_air. You may want to keep one of them in the model selection. Which one would you pick? Why?

We have seen earlier from Table0, that the variables speed air and speed ground has a positive correlation with distance. When we individually regress distance with them, their regression coefficients are positive and alligns with what we saw before, but when we take them together the regression coefficient of speed_ground becomes negative.

This is a classic problem of multicollinearity- the coefficients are not trustworthy. Please note that multicollinearity doesn’t affect the model prediction. The model still has a good predictive power but not a good interpretability power.

Instead of taking both the variables, I am moving forward with speed_air as it can successfully explain 89% of the variation in distance while speed_ground can only explain around ~74%.

l1 <- lm(distance ~ speed_ground,data=faa)
summary(l1)

## 
## Call:
## lm(formula = distance ~ speed_ground, data = faa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -917.96 -323.28  -84.76  209.54 2399.03 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1792.2597    71.6332  -25.02   <2e-16 ***
## speed_ground    41.9653     0.8752   47.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 473.1 on 788 degrees of freedom
## Multiple R-squared:  0.7448, Adjusted R-squared:  0.7444 
## F-statistic:  2299 on 1 and 788 DF,  p-value: < 2.2e-16

l2 <- lm(distance ~ speed_air,data=faa)
summary(l2)

## 
## Call:
## lm(formula = distance ~ speed_air, data = faa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -801.96 -190.75    4.47  210.72  821.91 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5685.712    203.766  -27.90   <2e-16 ***
## speed_air      81.905      1.952   41.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 285.9 on 195 degrees of freedom
##   (593 observations deleted due to missingness)
## Multiple R-squared:  0.9003, Adjusted R-squared:  0.8998 
## F-statistic:  1761 on 1 and 195 DF,  p-value: < 2.2e-16

l3 <- lm(distance ~ speed_ground +speed_air,data=faa)
model <- list(l1,l2,l3)
summary(l3)

## 
## Call:
## lm(formula = distance ~ speed_ground + speed_air, data = faa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -841.12 -189.66   11.44  217.86  824.03 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -5692.77     203.93 -27.915  < 2e-16 ***
## speed_ground   -12.93      13.41  -0.964    0.336    
## speed_air       94.89      13.62   6.969 4.85e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286 on 194 degrees of freedom
##   (593 observations deleted due to missingness)
## Multiple R-squared:  0.9008, Adjusted R-squared:  0.8997 
## F-statistic: 880.5 on 2 and 194 DF,  p-value: < 2.2e-16

coeffecient <- lapply(model,function(lin.mod) 
if (length(lin.mod$coefficients) == 2) lin.mod$coefficients[2]
else lin.mod$coefficients[2:3])

cor(faa1$speed_air,faa1$speed_ground)

## [1] 0.9896663

cor.test(faa$speed_ground, faa$speed_air, method = "pearson")$estimate

##       cor 
## 0.9896663

Variable selection based on our ranking in Table 0.

Step 17. Suppose in Table 0, the variable ranking is as follows: X1, X2, X3….. Please fit the following six models: Model 1: LD ~ X1 Model 2: LD ~ X1 + X2 Model 3: LD ~ X1 + X2 + X3 ………. Calculate the R-squared for each model. Plot these R-squared values versus the number of variables p. What patterns do you observe?

model0 <- lm(distance ~ 1,data=faa)
model1 <- lm(distance ~ speed_air,data=faa)
model2 <- lm(distance ~ speed_air + aircraft,data=faa)
model3 <- lm(distance ~ speed_air + aircraft + height ,data=faa)
model4 <- lm(distance ~ speed_air + aircraft + height + no_pasg  ,data=faa)
model5 <- lm(distance ~ speed_air + aircraft + height + no_pasg + duration ,data=faa)
model6 <- lm(distance ~ speed_air + aircraft + height + no_pasg + duration + pitch,data=faa)

model0.rsqr <- summary(model0)$r.squared
model1.rsqr <- summary(model1)$r.squared
model2.rsqr <- summary(model2)$r.squared
model3.rsqr <- summary(model3)$r.squared
model4.rsqr <- summary(model4)$r.squared
model5.rsqr <- summary(model5)$r.squared
model6.rsqr <- summary(model6)$r.squared

rsquare <- cbind(c(model0.rsqr, model1.rsqr, model2.rsqr, model3.rsqr, model4.rsqr, model5.rsqr, model6.rsqr), 0:6) 

 colnames(rsquare) <- c("rsquare","variables") 

 rsquare <-  as.data.frame(rsquare)
 
 rsquare %>% 
   ggplot(aes(x = variables, y = rsquare)) + 
   geom_line() + 
   xlab("no. of variables") +
   ylab("R-square") +
   theme_classic()

Step 18. Repeat Step 17 but use adjusted R-squared values instead.

model0.rsqr <- summary(model0)$adj.r.squared
model1.rsqr <- summary(model1)$adj.r.squared
model2.rsqr <- summary(model2)$adj.r.squared
model3.rsqr <- summary(model3)$adj.r.squared
model4.rsqr <- summary(model4)$adj.r.squared
model5.rsqr <- summary(model5)$adj.r.squared
model6.rsqr <- summary(model6)$adj.r.squared

rsquare <- cbind(c(model0.rsqr, model1.rsqr, model2.rsqr, model3.rsqr, model4.rsqr, model5.rsqr, model6.rsqr), 0:6) 

 colnames(rsquare) <- c("adj_rsquare","variables") 

 rsquare <-  as.data.frame(rsquare)
 rsquare$adj_rsquare<-round(rsquare$adj_rsquare,3)
 
 rsquare %>% 
   ggplot(aes(x = variables, y = adj_rsquare)) + 
   geom_line() + 
   geom_text(aes(label = adj_rsquare), size = 4) +
   xlab("no. of variables") +
   ylab("Adjusted R-square") +
   theme_classic()

Step 19. Repeat Step 17 but use AIC values instead.

aic_value <- as.data.frame(AIC(model0,model1,model2,model3,model4,model5,model6))

## Warning in AIC.default(model0, model1, model2, model3, model4, model5, model6):
## models are not all fitted to the same number of observations

print(aic_value)

##        df       AIC
## model0  2 13054.533
## model1  3  2791.420
## model2  4  2639.547
## model3  5  2516.714
## model4  6  2516.546
## model5  7  2518.136
## model6  8  2520.105

model0_AIC <- AIC(model0)
model1_AIC <- AIC(model1)
model2_AIC <- AIC(model2)
model3_AIC <- AIC(model3)
model4_AIC <- AIC(model4)
model5_AIC <- AIC(model5)
model6_AIC <- AIC(model6)

AIC <- cbind(c(model0_AIC, model1_AIC, model2_AIC, model3_AIC, model4_AIC, model5_AIC, model6_AIC),0:6)

colnames(AIC) <- c("AIC","variables")

AIC <- as.data.frame(AIC)

AIC$AIC<-round(AIC$AIC,1)

AIC %>% 
  ggplot(aes(x = variables, y = AIC)) +
  geom_line() +
  geom_text(aes(label = AIC), size = 4) +
  xlab("no. of variables")+
  ylab("AIC") +
  theme_classic()

Conclusions from the R2, adjusted R2 and AIC graphs:

The primary fucntion of all these 3 graphs is to direct towards the number of covariates to be considered in our model. From R2, we observe that as variables increase, the value increases. That would mean that even if we add garbage variables, R2 would be increasing. We cannot determine the #covariates just by studying R2.

From the adjusted R2 plot, we observe that after 4th variable addition, it starts to drop. So we will be considering 3 covariates in our regression.

From AIC plot, we observe that 4 variables has the least AIC and starts to increase after that.

Step 20. Compare the results in Steps 18-19, what variables would you select to build a predictive model for LD?

model4 <- lm(distance ~ speed_air + aircraft + height + no_pasg ,data=faa) provides the least AIC and the best results.

model4 <- lm(distance ~ speed_air + aircraft + height + no_pasg  ,data=faa)
summary(model4)

## 
## Call:
## lm(formula = distance ~ speed_air + aircraft + height + no_pasg, 
##     data = faa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -297.82  -97.26    8.78   86.28  453.45 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -6446.1398   140.6783 -45.822   <2e-16 ***
## speed_air         83.7134     0.9679  86.485   <2e-16 ***
## aircraftboeing   442.2345    20.7094  21.354   <2e-16 ***
## height            14.1054     1.0767  13.100   <2e-16 ***
## no_pasg           -2.0984     1.4398  -1.457    0.147    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 141.3 on 192 degrees of freedom
##   (593 observations deleted due to missingness)
## Multiple R-squared:  0.976,  Adjusted R-squared:  0.9755 
## F-statistic:  1955 on 4 and 192 DF,  p-value: < 2.2e-16

Variable selection based on automate algorithm.

Step 21. Use the R function “StepAIC” to perform forward variable selection. Compare the result with that in Step 19.

faaNoNA <- na.exclude(faa)
#nrow(faaNoNA)
model01 <- lm(distance ~ 1,data=faaNoNA)
model11 <- lm(distance ~ speed_air,data=faaNoNA)
model21 <- lm(distance ~ speed_air + aircraft,data=faaNoNA)
model31 <- lm(distance ~ speed_air + aircraft + height ,data=faaNoNA)
model41 <- lm(distance ~ speed_air + aircraft + height + no_pasg  ,data=faaNoNA)
model51 <- lm(distance ~ speed_air + aircraft + height + no_pasg + duration ,data=faaNoNA)
model61 <- lm(distance ~ speed_air + aircraft + height + no_pasg + duration + pitch,data=faaNoNA)

MASS::stepAIC(model01,direction="forward",scope=list(upper=model61,lower=model01))

## Start:  AIC=2682.54
## distance ~ 1
## 
##             Df Sum of Sq       RSS    AIC
## + speed_air  1 143944957  15942768 2230.4
## + aircraft   1   5775391 154112334 2677.3
## <none>                   159887725 2682.5
## + pitch      1    991324 158896401 2683.3
## + height     1    707080 159180645 2683.7
## + duration   1    370042 159517683 2684.1
## + no_pasg    1    200305 159687420 2684.3
## 
## Step:  AIC=2230.36
## distance ~ speed_air
## 
##            Df Sum of Sq      RSS    AIC
## + aircraft  1   8642406  7300363 2078.5
## + height    1   2868393 13074375 2193.3
## + pitch     1   1062502 14880266 2218.8
## <none>                  15942768 2230.4
## + no_pasg   1    142329 15800439 2230.6
## + duration  1      8648 15934120 2232.2
## 
## Step:  AIC=2078.49
## distance ~ speed_air + aircraft
## 
##            Df Sum of Sq     RSS    AIC
## + height    1   3426505 3873858 1955.7
## <none>                  7300363 2078.5
## + duration  1     52208 7248155 2079.1
## + no_pasg   1     44258 7256104 2079.3
## + pitch     1      3705 7296658 2080.4
## 
## Step:  AIC=1955.65
## distance ~ speed_air + aircraft + height
## 
##            Df Sum of Sq     RSS    AIC
## + no_pasg   1     42386 3831472 1955.5
## <none>                  3873858 1955.7
## + duration  1     10970 3862888 1957.1
## + pitch     1       488 3873370 1957.6
## 
## Step:  AIC=1955.48
## distance ~ speed_air + aircraft + height + no_pasg
## 
##            Df Sum of Sq     RSS    AIC
## <none>                  3831472 1955.5
## + duration  1    7980.7 3823491 1957.1
## + pitch     1     755.5 3830716 1957.5

## 
## Call:
## lm(formula = distance ~ speed_air + aircraft + height + no_pasg, 
##     data = faaNoNA)
## 
## Coefficients:
##    (Intercept)       speed_air  aircraftboeing          height         no_pasg  
##      -6446.140          83.713         442.235          14.105          -2.098

Conclusion We observe that even the automated algorithm gives us the same result as we observed from our AIC plot: Model 4: Distance being regressed upon Speed_air, Air Craft Type, Height and Number of Passengers provides the best result.