Packages Used

library(readxl)   
library(knitr)     
library(dplyr)
library(tidyverse)
library(faraway)
library(ggplot2)
library(nnet)

Variable Descriptions

Aircraft: The make of an aircraft (Boeing or Airbus).
Duration (in minutes): Flight duration between takeoff and landing. The duration of a normal flight should always be greater than 40 minutes.
No_pasg: The number of passengers in a flight.
Speed_ground (in miles per hour): The ground speed of an aircraft when passing over the threshold of the runway. If its value is less than 30 mph or greater than 140 mph, the landing is considered abnormal.
Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of the runway. If its value is less than 30 mph or greater than 140 mph, the landing is considered abnormal.
Height (in meters): The height of an aircraft when it is passing over the threshold of the runway. The landing aircraft is required to be at least 6 meters high at the threshold of the runway.
Pitch (in degrees): Pitch angle of an aircraft when it is passing over the threshold of the runway.
Distance (in feet): The landing distance of an aircraft. More specifically, it refers to the distance between the threshold of the runway and the point where the aircraft can be fully stopped. The length of the airport runway is typically less than 6000 feet.

Steps 1-9 of Part-1 of the Project:

Step 1: Reading the datasets

Read the two files ‘FAA-1.xls’ (800 flights) and ‘FAA-2.xls’ into your R system. Please search “Read Excel files from R” in Google in case you do not know how to do that.

knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
setwd("C:/Users/renju/OneDrive/Documents/MSBANA/Spring/Spring Term 1/Statistical Modelling - 7042/Week 2")
# Read in Dataset 1
FAA1 <- read_xls("FAA1.xls")
# Read in Dataset 2
FAA2 <- read_xls("FAA2.xls")

Step 2: Let us look at the structure and dimensions of the datasets and check whether the data types and column names are appropriate.

## tibble [800 × 8] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : chr [1:800] "boeing" "boeing" "boeing" "boeing" ...
##  $ duration    : num [1:800] 98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num [1:800] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:800] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:800] 109 103 NA NA NA ...
##  $ height      : num [1:800] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:800] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:800] 3370 2988 1145 1664 1050 ...

We can observe that there are 800 observations and 8 variables in the FAA1 dataset.

## tibble [150 × 7] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : chr [1:150] "boeing" "boeing" "boeing" "boeing" ...
##  $ no_pasg     : num [1:150] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:150] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:150] 109 103 NA NA NA ...
##  $ height      : num [1:150] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:150] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:150] 3370 2988 1145 1664 1050 ...

We can observe that there are 150 observations and 7 variables in the FAA2 dataset. It has no duration column, so we create an empty (NA) duration column in order to merge the datasets.
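As a minimal sketch of this step, the empty duration column could be added as follows. The two small data frames are made-up stand-ins for FAA1 and FAA2 (their names and values are assumptions for illustration, not the real data).

```r
# Toy stand-ins for the two files (values are invented for illustration).
faa1_toy <- data.frame(aircraft = c("boeing", "airbus"),
                       duration = c(98.5, 125.7),
                       no_pasg  = c(53, 69))
faa2_toy <- data.frame(aircraft = c("boeing", "airbus"),
                       no_pasg  = c(53, 61))

# Add the missing column as NA so both data frames share the same columns.
faa2_toy$duration <- NA_real_

# Reorder the columns to match the first data frame.
faa2_toy <- faa2_toy[, names(faa1_toy)]

# Inspect structure and dimensions, as in Step 2.
str(faa1_toy)
str(faa2_toy)
dim(faa2_toy)
```

Using NA_real_ rather than a plain NA keeps the new column numeric, so it matches the type of duration in the first data frame.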

Step 3: Merging the datasets and checking for duplicate records.

## [1] 950   8
## [1] 100

We can observe that the merged dataset has 950 rows, of which 100 are duplicate observations. After removing them, 850 unique observations remain.
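The merge and the duplicate check could be sketched as below. The toy data frames are invented, and the choice to compare rows on every column except duration (which is empty in the second file) is an assumption about how duplicates were identified.

```r
# Toy stand-ins for FAA1 and FAA2 (values are invented for illustration).
faa1_toy <- data.frame(
  aircraft = c("boeing", "airbus", "boeing"),
  duration = c(98.5, 125.7, 112.0),
  no_pasg  = c(53, 69, 61),
  distance = c(3370, 2988, 1145)
)
faa2_toy <- data.frame(
  aircraft = c("boeing", "airbus"),
  duration = NA_real_,            # empty column added before merging
  no_pasg  = c(53, 70),
  distance = c(3370, 1050)
)

# Stack the two data frames.
merged <- rbind(faa1_toy, faa2_toy)
dim(merged)

# Assumption: a duplicate is a row that matches an earlier row
# on all columns other than duration.
key_cols <- setdiff(names(merged), "duration")
is_dup   <- duplicated(merged[, key_cols])
sum(is_dup)

# Keep the first occurrence of each flight.
merged_unique <- merged[!is_dup, ]
dim(merged_unique)
```

Because duplicated() flags only the later copies, the row from the first file (which has a duration value) is the one that is kept.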

Step 4: Let us look at the structure of the merged dataset, dimension and summary statistics for each variable.

## tibble [850 × 8] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : chr [1:850] "boeing" "boeing" "boeing" "boeing" ...
##  $ duration    : num [1:850] 98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num [1:850] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:850] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:850] 109 103 NA NA NA ...
##  $ height      : num [1:850] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:850] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:850] 3370 2988 1145 1664 1050 ...

We can observe that there are 850 observations and 8 variables in the merged dataset.

Summary statistics for each variable:

##    aircraft            duration         no_pasg      speed_ground   
##  Length:850         Min.   : 14.76   Min.   :29.0   Min.   : 27.74  
##  Class :character   1st Qu.:119.49   1st Qu.:55.0   1st Qu.: 65.90  
##  Mode  :character   Median :153.95   Median :60.0   Median : 79.64  
##                     Mean   :154.01   Mean   :60.1   Mean   : 79.45  
##                     3rd Qu.:188.91   3rd Qu.:65.0   3rd Qu.: 92.06  
##                     Max.   :305.62   Max.   :87.0   Max.   :141.22  
##                     NA's   :50                                      
##    speed_air          height           pitch          distance      
##  Min.   : 90.00   Min.   :-3.546   Min.   :2.284   Min.   :  34.08  
##  1st Qu.: 96.25   1st Qu.:23.314   1st Qu.:3.642   1st Qu.: 883.79  
##  Median :101.15   Median :30.093   Median :4.008   Median :1258.09  
##  Mean   :103.80   Mean   :30.144   Mean   :4.009   Mean   :1526.02  
##  3rd Qu.:109.40   3rd Qu.:36.993   3rd Qu.:4.377   3rd Qu.:1936.95  
##  Max.   :141.72   Max.   :59.946   Max.   :5.927   Max.   :6533.05  
##  NA's   :642
## [1] 850   8

Step 5:

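  1. A few observations have abnormal values (for example, the minimum duration of 14.76 is below 40 minutes, the minimum height is negative, and the maximum distance of 6533.05 exceeds 6000 feet). These are removed later in the analysis.
  2. The duration and speed_air columns have missing values (50 and 642 NAs, respectively).
  3. The data are balanced, as the two aircraft makes appear in almost the same proportion.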

Step 6:

Checking for abnormal values in the variables of the dataset and removing them.

## [1] 850   8

Filtering out the observations with abnormal values removes 19 observations, leaving the 831 observations shown in Step 7.
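A filter based on the thresholds in the variable descriptions might look like the sketch below. The toy data are invented, and keeping rows whose duration or speed_air is missing is an assumption, made because both columns still contain NAs after cleaning.

```r
# Toy data with a few deliberately abnormal rows (values are invented).
faa_toy <- data.frame(
  duration     = c(98.5, 14.8, NA, 150.2, 120.0, 200.1),
  speed_ground = c(107.9, 80.0, 71.1, 141.2, 85.8, 95.0),
  speed_air    = c(109.0, NA, NA, 141.7, NA, 101.0),
  height       = c(27.4, 30.0, 18.6, 35.0, -3.5, 40.0),
  distance     = c(3370, 1500, 1145, 6533, 1664, 2500)
)

# Thresholds taken from the variable descriptions.
# Assumption: a missing duration or speed_air does not make a row abnormal.
keep <- with(faa_toy,
  (is.na(duration) | duration > 40) &
  speed_ground >= 30 & speed_ground <= 140 &
  (is.na(speed_air) | (speed_air >= 30 & speed_air <= 140)) &
  height >= 6 &
  distance < 6000
)

faa_clean <- faa_toy[keep, ]

# Number of rows removed and dimensions of the cleaned data.
sum(!keep)
dim(faa_clean)
```

The is.na() guards matter: without them a comparison such as duration > 40 returns NA for missing values, and indexing with NA would produce rows filled with NA instead of dropping or keeping the row.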

Step 7:

Let us again look at the structure, dimensions, and summary statistics of the cleaned dataset. Note that aircraft is now coded numerically as 0/1.

## tibble [831 × 8] (S3: tbl_df/tbl/data.frame)
##  $ aircraft    : num [1:831] 0 0 0 0 0 0 0 0 0 0 ...
##  $ duration    : num [1:831] 98.5 125.7 112 196.8 90.1 ...
##  $ no_pasg     : num [1:831] 53 69 61 56 70 55 54 57 61 56 ...
##  $ speed_ground: num [1:831] 107.9 101.7 71.1 85.8 59.9 ...
##  $ speed_air   : num [1:831] 109 103 NA NA NA ...
##  $ height      : num [1:831] 27.4 27.8 18.6 30.7 32.4 ...
##  $ pitch       : num [1:831] 4.04 4.12 4.43 3.88 4.03 ...
##  $ distance    : num [1:831] 3370 2988 1145 1664 1050 ...
##     aircraft         duration         no_pasg       speed_ground   
##  Min.   :0.0000   Min.   : 41.95   Min.   :29.00   Min.   : 33.57  
##  1st Qu.:0.0000   1st Qu.:119.63   1st Qu.:55.00   1st Qu.: 66.20  
##  Median :1.0000   Median :154.28   Median :60.00   Median : 79.79  
##  Mean   :0.5343   Mean   :154.78   Mean   :60.06   Mean   : 79.54  
##  3rd Qu.:1.0000   3rd Qu.:189.66   3rd Qu.:65.00   3rd Qu.: 91.91  
##  Max.   :1.0000   Max.   :305.62   Max.   :87.00   Max.   :132.78  
##                   NA's   :50                                       
##    speed_air          height           pitch          distance      
##  Min.   : 90.00   Min.   : 6.228   Min.   :2.284   Min.   :  41.72  
##  1st Qu.: 96.23   1st Qu.:23.530   1st Qu.:3.640   1st Qu.: 893.28  
##  Median :101.12   Median :30.167   Median :4.001   Median :1262.15  
##  Mean   :103.48   Mean   :30.458   Mean   :4.005   Mean   :1522.48  
##  3rd Qu.:109.36   3rd Qu.:37.004   3rd Qu.:4.370   3rd Qu.:1936.63  
##  Max.   :132.91   Max.   :59.946   Max.   :5.927   Max.   :5381.96  
##  NA's   :628

Step 8: Let us plot histograms for all the numerical variables.
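One way to draw these histograms is sketched below with base graphics. The data frame is simulated purely for illustration (its column names mirror a few of the FAA variables, but the values are random), so the shapes will not match the real plots.

```r
# Simulated stand-in data (random values, not the FAA data).
set.seed(1)
faa_toy <- data.frame(
  speed_ground = rnorm(200, mean = 80, sd = 19),
  height       = rnorm(200, mean = 30, sd = 10),
  distance     = rlnorm(200, meanlog = 7, sdlog = 0.6)
)

# Select the numeric columns.
num_cols <- names(faa_toy)[sapply(faa_toy, is.numeric)]

# One panel per variable, side by side.
op <- par(mfrow = c(1, length(num_cols)))
for (col in num_cols) {
  hist(faa_toy[[col]], main = col, xlab = col)
}
par(op)  # restore the previous graphics settings
```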

Step 9:

  1. We can observe from the histograms that the columns duration, no_pasg, speed_ground, height and pitch are almost symmetric.
  2. The speed_air column is right skewed.
  3. The distance column is also right skewed.
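These observations can be cross-checked against the Step 7 summary statistics, since a mean noticeably above the median is a rough sign of right skew. The sketch below uses the medians and means reported in Step 7. The relative gap it computes is only an informal indicator, not a formal skewness test.

```r
# Medians and means copied from the Step 7 summary output.
step7 <- data.frame(
  variable = c("duration", "no_pasg", "speed_ground", "speed_air",
               "height", "pitch", "distance"),
  median   = c(154.28, 60.00, 79.79, 101.12, 30.167, 4.001, 1262.15),
  mean     = c(154.78, 60.06, 79.54, 103.48, 30.458, 4.005, 1522.48)
)

# Relative gap between mean and median; positive values hint at right skew.
step7$rel_gap <- (step7$mean - step7$median) / step7$median

# Sort from largest to smallest gap.
step7_sorted <- step7[order(-step7$rel_gap), ]
step7_sorted
```

The two largest gaps belong to distance and speed_air, which agrees with the histograms.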

Part 3

Question -1:

We create a multinomial variable Y from the landing distance, attach it to the dataset, and discard the original distance variable, since it is now encoded in Y. Y has three categories: Y = 1 (distance < 1000), Y = 2 (1000 <= distance < 2500), and Y = 3 (distance >= 2500). The variables duration and speed_air are also dropped because they contain missing values. Rows with missing values are excluded from any fit that uses those variables, so candidate models would be fitted to different subsets of the data and their Akaike Information Criterion (AIC) values would not be comparable during model selection.
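The three-way split can be sketched with cut(). The distance values below are invented to exercise the boundaries, and the use of cut() itself is an assumption, since the original chunk is not shown.

```r
# Invented distances chosen to hit the category boundaries.
distance <- c(500, 999.9, 1000, 1800, 2499.9, 2500, 5000)

# right = FALSE makes the intervals closed on the left:
# [-Inf, 1000) -> 1, [1000, 2500) -> 2, [2500, Inf) -> 3
Y <- cut(distance,
         breaks = c(-Inf, 1000, 2500, Inf),
         labels = c(1, 2, 3),
         right  = FALSE)

# Show each distance next to its category.
data.frame(distance, Y)
table(Y)
```

With the default right = TRUE, a distance of exactly 1000 or 2500 would fall into the lower category, which would not match the stated cut-offs.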

Let us now build a model on the multinomial data. A multinomial logistic regression model (mmod) is built using the modified dataset, with all remaining variables as predictors of Y. The model summary shows the estimated coefficients and their standard errors. To simplify the model, stepwise selection (step(mmod)) is then applied to choose a model based on AIC. The reduced model (mmodi) is expected to balance fit and complexity, since a lower AIC indicates a better trade-off between the two.
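The workflow just described can be sketched on simulated data as follows. Everything in the data frame is made up for illustration: speed_ground drives the response by construction and noise is an irrelevant predictor. The object names mmod_sim and mmodi_sim are hypothetical counterparts of mmod and mmodi.

```r
library(nnet)  # provides multinom()

# Simulated data (not the FAA data): speed_ground matters, noise does not.
set.seed(42)
n   <- 300
sim <- data.frame(
  speed_ground = rnorm(n, mean = 80, sd = 19),
  noise        = rnorm(n)
)
latent <- 0.15 * (sim$speed_ground - 80) + rlogis(n)
sim$Y  <- cut(latent, breaks = c(-Inf, -1, 2, Inf), labels = c(1, 2, 3))

# Full multinomial logistic regression with all predictors.
mmod_sim <- multinom(Y ~ ., data = sim, trace = FALSE)

# Backward stepwise selection by AIC.
mmodi_sim <- step(mmod_sim, trace = 0)

# Compare the two fits.
c(full = mmod_sim$AIC, reduced = mmodi_sim$AIC)
coef(mmodi_sim)
```

Setting trace = FALSE in multinom() and trace = 0 in step() suppresses the iteration log and the "trying" messages that appear in the output below.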

## # weights:  21 (12 variable)
## initial  value 912.946812 
## iter  10 value 550.235799
## iter  20 value 230.358678
## iter  30 value 213.877248
## iter  40 value 212.913549
## iter  50 value 212.082977
## iter  50 value 212.082976
## iter  50 value 212.082976
## final  value 212.082976 
## converged
## Call:
## multinom(formula = Y ~ ., data = FAA)
## 
## Coefficients:
##   (Intercept)  aircraft     no_pasg speed_ground    height      pitch
## 2   -17.01063 -4.113469 -0.02504101    0.2497376 0.1501956 -0.2454421
## 3  -131.64017 -9.245194 -0.02461303    1.2720814 0.4078697  1.2926411
## 
## Std. Errors:
##   (Intercept)  aircraft    no_pasg speed_ground     height     pitch
## 2  2.06989910 0.4253515 0.01745237   0.02010602 0.01741352 0.2636494
## 3  0.03667468 0.8835064 0.05666093   0.04211307 0.04645304 0.7289671
## 
## Residual Deviance: 424.166 
## AIC: 448.166

Let us select the model based on AIC.

## Start:  AIC=448.17
## Y ~ aircraft + no_pasg + speed_ground + height + pitch
## 
## trying - aircraft 
## # weights:  18 (10 variable)
## initial  value 912.946812 
## iter  10 value 498.723154
## iter  20 value 311.411199
## iter  30 value 310.489234
## final  value 310.455711 
## converged
## trying - no_pasg 
## # weights:  18 (10 variable)
## initial  value 912.946812 
## iter  10 value 399.910023
## iter  20 value 227.526993
## iter  30 value 219.561515
## iter  40 value 213.131118
## final  value 213.129119 
## converged
## trying - speed_ground 
## # weights:  18 (10 variable)
## initial  value 912.946812 
## iter  10 value 760.358638
## final  value 755.295778 
## converged
## trying - height 
## # weights:  18 (10 variable)
## initial  value 912.946812 
## iter  10 value 415.116775
## iter  20 value 280.800027
## iter  30 value 279.414846
## iter  40 value 279.375833
## final  value 279.375822 
## converged
## trying - pitch 
## # weights:  18 (10 variable)
## initial  value 912.946812 
## iter  10 value 393.555153
## iter  20 value 226.292093
## iter  30 value 219.449254
## iter  40 value 214.431376
## final  value 214.431034 
## converged
##                Df       AIC
## - no_pasg      10  446.2582
## <none>         12  448.1660
## - pitch        10  448.8621
## - height       10  578.7516
## - aircraft     10  640.9114
## - speed_ground 10 1530.5916
## # weights:  18 (10 variable)
## initial  value 912.946812 
## iter  10 value 399.910023
## iter  20 value 227.526993
## iter  30 value 219.561515
## iter  40 value 213.131118
## final  value 213.129119 
## converged
## 
## Step:  AIC=446.26
## Y ~ aircraft + speed_ground + height + pitch
## 
## trying - aircraft 
## # weights:  15 (8 variable)
## initial  value 912.946812 
## iter  10 value 371.812672
## iter  20 value 311.876265
## iter  30 value 311.256957
## final  value 311.208398 
## converged
## trying - speed_ground 
## # weights:  15 (8 variable)
## initial  value 912.946812 
## iter  10 value 755.684263
## final  value 755.576400 
## converged
## trying - height 
## # weights:  15 (8 variable)
## initial  value 912.946812 
## iter  10 value 330.387665
## iter  20 value 282.022563
## iter  30 value 280.226638
## final  value 280.107902 
## converged
## trying - pitch 
## # weights:  15 (8 variable)
## initial  value 912.946812 
## iter  10 value 301.732146
## iter  20 value 228.159485
## iter  30 value 218.475866
## iter  40 value 215.476581
## final  value 215.476420 
## converged
##                Df       AIC
## <none>         10  446.2582
## - pitch         8  446.9528
## - height        8  576.2158
## - aircraft      8  638.4168
## - speed_ground  8 1527.1528

To compare the two models, the difference in deviance (deviance(mmodi) - deviance(mmod)) and the difference in degrees of freedom (mmod$edf - mmodi$edf) are computed. A chi-squared test on the deviance difference (a likelihood-ratio test) is then performed to determine whether the reduced model fits the data significantly worse than the full model. The resulting p-value is greater than 0.05, so we fail to reject the null hypothesis that the two models fit equally well. This suggests that the simplified model fits about as well while removing unnecessary complexity.

## Difference in Deviance between the models: 2.092285
## Difference in Degrees of Freedom between the models: 2
## Chi-squared test p-value: 0.3512902
## Since the p-value is greater than 0.05, we fail to reject the null hypothesis.
## The reduced model (mmodi) is preferred as it is not significantly worse than the full model (mmod).
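The test can be reproduced by hand from the numbers reported in the two model summaries. The sketch below plugs in the residual deviances (424.166 and 426.2582) and the effective degrees of freedom (12 and 10) directly. The variable names are hypothetical.

```r
# Values copied from the model output above.
dev_full    <- 424.166    # residual deviance of mmod
edf_full    <- 12         # parameters in mmod
dev_reduced <- 426.2582   # residual deviance of mmodi
edf_reduced <- 10         # parameters in mmodi

# Test statistic and its degrees of freedom.
dev_diff <- dev_reduced - dev_full
df_diff  <- edf_full - edf_reduced

# Upper-tail chi-squared probability.
p_value <- pchisq(dev_diff, df = df_diff, lower.tail = FALSE)

c(dev_diff = dev_diff, df_diff = df_diff, p_value = p_value)
```

The result agrees with the reported p-value of about 0.35, up to rounding of the printed deviances.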

An important observation from the stepwise selection is that no_pasg does not contribute to predicting Y: dropping it lowers the AIC from 448.17 to 446.26. Pitch is retained in the reduced model, but only marginally, since dropping it would change the AIC by less than 1 (446.95 versus 446.26). From the chi-squared test above, the p-value (0.35) is greater than the 0.05 significance level, so we fail to reject the null hypothesis that the reduced model fits as well as the full model. The reduced model (mmodi) is therefore selected: it has the lower AIC and is less complex than the full model. It can now be used for classifying landing distances.

## Call:
## multinom(formula = Y ~ aircraft + speed_ground + height + pitch, 
##     data = FAA)
## 
## Coefficients:
##   (Intercept)  aircraft speed_ground    height      pitch
## 2   -18.38264 -4.088694     0.248289 0.1483432 -0.2435132
## 3  -133.05937 -9.220876     1.271200 0.4062727  1.2965530
## 
## Std. Errors:
##   (Intercept)  aircraft speed_ground     height     pitch
## 2  1.86536167 0.4227738   0.01998616 0.01732038 0.2637264
## 3  0.03508062 0.8790709   0.03136898 0.03966359 0.7282156
## 
## Residual Deviance: 426.2582 
## AIC: 446.2582

Let us plot the histograms for the independent variables against the dependent variable Y and visualize the distribution.

The speed_ground vs. Y histogram reveals that higher ground speeds (above 95) are more strongly associated with Y = 3, indicating that long landing distances tend to occur at higher ground speeds. In contrast, short landing distances (Y = 1) are concentrated in lower speed ranges. Although there is some overlap between Y = 2 and Y = 3, the distinct peaks suggest that speed_ground is an important predictor for landing distance classification.

The aircraft vs. Y histogram displays only two distinct values, because aircraft is a binary variable (the two aircraft makes). Since the distribution of Y categories does not show a clear pattern across these values, the histogram alone suggests that aircraft type may not be a strong predictor for classifying landing distances. The fitted model indicates otherwise, however: dropping aircraft raises the AIC from 446.26 to 638.42.

The height vs. Y histogram shows a wide spread of values, but there is no clear separation among the different categories of Y. The overlapping distributions suggest that height on its own does little to differentiate between landing distances. In the model, though, dropping height raises the AIC to 576.22, so it does matter alongside the other predictors.

The pitch vs. Y histogram also exhibits nearly identical distributions for all Y categories. The substantial overlap across the Y values suggests that pitch has little impact on landing distance classification, indicating that it may not add much value to the model.

• speed_ground is a strong predictor of Y, as its distribution varies markedly across categories.
• Aircraft type may have some influence, but further statistical validation (e.g., chi-square test) is required.
• Height and pitch do not show strong differentiation across categories in the histograms, suggesting that on their own they may not be useful predictors.

From the above histograms, we can see that when speed_ground rises above 95, the risk of a long landing distance (and hence of an overrun) is high, and there is some association with aircraft. For height and pitch, the distributions alone give no insight into the risk.

Question -2:

The number of passengers is often of interest to airlines. Because it is a count, we model it with a Poisson distribution.

Let us now see if we can predict the number of passengers on board using the other variables.

Let us fit a generalised linear model with no_pasg as the response variable and a Poisson distribution.

## 
## Call:
## glm(formula = no_pasg ~ ., family = poisson, data = FAA)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   4.0748345  0.0465676  87.504   <2e-16 ***
## aircraft     -0.0013870  0.0105428  -0.132   0.8953    
## speed_ground  0.0004868  0.0004341   1.121   0.2622    
## height        0.0008092  0.0004850   1.668   0.0953 .  
## pitch        -0.0027286  0.0091092  -0.300   0.7645    
## Y            -0.0173879  0.0131619  -1.321   0.1865    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 783.06  on 830  degrees of freedom
## Residual deviance: 779.11  on 825  degrees of freedom
## AIC: 5717.3
## 
## Number of Fisher Scoring iterations: 4
## Start:  AIC=5717.25
## no_pasg ~ aircraft + speed_ground + height + pitch + Y
## 
##                Df Deviance    AIC
## - aircraft      1   779.13 5715.3
## - pitch         1   779.20 5715.3
## - speed_ground  1   780.37 5716.5
## - Y             1   780.86 5717.0
## <none>              779.11 5717.3
## - height        1   781.89 5718.0
## 
## Step:  AIC=5715.27
## no_pasg ~ speed_ground + height + pitch + Y
## 
##                Df Deviance    AIC
## - pitch         1   779.20 5713.3
## - speed_ground  1   780.45 5714.6
## - Y             1   781.06 5715.2
## <none>              779.13 5715.3
## - height        1   781.91 5716.0
## 
## Step:  AIC=5713.34
## no_pasg ~ speed_ground + height + Y
## 
##                Df Deviance    AIC
## - speed_ground  1   780.67 5712.8
## <none>              779.20 5713.3
## - Y             1   781.34 5713.5
## - height        1   782.02 5714.2
## 
## Step:  AIC=5712.81
## no_pasg ~ height + Y
## 
##          Df Deviance    AIC
## - Y       1   781.35 5711.5
## - height  1   782.65 5712.8
## <none>        780.67 5712.8
## 
## Step:  AIC=5711.49
## no_pasg ~ height
## 
##          Df Deviance    AIC
## - height  1   783.06 5711.2
## <none>        781.35 5711.5
## 
## Step:  AIC=5711.2
## no_pasg ~ 1
## 
## Call:  glm(formula = no_pasg ~ 1, family = poisson, data = FAA)
## 
## Coefficients:
## (Intercept)  
##       4.095  
## 
## Degrees of Freedom: 830 Total (i.e. Null);  830 Residual
## Null Deviance:       783.1 
## Residual Deviance: 783.1     AIC: 5711
## Single term deletions
## 
## Model:
## no_pasg ~ aircraft + speed_ground + height + pitch + Y
##              Df Deviance    AIC     LRT Pr(>Chi)  
## <none>            779.11 5717.3                   
## aircraft      1   779.13 5715.3 0.01731   0.8953  
## speed_ground  1   780.37 5716.5 1.25859   0.2619  
## height        1   781.89 5718.0 2.78240   0.0953 .
## pitch         1   779.20 5715.3 0.08973   0.7645  
## Y             1   780.86 5717.0 1.74622   0.1864  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Poisson Model Fit:
• A Poisson regression model was fitted with no_pasg (number of passengers) as the dependent variable and various predictors (aircraft, speed_ground, height, pitch, Y).
• The model’s AIC was 5717.3, and the residual deviance was 779.11.

Coefficient Significance:
• None of the predictor variables are statistically significant (all p-values > 0.05).
• The closest is height (p-value 0.0953), but it is still not significant at the 0.05 level.

Stepwise Selection:
• The stepwise model selection process removed all predictor variables.
• The final model includes only the intercept, suggesting that none of the predictors significantly contribute to predicting no_pasg.

Single-Term Deletion:
• The likelihood ratio test (LRT) confirms that removing any individual variable does not lead to a significant increase in deviance.
• All p-values are above 0.05, reinforcing that these variables do not explain the variation in no_pasg.

From the above models, we can see that none of the variables in the FAA dataset can predict the number of passengers on board.
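The three steps summarised above (Poisson fit, stepwise selection, single-term deletions) can be sketched on simulated data. The data frame below is invented: the response is drawn independently of the predictors, which mimics the situation found here. The names pmod and pmod_step are hypothetical.

```r
# Simulated data (not the FAA data): the count ignores both predictors.
set.seed(7)
n   <- 500
sim <- data.frame(
  height = rnorm(n, mean = 30, sd = 10),
  pitch  = rnorm(n, mean = 4, sd = 0.5)
)
sim$no_pasg <- rpois(n, lambda = 60)

# Poisson regression with all predictors.
pmod <- glm(no_pasg ~ ., family = poisson, data = sim)
summary(pmod)

# Backward stepwise selection by AIC.
pmod_step <- step(pmod, trace = 0)
formula(pmod_step)

# Likelihood-ratio test for dropping each term from the full model.
drop1(pmod, test = "Chisq")
```

When the predictors carry no information, step() will often shrink the model towards no_pasg ~ 1, as it does in the FAA output above. With random data, an irrelevant term can occasionally survive by chance.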

The number of passengers on board is modeled with a Poisson distribution, which is appropriate for count data. The residual deviance (779.11 on 825 degrees of freedom) is close to its degrees of freedom, which is consistent with that choice. However, none of the predictor variables in the FAA dataset, including aircraft type, speed, height, pitch, and variable Y, were statistically significant in explaining the variation in passenger numbers. The stepwise selection process removed all predictors one by one, leaving only the intercept in the final model. This indicates that the available variables do not help predict the number of passengers on board. The likelihood ratio tests further confirm that removing any of these predictors does not significantly affect the model fit. Ultimately, the best-fitting model suggests that passenger numbers remain largely unpredictable given the provided dataset.

To improve predictive accuracy, it is necessary to explore additional features that might better explain passenger numbers. Factors such as flight type (domestic vs. international), time of day, day of the week, seasonal trends, weather conditions, and airline capacity could be more relevant predictors. Additionally, it would be beneficial to examine potential data quality issues, missing variables, or measurement errors that could be limiting the model’s effectiveness. If overdispersion is a concern, considering a negative binomial regression could provide better model performance. Further exploratory data analysis (EDA) can also help uncover patterns or transformations that might enhance the predictive power of the model. Finally, incorporating machine learning techniques, such as decision trees or ensemble models, might capture nonlinear relationships that a Poisson regression fails to detect.
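Whether overdispersion is a concern can be gauged roughly from the numbers already reported. The sketch below uses the residual deviance and degrees of freedom of the full Poisson model, and back-transforms the intercept of the final model. A deviance-to-df ratio near 1 is only an informal check, and the variable names are hypothetical.

```r
# Values copied from the Poisson model output above.
resid_dev <- 779.11   # residual deviance of the full model
resid_df  <- 825      # residual degrees of freedom
intercept <- 4.095    # intercept of the final model no_pasg ~ 1

# Ratio well above 1 would point to overdispersion.
dispersion_ratio <- resid_dev / resid_df

# The Poisson model uses a log link, so exp(intercept) is the fitted mean.
fitted_mean <- exp(intercept)

c(dispersion_ratio = dispersion_ratio, fitted_mean = fitted_mean)
```

The ratio is about 0.94, so there is no sign of overdispersion here, and the fitted mean of about 60 passengers matches the sample mean of no_pasg (60.06) in the Step 7 summary.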