Packages Used
Variable Descriptions
Aircraft: The make of an aircraft (Boeing or Airbus).
Duration (in minutes): Flight duration between taking off and landing. The duration of a normal flight should always be greater than 40 minutes.
No_pasg: The number of passengers in a flight.
Speed_ground (in miles per hour): The ground speed of an aircraft when passing over the threshold of the runway. If its value is less than 30 MPH or greater than 140 MPH, the landing is considered abnormal.
Speed_air (in miles per hour): The air speed of an aircraft when passing over the threshold of the runway. If its value is less than 30 MPH or greater than 140 MPH, the landing is considered abnormal.
Height (in meters): The height of an aircraft when it is passing over the threshold of the runway. The landing aircraft is required to be at least 6 meters high at the threshold of the runway.
Pitch (in degrees): Pitch angle of an aircraft when it is passing over the threshold of the runway.
Distance (in feet): The landing distance of an aircraft. More specifically, it refers to the distance between the threshold of the runway and the point where the aircraft can be fully stopped. The length of the airport runway is typically less than 6000 feet.
Steps 1-9 of Part-1 of the Project:
Step 1: Reading the datasets
Read the two files ‘FAA-1.xls’ (800 flights) and ‘FAA-2.xls’ into your R system. Please search “Read Excel files from R” in Google in case you do not know how to do that.
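A minimal sketch of this step, assuming the readxl package is installed and both workbooks sit in the working directory. The helper name `read_faa` is hypothetical and is not part of the original analysis; it only guards against a missing file.

```r
# Sketch only: assumes readxl is installed and the files are in the
# current working directory. read_faa() is a hypothetical helper.
read_faa <- function(path) {
  if (!file.exists(path)) {
    message("File not found, skipping: ", path)
    return(NULL)
  }
  readxl::read_excel(path)
}

faa1 <- read_faa("FAA-1.xls")
faa2 <- read_faa("FAA-2.xls")

# Inspect structure and dimensions once the data are loaded
if (!is.null(faa1)) str(faa1)
if (!is.null(faa2)) str(faa2)
```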
Step 2: Let us look at the structure and dimensions of the datasets and check whether the data types and column names are appropriate.
## tibble [800 × 8] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:800] "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num [1:800] 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num [1:800] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:800] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:800] 109 103 NA NA NA ...
## $ height : num [1:800] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:800] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:800] 3370 2988 1145 1664 1050 ...
We can observe that there are 800 observations and 8 variables in the faa1 dataset.
## tibble [150 × 7] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:150] "boeing" "boeing" "boeing" "boeing" ...
## $ no_pasg : num [1:150] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:150] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:150] 109 103 NA NA NA ...
## $ height : num [1:150] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:150] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:150] 3370 2988 1145 1664 1050 ...
We can observe that there are 150 observations and 7 variables in the faa2 dataset. It has no duration column, so we create an empty duration column in order to merge the two datasets.
Step 3: Merging the datasets and checking for duplicate records.
## [1] 950 8
## [1] 100
We can observe that the merged dataset has 950 rows and 8 columns, of which 100 are duplicate records. Once these duplicates are removed, 850 observations remain.
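The merge and duplicate check can be sketched in base R as follows. The two small data frames are made-up stand-ins for faa1 and faa2, and comparing rows on every column except duration is an assumption about how the duplicates were identified, since duration is missing throughout faa2.

```r
# Toy stand-ins for faa1 (has duration) and faa2 (no duration); values are made up
faa1 <- data.frame(
  aircraft = c("boeing", "airbus", "boeing"),
  duration = c(98.5, 125.7, 112.0),
  no_pasg = c(53, 69, 61),
  speed_ground = c(107.9, 101.7, 71.1),
  stringsAsFactors = FALSE
)
faa2 <- data.frame(
  aircraft = c("boeing", "airbus"),
  no_pasg = c(53, 70),
  speed_ground = c(107.9, 59.9),
  stringsAsFactors = FALSE
)

# Add an empty duration column so both tables share the same columns
faa2$duration <- NA_real_
merged <- rbind(faa1, faa2[, names(faa1)])

# duration is missing in faa2, so compare rows on the remaining columns
key_cols <- setdiff(names(merged), "duration")
is_dup <- duplicated(merged[, key_cols])

sum(is_dup)               # number of duplicate records
FAA <- merged[!is_dup, ]  # keep the first occurrence of each record
dim(FAA)
```

Because `duplicated()` flags only the later copies, the rows kept are the ones that came from the first table and therefore still carry their duration values.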
Step 4: Let us look at the structure of the merged dataset, dimension and summary statistics for each variable.
## tibble [850 × 8] (S3: tbl_df/tbl/data.frame)
## $ aircraft : chr [1:850] "boeing" "boeing" "boeing" "boeing" ...
## $ duration : num [1:850] 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num [1:850] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:850] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:850] 109 103 NA NA NA ...
## $ height : num [1:850] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:850] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:850] 3370 2988 1145 1664 1050 ...
We can observe that there are 850 observations and 8 variables in the merged dataset.
Summary statistics for each variable:
## aircraft duration no_pasg speed_ground
## Length:850 Min. : 14.76 Min. :29.0 Min. : 27.74
## Class :character 1st Qu.:119.49 1st Qu.:55.0 1st Qu.: 65.90
## Mode :character Median :153.95 Median :60.0 Median : 79.64
## Mean :154.01 Mean :60.1 Mean : 79.45
## 3rd Qu.:188.91 3rd Qu.:65.0 3rd Qu.: 92.06
## Max. :305.62 Max. :87.0 Max. :141.22
## NA's :50
## speed_air height pitch distance
## Min. : 90.00 Min. :-3.546 Min. :2.284 Min. : 34.08
## 1st Qu.: 96.25 1st Qu.:23.314 1st Qu.:3.642 1st Qu.: 883.79
## Median :101.15 Median :30.093 Median :4.008 Median :1258.09
## Mean :103.80 Mean :30.144 Mean :4.009 Mean :1526.02
## 3rd Qu.:109.40 3rd Qu.:36.993 3rd Qu.:4.377 3rd Qu.:1936.95
## Max. :141.72 Max. :59.946 Max. :5.927 Max. :6533.05
## NA's :642
## [1] 850 8
Step 5:
- There are a few abnormal observations in some columns, which are removed later in the analysis.
- There are missing values in two columns: duration (50) and speed_air (642).
- The data is balanced, as the proportions of observations for the two aircraft makes are almost the same.
Step 6:
Checking for abnormal values in the variables of the dataset and removing them.
## [1] 850 8
Filtering out the observations with abnormal values removes 19 observations, leaving 831 of the 850.
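One way to express this filter in base R is sketched below on made-up rows. The thresholds come from the variable descriptions; treating a landing distance of 6000 feet or more as abnormal is an assumption based on the typical runway length stated there, and the helper `ok` is hypothetical. Missing values are kept, since duration and speed_air still contain NAs after cleaning.

```r
# Made-up rows for illustration; not the FAA data
faa <- data.frame(
  duration = c(98.5, 30.0, NA, 150.2, 120.0, 110.0),
  speed_ground = c(107.9, 80.0, 71.1, 145.0, 90.0, 60.0),
  speed_air = c(109.0, NA, NA, 141.7, NA, NA),
  height = c(27.4, 30.1, 18.6, 25.0, 4.2, 32.4),
  distance = c(3370, 1500, 1145, 6533, 900, 1050)
)

# TRUE when the value is missing or satisfies the rule, so NAs are kept
ok <- function(cond) is.na(cond) | cond

keep <- ok(faa$duration > 40) &
  ok(faa$speed_ground >= 30 & faa$speed_ground <= 140) &
  ok(faa$speed_air >= 30 & faa$speed_air <= 140) &
  ok(faa$height >= 6) &
  ok(faa$distance < 6000)  # assumed cut-off, see the note above

faa_clean <- faa[keep, ]
nrow(faa) - nrow(faa_clean)  # number of observations removed
```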
Step 7:
Let us again look at the structure of the merged dataset, dimension and summary statistics for each variable.
## tibble [831 × 8] (S3: tbl_df/tbl/data.frame)
## $ aircraft : num [1:831] 0 0 0 0 0 0 0 0 0 0 ...
## $ duration : num [1:831] 98.5 125.7 112 196.8 90.1 ...
## $ no_pasg : num [1:831] 53 69 61 56 70 55 54 57 61 56 ...
## $ speed_ground: num [1:831] 107.9 101.7 71.1 85.8 59.9 ...
## $ speed_air : num [1:831] 109 103 NA NA NA ...
## $ height : num [1:831] 27.4 27.8 18.6 30.7 32.4 ...
## $ pitch : num [1:831] 4.04 4.12 4.43 3.88 4.03 ...
## $ distance : num [1:831] 3370 2988 1145 1664 1050 ...
## aircraft duration no_pasg speed_ground
## Min. :0.0000 Min. : 41.95 Min. :29.00 Min. : 33.57
## 1st Qu.:0.0000 1st Qu.:119.63 1st Qu.:55.00 1st Qu.: 66.20
## Median :1.0000 Median :154.28 Median :60.00 Median : 79.79
## Mean :0.5343 Mean :154.78 Mean :60.06 Mean : 79.54
## 3rd Qu.:1.0000 3rd Qu.:189.66 3rd Qu.:65.00 3rd Qu.: 91.91
## Max. :1.0000 Max. :305.62 Max. :87.00 Max. :132.78
## NA's :50
## speed_air height pitch distance
## Min. : 90.00 Min. : 6.228 Min. :2.284 Min. : 41.72
## 1st Qu.: 96.23 1st Qu.:23.530 1st Qu.:3.640 1st Qu.: 893.28
## Median :101.12 Median :30.167 Median :4.001 Median :1262.15
## Mean :103.48 Mean :30.458 Mean :4.005 Mean :1522.48
## 3rd Qu.:109.36 3rd Qu.:37.004 3rd Qu.:4.370 3rd Qu.:1936.63
## Max. :132.91 Max. :59.946 Max. :5.927 Max. :5381.96
## NA's :628
Step 8: Let us plot histograms for all the numerical variables.
Step 9:
- We can observe from the histograms that the columns duration, no_pasg, speed_ground, height and pitch are almost symmetric.
- The speed_air column is right-skewed.
- The distance column is also right-skewed.
Part 3
Question -1:
We create a multinomial variable Y, attach it to the dataset, and discard the distance variable. The duration and speed_air variables are also removed because they contain missing values, which can mislead the interpretation and influence the AIC value.
The analysis begins with the creation of a multinomial variable Y, which categorizes landings by landing distance into three groups: Y = 1 (< 1000 feet), Y = 2 (1000 to 2500 feet), and Y = 3 (>= 2500 feet). The original distance variable is then removed, as it has been encoded into Y. Additionally, the variables duration and speed_air are dropped due to missing values, which could otherwise lead to misleading interpretations and distort the Akaike Information Criterion (AIC) comparisons used in model selection.
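The encoding can be sketched with `cut()` in base R. The data frame below is made up for illustration, and storing Y as a factor with labels 1, 2 and 3 is an assumption; the thresholds are the ones stated above.

```r
# Made-up rows for illustration; not the FAA data
FAA <- data.frame(
  distance = c(450, 999.9, 1000, 1800, 2499, 2500, 5200),
  duration = c(98.5, NA, 112.0, 196.8, 90.1, NA, 150.0),
  speed_air = c(NA, NA, NA, 103, NA, 109, 130),
  speed_ground = c(60, 65, 70, 101, 95, 108, 128)
)

# right = FALSE gives the intervals [-Inf, 1000), [1000, 2500), [2500, Inf)
FAA$Y <- cut(FAA$distance,
             breaks = c(-Inf, 1000, 2500, Inf),
             labels = c(1, 2, 3),
             right = FALSE)

# Drop distance (now encoded in Y) and the columns with missing values
FAA <- FAA[, setdiff(names(FAA), c("distance", "duration", "speed_air"))]

table(FAA$Y)
```

Setting `right = FALSE` matters at the boundaries: a distance of exactly 1000 falls in group 2 and exactly 2500 falls in group 3, matching the definition of the groups.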
Let us now build a model on the multinomial data. A multinomial logistic regression model (mmod) is built using the modified dataset, incorporating all remaining variables as predictors of Y. The model summary provides the estimated coefficients and standard errors for each variable. To improve model efficiency and interpretability, stepwise selection (step(mmod)) is applied to select the best model based on AIC. The reduced model (mmodi) is expected to balance predictive accuracy and complexity, as a lower AIC indicates a better trade-off between fit and number of parameters.
## # weights: 21 (12 variable)
## initial value 912.946812
## iter 10 value 550.235799
## iter 20 value 230.358678
## iter 30 value 213.877248
## iter 40 value 212.913549
## iter 50 value 212.082977
## iter 50 value 212.082976
## iter 50 value 212.082976
## final value 212.082976
## converged
## Call:
## multinom(formula = Y ~ ., data = FAA)
##
## Coefficients:
## (Intercept) aircraft no_pasg speed_ground height pitch
## 2 -17.01063 -4.113469 -0.02504101 0.2497376 0.1501956 -0.2454421
## 3 -131.64017 -9.245194 -0.02461303 1.2720814 0.4078697 1.2926411
##
## Std. Errors:
## (Intercept) aircraft no_pasg speed_ground height pitch
## 2 2.06989910 0.4253515 0.01745237 0.02010602 0.01741352 0.2636494
## 3 0.03667468 0.8835064 0.05666093 0.04211307 0.04645304 0.7289671
##
## Residual Deviance: 424.166
## AIC: 448.166
Let us select the model based on AIC.
## Start: AIC=448.17
## Y ~ aircraft + no_pasg + speed_ground + height + pitch
##
## trying - aircraft
## # weights: 18 (10 variable)
## initial value 912.946812
## iter 10 value 498.723154
## iter 20 value 311.411199
## iter 30 value 310.489234
## final value 310.455711
## converged
## trying - no_pasg
## # weights: 18 (10 variable)
## initial value 912.946812
## iter 10 value 399.910023
## iter 20 value 227.526993
## iter 30 value 219.561515
## iter 40 value 213.131118
## final value 213.129119
## converged
## trying - speed_ground
## # weights: 18 (10 variable)
## initial value 912.946812
## iter 10 value 760.358638
## final value 755.295778
## converged
## trying - height
## # weights: 18 (10 variable)
## initial value 912.946812
## iter 10 value 415.116775
## iter 20 value 280.800027
## iter 30 value 279.414846
## iter 40 value 279.375833
## final value 279.375822
## converged
## trying - pitch
## # weights: 18 (10 variable)
## initial value 912.946812
## iter 10 value 393.555153
## iter 20 value 226.292093
## iter 30 value 219.449254
## iter 40 value 214.431376
## final value 214.431034
## converged
## Df AIC
## - no_pasg 10 446.2582
## <none> 12 448.1660
## - pitch 10 448.8621
## - height 10 578.7516
## - aircraft 10 640.9114
## - speed_ground 10 1530.5916
## # weights: 18 (10 variable)
## initial value 912.946812
## iter 10 value 399.910023
## iter 20 value 227.526993
## iter 30 value 219.561515
## iter 40 value 213.131118
## final value 213.129119
## converged
##
## Step: AIC=446.26
## Y ~ aircraft + speed_ground + height + pitch
##
## trying - aircraft
## # weights: 15 (8 variable)
## initial value 912.946812
## iter 10 value 371.812672
## iter 20 value 311.876265
## iter 30 value 311.256957
## final value 311.208398
## converged
## trying - speed_ground
## # weights: 15 (8 variable)
## initial value 912.946812
## iter 10 value 755.684263
## final value 755.576400
## converged
## trying - height
## # weights: 15 (8 variable)
## initial value 912.946812
## iter 10 value 330.387665
## iter 20 value 282.022563
## iter 30 value 280.226638
## final value 280.107902
## converged
## trying - pitch
## # weights: 15 (8 variable)
## initial value 912.946812
## iter 10 value 301.732146
## iter 20 value 228.159485
## iter 30 value 218.475866
## iter 40 value 215.476581
## final value 215.476420
## converged
## Df AIC
## <none> 10 446.2582
## - pitch 8 446.9528
## - height 8 576.2158
## - aircraft 8 638.4168
## - speed_ground 8 1527.1528
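For readers who want to reproduce the workflow, here is a minimal sketch of fitting a multinomial model and running AIC-based backward selection with the nnet package. The data are simulated (the names `sim` and `sim_dist` and every coefficient in the simulation are made up), so the numbers it produces will not match the output above.

```r
library(nnet)  # recommended package, bundled with standard R installations

set.seed(1)
n <- 300
sim <- data.frame(
  aircraft = rbinom(n, 1, 0.5),
  no_pasg = rpois(n, 60),
  speed_ground = rnorm(n, 80, 18),
  height = rnorm(n, 30, 9),
  pitch = rnorm(n, 4, 0.5)
)

# Made-up landing distance, driven mainly by speed_ground, plus noise
sim_dist <- 40 * sim$speed_ground - 400 * sim$aircraft +
  14 * sim$height - 2000 + rnorm(n, 0, 400)
sim$Y <- cut(sim_dist, breaks = c(-Inf, 1000, 2500, Inf),
             labels = c(1, 2, 3), right = FALSE)

# Full model with every remaining variable as a predictor
mmod <- multinom(Y ~ ., data = sim, trace = FALSE)

# Backward selection by AIC; trace = 0 hides the step-by-step log
mmodi <- step(mmod, trace = 0)

c(full = AIC(mmod), reduced = AIC(mmodi))
formula(mmodi)
```

Passing `trace = FALSE` to `multinom()` also silences the iteration log in every refit that `step()` performs, because the argument is stored in the model call.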
To compare the two models, the difference in deviance (deviance(mmodi) - deviance(mmod)) and the difference in degrees of freedom (mmod$edf - mmodi$edf) are computed. A Chi-squared test on the deviance difference (a likelihood ratio test) is then performed to determine whether the reduced model fits the data significantly worse than the original model. The resulting p-value is greater than 0.05, meaning we fail to reject the null hypothesis that the two models fit equally well. This suggests that the simplified model retains its predictive performance while reducing unnecessary complexity.
## Difference in Deviance between the models: 2.092285
## Difference in Degrees of Freedom between the models: 2
## Chi-squared test p-value: 0.3512902
## Since the p-value is greater than 0.05, we fail to reject the null hypothesis.
## The reduced model (mmodi) is preferred as it is not significantly worse than the full model (mmod).
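The p-value can be checked by hand from the residual deviances and parameter counts reported in the model summaries of this section. The sketch below copies those printed values rather than refitting the models; the variable names are illustrative.

```r
# Values copied from the printed model summaries
dev_full <- 424.166      # residual deviance of mmod
dev_reduced <- 426.2582  # residual deviance of mmodi
edf_full <- 12           # parameters in mmod
edf_reduced <- 10        # parameters in mmodi

dev_diff <- dev_reduced - dev_full
df_diff <- edf_full - edf_reduced

# Upper-tail probability of a chi-squared variable with df_diff degrees of freedom
p_value <- pchisq(dev_diff, df = df_diff, lower.tail = FALSE)
p_value
```

With two degrees of freedom the upper-tail probability equals exp(-dev_diff / 2), which gives approximately 0.351, in agreement with the reported p-value.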
An important observation from the stepwise selection process is that the no_pasg variable does not contribute to predicting Y: dropping it lowers the AIC from 448.17 to 446.26. Its exclusion does not negatively impact the model’s fit, further justifying the decision to retain the reduced model. From the Chi-squared test above, the p-value (0.35) is greater than the 0.05 significance level, so we fail to reject the null hypothesis that the reduced model fits as well as the full model. The final model (mmodi) is therefore selected for its lower AIC value and lower complexity, and it can be used for classifying landing distance. We can also see that pitch adds little to the model: dropping it from the reduced model would change the AIC only from 446.26 to 446.95. Because that is still a slight increase, stepwise selection keeps pitch in the final model.
## Call:
## multinom(formula = Y ~ aircraft + speed_ground + height + pitch,
## data = FAA)
##
## Coefficients:
## (Intercept) aircraft speed_ground height pitch
## 2 -18.38264 -4.088694 0.248289 0.1483432 -0.2435132
## 3 -133.05937 -9.220876 1.271200 0.4062727 1.2965530
##
## Std. Errors:
## (Intercept) aircraft speed_ground height pitch
## 2 1.86536167 0.4227738 0.01998616 0.01732038 0.2637264
## 3 0.03508062 0.8790709 0.03136898 0.03966359 0.7282156
##
## Residual Deviance: 426.2582
## AIC: 446.2582
Let us plot the histograms for the independent variables against the dependent variable Y and visualize the distribution.
The speed_ground vs. Y histogram reveals that higher speeds (above 95) are more strongly associated with Y = 3, indicating that long landing distances tend to occur at higher ground speeds. In contrast, short landing distances (Y = 1) are concentrated in the lower speed ranges. Although there is some overlap between Y = 2 and Y = 3, the distinct peaks suggest that speed_ground is an important predictor for landing distance classification.
The aircraft vs. Y histogram displays only two distinct values, because aircraft is a binary variable (0 or 1 for the two makes). The distribution of Y categories does not show a clear pattern across these values, so on its own aircraft does not look like a strong predictor. The stepwise output tells a different story once the other predictors are in the model: dropping aircraft raises the AIC from 446.26 to 638.42.
The height vs. Y histogram shows a wide spread of values, but there is no clear separation among the different categories of Y. The overlapping distributions suggest that height alone does little to differentiate between landing distance categories, although dropping it from the model raises the AIC from 446.26 to 576.22.
The pitch vs. Y histogram also exhibits nearly identical distributions for all Y categories. The substantial overlap across the Y values suggests that pitch has little impact on landing distance classification, indicating that it may not add much value to the model.
• speed_ground is a strong predictor of Y, as its distribution varies significantly across categories.
• Aircraft type may have some influence, but further statistical validation (e.g., chi-square test) is required.
• Height and pitch do not show strong differentiation across categories in the histograms, suggesting that on their own they may not be useful predictors.
From the above histograms, we can see that when speed_ground rises above 95 the risk of overrun is high, and that there is some association with aircraft. For height and pitch, however, we cannot infer anything about the risk from the distributions.
Question -2:
Because the number of passengers is often of interest to airlines, and it is a count, we model it with a Poisson distribution.
Let us now see if we can predict the number of passengers on board using other variables.
Let us fit a generalized linear model with no_pasg as the response variable and a Poisson distribution for the response.
##
## Call:
## glm(formula = no_pasg ~ ., family = poisson, data = FAA)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.0748345 0.0465676 87.504 <2e-16 ***
## aircraft -0.0013870 0.0105428 -0.132 0.8953
## speed_ground 0.0004868 0.0004341 1.121 0.2622
## height 0.0008092 0.0004850 1.668 0.0953 .
## pitch -0.0027286 0.0091092 -0.300 0.7645
## Y -0.0173879 0.0131619 -1.321 0.1865
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 783.06 on 830 degrees of freedom
## Residual deviance: 779.11 on 825 degrees of freedom
## AIC: 5717.3
##
## Number of Fisher Scoring iterations: 4
## Start: AIC=5717.25
## no_pasg ~ aircraft + speed_ground + height + pitch + Y
##
## Df Deviance AIC
## - aircraft 1 779.13 5715.3
## - pitch 1 779.20 5715.3
## - speed_ground 1 780.37 5716.5
## - Y 1 780.86 5717.0
## <none> 779.11 5717.3
## - height 1 781.89 5718.0
##
## Step: AIC=5715.27
## no_pasg ~ speed_ground + height + pitch + Y
##
## Df Deviance AIC
## - pitch 1 779.20 5713.3
## - speed_ground 1 780.45 5714.6
## - Y 1 781.06 5715.2
## <none> 779.13 5715.3
## - height 1 781.91 5716.0
##
## Step: AIC=5713.34
## no_pasg ~ speed_ground + height + Y
##
## Df Deviance AIC
## - speed_ground 1 780.67 5712.8
## <none> 779.20 5713.3
## - Y 1 781.34 5713.5
## - height 1 782.02 5714.2
##
## Step: AIC=5712.81
## no_pasg ~ height + Y
##
## Df Deviance AIC
## - Y 1 781.35 5711.5
## - height 1 782.65 5712.8
## <none> 780.67 5712.8
##
## Step: AIC=5711.49
## no_pasg ~ height
##
## Df Deviance AIC
## - height 1 783.06 5711.2
## <none> 781.35 5711.5
##
## Step: AIC=5711.2
## no_pasg ~ 1
##
## Call: glm(formula = no_pasg ~ 1, family = poisson, data = FAA)
##
## Coefficients:
## (Intercept)
## 4.095
##
## Degrees of Freedom: 830 Total (i.e. Null); 830 Residual
## Null Deviance: 783.1
## Residual Deviance: 783.1 AIC: 5711
## Single term deletions
##
## Model:
## no_pasg ~ aircraft + speed_ground + height + pitch + Y
## Df Deviance AIC LRT Pr(>Chi)
## <none> 779.11 5717.3
## aircraft 1 779.13 5715.3 0.01731 0.8953
## speed_ground 1 780.37 5716.5 1.25859 0.2619
## height 1 781.89 5718.0 2.78240 0.0953 .
## pitch 1 779.20 5715.3 0.08973 0.7645
## Y 1 780.86 5717.0 1.74622 0.1864
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
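The three steps behind this output (fit, stepwise selection, single-term deletions) can be sketched in base R as follows. The data frame `sim` is simulated, with passenger counts generated independently of every predictor, so it only illustrates the calls and will not reproduce the numbers above.

```r
set.seed(42)
n <- 500
sim <- data.frame(
  aircraft = rbinom(n, 1, 0.5),
  speed_ground = rnorm(n, 80, 18),
  height = rnorm(n, 30, 9),
  pitch = rnorm(n, 4, 0.5)
)
# Passenger counts generated independently of every predictor (made up)
sim$no_pasg <- rpois(n, lambda = 60)

# Poisson regression with all remaining variables as predictors
pmod <- glm(no_pasg ~ ., family = poisson, data = sim)

# Backward selection by AIC; trace = 0 hides the step-by-step log
pmod_step <- step(pmod, trace = 0)
formula(pmod_step)

# Likelihood ratio test for dropping each term from the full model
drop1(pmod, test = "Chisq")

# The Poisson model uses a log link, so the intercept-only estimate
# reported above is back-transformed with exp()
exp(4.095)
```

Back-transforming the reported intercept gives roughly 60 passengers, which agrees with the mean of no_pasg in the summary statistics.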
Poisson Model Fit:
• A Poisson regression model was fitted with no_pasg (number of passengers) as the dependent variable and aircraft, speed_ground, height, pitch, and Y as predictors.
• The model’s AIC was 5717.3, and the residual deviance was 779.11.
Coefficient Significance:
• None of the predictor variables are statistically significant (all p-values > 0.05).
• The closest is height (p-value 0.0953), but it is still not significant at the 0.05 level.
Stepwise Selection:
• The stepwise model selection process removed all predictor variables.
• The final model includes only the intercept, suggesting that none of the predictors significantly contribute to predicting no_pasg.
Single-Term Deletion:
• The likelihood ratio test (LRT) confirms that removing any individual variable does not lead to a significant increase in deviance.
• All p-values are above 0.05, reinforcing that these variables do not explain the variation in no_pasg.
From the above models, we can see that none of the variables in the FAA dataset can be used to predict the number of passengers on board.
The analysis models the number of passengers on board with a Poisson distribution, which is appropriate for count data. However, none of the predictor variables in the FAA dataset, including aircraft type, speed, height, pitch, and variable Y, were statistically significant in explaining the variation in passenger numbers. The stepwise selection process systematically removed all predictors, leaving only the intercept in the final model. This indicates that the available variables do not provide meaningful insights into predicting the number of passengers on board. The likelihood ratio tests further confirm that removing any of these predictors does not significantly affect the model’s fit. Ultimately, the best-fitting model suggests that passenger numbers remain largely unpredictable given the provided dataset.
To improve predictive accuracy, it is necessary to explore additional features that might better explain passenger numbers. Factors such as flight type (domestic vs. international), time of day, day of the week, seasonal trends, weather conditions, and airline capacity could be more relevant predictors. Additionally, it would be beneficial to examine potential data quality issues, missing variables, or measurement errors that could be limiting the model’s effectiveness. The residual deviance of the full model (779.11 on 825 degrees of freedom) gives no indication of overdispersion here, but if overdispersion becomes a concern with other data, a negative binomial regression could provide better model performance. Further exploratory data analysis (EDA) can also help uncover patterns or transformations that might enhance the predictive power of the model. Finally, incorporating machine learning techniques, such as decision trees or ensemble models, might capture nonlinear relationships that a Poisson regression fails to detect.
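Whether overdispersion is a concern can be checked with a dispersion estimate before switching to a negative binomial model. The sketch below uses simulated counts (the names `y` and `fit` are illustrative, not from the analysis) and then repeats the simpler deviance-based ratio with the values reported for the full Poisson model.

```r
set.seed(7)
y <- rpois(500, lambda = 60)  # simulated counts, not the FAA data
fit <- glm(y ~ 1, family = poisson)

# Pearson dispersion estimate: values near 1 are consistent with a
# Poisson model, values well above 1 point to overdispersion
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
dispersion

# Rough deviance-based ratio from the reported full model:
# residual deviance 779.11 on 825 degrees of freedom
779.11 / 825
```

The reported ratio is slightly below 1, so under this rough check the Poisson assumption looks reasonable for no_pasg.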