Driven by my passion for aviation and background in Aeronautical
Engineering with a minor in Sheet Metal Fabrication, I’ve always been
captivated by the capabilities of our amazing flying machines from a
tender age; from the tiny Bumble Bee I to
the gigantic Antonov An-225 (Mriya). This
fascination, particularly with the aerodynamics, avionics, and
maneuvers, is why I chose the National Transportation Safety Board
(NTSB) aviation accident dataset for my final project course.
The NTSB aviation accident database is a comprehensive resource containing information on civil aviation accidents and selected incidents within the United States, its territories and possessions, and those that occurred on international waters. Records date back to 1962 and are continually updated, which gives me the assurance that the dataset is clean, original, well-maintained, and well documented.
The NTSB typically releases a preliminary report online within a few days of an accident and as the investigation progresses and more information is gathered, the preliminary report is replaced with a final report detailing the accident and its probable cause. The dataset is comprehensive enough and it is rich in both numerical and categorical data, offering a broad spectrum of analytic opportunities.
Apart from been fascinated by the technical part of the aircraft, it is also very important to care about its safety, right? Yes, I am also more concerned to know what went wrong? How to learn from what went wrong?, and how to improve the quality and safety of traveling by airplanes.
This is quite a comprehensive analysis that went over and beyond in unraveling a bunch of insights, I will be reflecting on all the topics that have been covered in this course from Week 1 to this moment, and I ensure that every step is carefully documented and explained properly for easy understanding.
These are the purpose of selecting this dataset and conducting an extensive analysis to give the right answers to my WHAT, HOW, WHY and WHAT NEXT?
Next is loading the dataset for this project, as well as the necessary libraries for the analysis.
suppressMessages({
suppressWarnings({
library(tidyverse)
library(ggthemes)
library(dplyr)
library(lubridate)
library(tseries)
library(ggrepel)
library(readr)
library(patchwork)
library(forecast)
library(broom)
library(lindia)
library(car)
library(caret)
library(tibble)
library(MASS)
aviation_accident <- read_csv("global_aviation_accident.csv")
})
})
The objective here is to analyze trends, correlations, and interactions between key variables.
summary_statistics <- summary(dplyr::select(aviation_accident, Total_Fatal_Injuries, Total_Serious_Injuries, Total_Minor_Injuries, Total_Uninjured))
print(summary_statistics)
## Total_Fatal_Injuries Total_Serious_Injuries Total_Minor_Injuries
## Min. : 0.0000 Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 0.0000 Median : 0.0000 Median : 0.0000
## Mean : 0.4551 Mean : 0.1808 Mean : 0.2925
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :78.0000 Max. :10.0000 Max. :25.0000
## NA's :14 NA's :17 NA's :16
## Total_Uninjured
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 1.00
## Mean : 2.54
## 3rd Qu.: 2.00
## Max. :182.00
## NA's :3
We can see the distribution of various injury severities and uninjured counts and I can deduce the following:
Most Common Outcomes: The majority of the cases involve no injuries, as indicated by the medians and first quartiles for fatal, serious, and minor injuries all being zero. This suggests that most incidents are non-fatal and non-serious.
Frequency of Uninjured Individuals: Most incidents also seem to have at least one uninjured individual, shown by a median of 1 uninjured person per incident.
Variability in Severity: The maximum values suggest that while most incidents are minor or result in no injuries, there are extreme cases with high numbers of fatal (up to 78 fatalities), serious (up to 10), minor (up to 25), and uninjured (up to 182) individuals.
Data Completeness: There are missing data points across all categories, indicating some gaps in the dataset that could affect the robustness of analyses derived from it.
ggplot(aviation_accident %>% filter(!is.na(Total_Uninjured)), aes(x = Total_Uninjured)) +
geom_histogram(bins = 30, fill = "blue", color = "black") +
labs(title = "Histogram of Total Uninjured", x = "Total Uninjured", y = "Frequency")
Th histogram gives me a glimpse on the distribution of the
Total Uninjured count, it is heavily
skewed to the left, indicating that the majority of the cases have zero
or a very low count of uninjured individuals. This suggests that in most
of the incidents, there are few or no uninjured people, aligning with
what could be inferred from my previous statistics.
aviation_accident %>%
filter(!is.na(Total_Fatal_Injuries) & !is.na(Total_Serious_Injuries)) %>%
ggplot(aes(x = Total_Fatal_Injuries, y = Total_Serious_Injuries)) +
geom_point(alpha = 0.5) +
labs(title = "Scatter Plot of Total Serious Injuries vs. Total Fatal Injuries", x = "Total Fatal Injuries", y = "Total Serious Injuries")
Here, I compared Total Serious Injuries
with the Total Fatal Injuries for each
incident, and most of the data points are clustered near the origin,
which suggests that most incidents result in few or no serious or fatal
injuries. There are outliers with higher numbers of fatal injuries and,
correspondingly, serious injuries, but these are exceptional cases.
There is a general trend that incidents with more fatal injuries also
tend to have more serious injuries, which is expected in severe
accidents.
Moving on with the analysis, I will compute the probability of incidents by aircraft manufacturers and try to detect anomalies in the number of fatalities.
prob_incidents <- aviation_accident %>%
group_by(Aircraft_Manufacturer) %>%
summarise(Incident_Count = n(),
Total_Incidents = nrow(aviation_accident),
Probability = Incident_Count / Total_Incidents)
print(prob_incidents)
## # A tibble: 150 × 4
## Aircraft_Manufacturer Incident_Count Total_Incidents Probability
## <chr> <int> <int> <dbl>
## 1 Aero Commander 3 1394 0.00215
## 2 Aeronca 8 1394 0.00574
## 3 Aerospatiale 7 1394 0.00502
## 4 Aerotek-pitts 1 1394 0.000717
## 5 Air Tractor 6 1394 0.00430
## 6 Airmate 1 1394 0.000717
## 7 American Yankee 1 1394 0.000717
## 8 Arctic 1 1394 0.000717
## 9 Avian Balloon 1 1394 0.000717
## 10 Avres Corp. 1 1394 0.000717
## # ℹ 140 more rows
fatal_anomalies <- aviation_accident %>%
filter(Total_Fatal_Injuries > quantile(Total_Fatal_Injuries, 0.95, na.rm = TRUE)) %>%
arrange(desc(Total_Fatal_Injuries))
print(fatal_anomalies)
## # A tibble: 43 × 16
## Investigation_Type Event_Date Location Country Injury_Severity
## <chr> <chr> <chr> <chr> <chr>
## 1 Accident 13/01/1982 WASHINGTON, DC United Sta… Fatal(78)
## 2 Accident 03/01/1982 ASHLAND, VA United Sta… Fatal(8)
## 3 Accident 07/02/1982 W. OF HOMESTEAD, FL United Sta… Fatal(8)
## 4 Accident 07/02/1982 W. OF HOMESTEAD, FL United Sta… Fatal(8)
## 5 Accident 16/02/1982 SPRINGFIELD, KY United Sta… Fatal(8)
## 6 Accident 27/05/1982 CHANDLER, AZ United Sta… Fatal(8)
## 7 Accident 24/01/1982 LAREDO, TX United Sta… Fatal(7)
## 8 Accident 05/05/1982 CHARLOTTE, TX United Sta… Fatal(7)
## 9 Accident 28/02/1982 VERDI, NV United Sta… Fatal(6)
## 10 Accident 13/03/1982 GLENDALE, AZ United Sta… Fatal(6)
## # ℹ 33 more rows
## # ℹ 11 more variables: Aircraft_Damage <chr>, Aircraft_Category <chr>,
## # Aircraft_Manufacturer <chr>, Number_of_Engines <dbl>, Engine_Type <chr>,
## # Flight_Purpose <chr>, Total_Fatal_Injuries <dbl>,
## # Total_Serious_Injuries <dbl>, Total_Minor_Injuries <dbl>,
## # Total_Uninjured <dbl>, Flight_Phase <chr>
My interpretation of the coefficients is explained below:
Incident_Count
associated with different aircraft manufacturers in relation to the
total number of incidents in the dataset
Total_Incidents is
1394.Probability column represents the
likelihood of an incident involving aircraft from each manufacturer. For
instance, Aero Commander has a probability of about
0.215%, and Aeronca has a probability of
about 0.574%.95th
percentile of the Total_Fatal_Injuries
distribution.Investigation_Type,
Event_Date,
Location,
Country,
Injury_Severity , and
Aircraft_Damage for these anomalies.The catastrophic events involving high fatality counts are anomalies within the dataset, while the overall probability of an incident varies by manufacturer but remains a relatively low percentage.
I will examine the distribution characteristics of key metrics by
sampling data and creating theoretical and empirical distributions. To
do this, I can start by sampling the
Total_Uninjured data and then calculate
the mean of these samples multiple times to create a distribution of
sample means.
According to the CLT, this distribution of sample means should be approximately normally distributed, regardless of the shape of the original data distribution, provided the sample size is sufficiently large (commonly n > 30) and the number of samples is large.
What I aim to achieve here is to generate
1000 sample means from the
Total_Uninjured column, each based on a
sample of 100 observations (with replacement of course), and then plot a
histogram of these sample means.
sample_means <- replicate(1000, mean(sample(aviation_accident$Total_Uninjured, 100, replace = TRUE)))
hist(sample_means, breaks = 30, main = "Sampling Distribution of Total Uninjured", xlab = "Sample Means", col = "green")
This provides us with the sampling distribution of the mean number of uninjured individuals per accident, based on the 1000 samples. The key observations from the histogram are:
2 to
3 range, which suggests that the average
number of uninjured individuals per accident falls within this
range.1 to around
7, which reflects the variability in the
number of uninjured individuals across different accidents.I will proceed to building confidence intervals and conduct
correlation analyses between some variables, but with the limited
variables in this dataset, I will stick with using
Number_of_Engines and
Total_Fatal_Injuries , since they both are
count variables, and it’s reasonable to look into whether there’s an
association between them. I will generate bootstrap confidence intervals
for the mean number of engines, Chi-Squared approximation test with
simulation, and use a Fisher’s Exact Test for count data:
bootstrap_means <- replicate(1000, {
sample_indices <- sample(1:nrow(aviation_accident), size = 100, replace = TRUE)
mean_engines <- mean(aviation_accident$Number_of_Engines[sample_indices], na.rm = TRUE)
return(mean_engines)
})
ci_number_of_engines <- quantile(bootstrap_means, probs = c(0.025, 0.975))
table_total_fatalities_engines <- table(aviation_accident$Total_Fatal_Injuries, aviation_accident$Number_of_Engines)
chisq_test_result <- chisq.test(table_total_fatalities_engines, simulate.p.value = TRUE, B = 2000)
list(
Confidence_Interval = ci_number_of_engines,
Chi_Squared_Test_Result = chisq_test_result
)
## $Confidence_Interval
## 2.5% 97.5%
## 1.08 1.26
##
## $Chi_Squared_Test_Result
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: table_total_fatalities_engines
## X-squared = 83.479, df = NA, p-value = 0.06047
There are a lot of data to analyse here but the key points are:
95% confidence interval for the
mean number of engines is between 1.08 and
1.26, which suggests that we can be
95% confident that the true mean number of
engines in the population from which this sample was drawn falls within
this range.1, indicating that most aircraft involved
in incidents have around one engine, on average.83.479 for the Chi-Squared test is
a measure of the difference between the observed frequencies and the
frequencies expected if there was no association between the two
variables.0.05947 is just above
the common alpha level of 0.05 for
statistical significance. This p-value, however, indicates a marginal
relationship between the number of engines and the fatal injuries; it is
close to being significant, suggesting a trend or association, but not
strong enough to conclusively reject the null hypothesis of no
association at the 5% significance level.Lest I forget, I need to emphasize that while there is a relationship between the number of engines and the number of fatalities in aviation accidents, the evidence is not strong enough to be statistically conclusive at the traditional 5% level. In the aviation world statistics, statistics indicate that crashes involving multi-engine commercial aircraft have a higher survival rate compared to single-engine aircraft.
To test my hypotheses, I will use Neyman-Pearson and Fisher’s significance testing frameworks to conduct these tests below:
aviation_accident$Year <- format(as.Date(aviation_accident$Event_Date, format = "%d/%m/%Y"), "%Y")
# Hypothesis 1: (Average total fatal injuries increased over time)
lm_results <- lm(Total_Fatal_Injuries ~ Year, data = aviation_accident)
summary(lm_results)
##
## Call:
## lm(formula = Total_Fatal_Injuries ~ Year, data = aviation_accident)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.445 -0.445 -0.445 -0.445 77.555
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.000e+00 2.342e+00 0.854 0.393
## Year1962 2.000e+00 3.312e+00 0.604 0.546
## Year1974 1.000e+00 3.312e+00 0.302 0.763
## Year1977 -8.995e-13 3.312e+00 0.000 1.000
## Year1979 -1.000e+00 3.312e+00 -0.302 0.763
## Year1981 2.000e+00 3.312e+00 0.604 0.546
## Year1982 -1.555e+00 2.343e+00 -0.664 0.507
##
## Residual standard error: 2.342 on 1373 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.004864, Adjusted R-squared: 0.0005148
## F-statistic: 1.118 on 6 and 1373 DF, p-value: 0.3491
# Hypothesis 2: (No difference in total fatal injuries between Aircraft Categories)
kruskal.test(Total_Fatal_Injuries ~ Aircraft_Category, data = aviation_accident)
##
## Kruskal-Wallis rank sum test
##
## data: Total_Fatal_Injuries by Aircraft_Category
## Kruskal-Wallis chi-squared = 3.6093, df = 4, p-value = 0.4615
0.05, indicating that there
is no statistically significant trend in fatal injuries over the
years.(0.3491) indicate that there is not enough
evidence to reject the null hypothesis of no relationship between
Year and
Total_Fatal_Injuries.(0.4615) is above the
0.05 significance level. This suggests
there is no statistically significant difference in the number of total
fatal injuries across different aircraft categories within my dataset,
well NTSB’s dataset :D.To visualize what I explained above, I will create a scatter plot for the linear model with a regression line, and also a boxplot for the fatal injuries by aircraft category to show similar medians with overlapping interquartile range:
Scatter Plot with Regression Line for Total Fatal Injuries Over Time:
aviation_accident_clean <- aviation_accident %>%
filter(!is.na(Total_Fatal_Injuries), !is.na(Year), is.finite(Total_Fatal_Injuries))
# Scatter plot with regression line
ggplot(aviation_accident_clean, aes(x = as.numeric(Year), y = Total_Fatal_Injuries)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", formula = y ~ x, color = "blue") +
theme_minimal() +
labs(title = "Trend of Total Fatal Injuries Over Time",
x = "Year",
y = "Total Fatal Injuries")
My interpretation of the scatter plot goes thus:
Boxplots for Total Fatal Injuries by Aircraft Category:
aviation_accident_clean <- aviation_accident %>%
filter(!is.na(Total_Fatal_Injuries), !is.na(Aircraft_Category), is.finite(Total_Fatal_Injuries))
ggplot(aviation_accident_clean, aes(x = Aircraft_Category, y = Total_Fatal_Injuries)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Distribution of Total Fatal Injuries by Aircraft Category",
x = "Aircraft Category",
y = "Total Fatal Injuries") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
The boxplot visualization is quite straightforward and we can clearly see the following:
Helicopter shows a
wider interquartile range and several outliers, suggesting more
variability in fatal injuries compared to other categories.Balloon and
Gyrocraft also show outliers, indicating
occasional incidents with higher numbers of fatalities.This will require me to investigate how categorical variables, by
performing ANOVA, extending the linear
regression model by incorporating multiple variables and checking for
multicollinearity.
For example, how categorical variables such as
Aircraft_Category influence continuous
variables like Total_Fatal_Injuries. Upon
doing this, I will extend the linear regression model by incorporating
multiple variables and checking for multicollinearity by include
variables like Number_of_Engines or and
Flight_Phase to ensure that I get to the
root of my analysis.
So, to start, I will encode categorical variables using dummy variables and then start with a simple linear model and extend it with multiple variables:
aviation_accident$Aircraft_Category <- as.factor(aviation_accident$Aircraft_Category)
aviation_accident$Flight_Phase <- as.factor(aviation_accident$Flight_Phase)
# Simple linear regression with a categorical variable
lm_single <- lm(Total_Fatal_Injuries ~ Aircraft_Category, data = aviation_accident)
summary(lm_single)
##
## Call:
## lm(formula = Total_Fatal_Injuries ~ Aircraft_Category, data = aviation_accident)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5000 -0.4018 -0.4018 -0.3193 7.5982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.40182 0.03027 13.273 <2e-16 ***
## Aircraft_CategoryBalloon -0.40182 0.33466 -1.201 0.230
## Aircraft_CategoryGlider -0.13866 0.24368 -0.569 0.569
## Aircraft_CategoryGyrocraft 0.09818 0.74587 0.132 0.895
## Aircraft_CategoryHelicopter -0.08249 0.10125 -0.815 0.415
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.054 on 1357 degrees of freedom
## (32 observations deleted due to missingness)
## Multiple R-squared: 0.001731, Adjusted R-squared: -0.001211
## F-statistic: 0.5884 on 4 and 1357 DF, p-value: 0.6711
# Multiple linear regression with additional categorical variables
lm_multiple <- lm(Total_Fatal_Injuries ~ Aircraft_Category + Flight_Phase + Number_of_Engines, data = aviation_accident)
summary(lm_multiple)
##
## Call:
## lm(formula = Total_Fatal_Injuries ~ Aircraft_Category + Flight_Phase +
## Number_of_Engines, data = aviation_accident)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8499 -0.4533 -0.2287 0.0069 7.5103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.19222 0.11451 1.679 0.093455 .
## Aircraft_CategoryBalloon 0.10962 0.32402 0.338 0.735181
## Aircraft_CategoryGlider 0.11199 0.24049 0.466 0.641519
## Aircraft_CategoryGyrocraft 0.06888 0.70188 0.098 0.921841
## Aircraft_CategoryHelicopter -0.20196 0.09913 -2.037 0.041805 *
## Flight_PhaseClimb 0.38682 0.17852 2.167 0.030420 *
## Flight_PhaseCruise 0.41586 0.10343 4.021 6.12e-05 ***
## Flight_PhaseDescent -0.17104 0.17316 -0.988 0.323448
## Flight_PhaseGo-around 0.56121 0.26848 2.090 0.036779 *
## Flight_PhaseLanding -0.46011 0.09542 -4.822 1.58e-06 ***
## Flight_PhaseManeuvering 0.16013 0.12226 1.310 0.190491
## Flight_PhaseOther 0.18714 0.45097 0.415 0.678224
## Flight_PhaseStanding -0.36095 0.27087 -1.333 0.182913
## Flight_PhaseTakeoff -0.22454 0.10253 -2.190 0.028695 *
## Flight_PhaseTaxi -0.47692 0.14583 -3.270 0.001101 **
## Flight_PhaseUnknown 1.13565 0.30967 3.667 0.000255 ***
## Number_of_Engines 0.26103 0.06217 4.199 2.86e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9907 on 1345 degrees of freedom
## (32 observations deleted due to missingness)
## Multiple R-squared: 0.1257, Adjusted R-squared: 0.1153
## F-statistic: 12.09 on 16 and 1345 DF, p-value: < 2.2e-16
(0.001731), the model explains virtually
none of the variance in total fatal injuries, and the R-squared being
negative indicates that the model is unsuitable for making predictions
about total fatal injuries based on aircraft category alone.(p-value = 0.6711), reinforcing that
aircraft category alone is not a good predictor of total fatal
injuries.Cruise (p < 0.0001) and
Landing (p < 1.58e-06) show significant
effects, indicating that these phases may be associated with differing
levels of fatal injuries.0.1257,
indicates that about 12.57% of the
variance in total fatal injuries can be explained by the model, although
still not a high value, but much better compared to the simple
model.(p < 2.2e-16), suggesting
that the predictors collectively influence total fatal injuries
significantly.To check for multicollinearity which is quite crucial when using multiple predictors in a linear regression, I will use the Variance Inflation Factor to dig deeper into this analysis:
vif(lm_multiple)
## GVIF Df GVIF^(1/(2*Df))
## Aircraft_Category 1.261325 4 1.029446
## Flight_Phase 1.117560 11 1.005065
## Number_of_Engines 1.167533 1 1.080524
The Variance Inflation Factor helps us to identify the presence of multicollinearity in regression analysis, I can deduce the following from the above co-coefficients:
1, which is expected since it is a
categorical variable with multiple levels (degrees of freedom, Df =
4).1, at
1.029446, suggesting that
multicollinearity is not a significant issue for this predictor.1.1 (1.005065), indicating that it’s
unlikely this predictor is causing multicollinearity concerns.1 and a
scaled VIF of 1.080524, it indicates some
multicollinearity but not enough to be generally concerning. A common
rule of thumb is that a VIF above 5 or
10 indicates a problematic amount of
multicollinearity, and this value is well below those thresholds.Visualizing the data results:
I will be using the Violin Plot to show each flight phase, grouped by the number of engines, which will also show the distributions of total fatal injuries.
These visualizations will help us in understanding the dynamics of how flight phase and aircraft technical characteristics interact to affect safety outcomes as measured by total fatal injuries.
aviation_accident_clean <- aviation_accident %>%
filter(is.finite(Total_Fatal_Injuries)) %>%
group_by(Flight_Phase, Number_of_Engines) %>%
filter(n() > 1) %>%
ungroup()
ggplot(aviation_accident_clean, aes(x = Flight_Phase, y = Total_Fatal_Injuries, fill = as.factor(Number_of_Engines))) +
geom_violin() +
labs(title = "Distribution of Total Fatal Injuries by Flight Phase and Number of Engines",
x = "Flight Phase",
y = "Total Fatal Injuries") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
The visuals from the Violin Plot for the total fatal injuries by flight phase and numbers can be summarized as below:
violin represents the kernel
density estimation of the distribution; but, it appears that the
violins are not visible, possibly because
of a very narrow distribution.aviation_means <- aviation_accident %>%
group_by(Flight_Phase, Number_of_Engines) %>%
summarise(Average_Fatal_Injuries = mean(Total_Fatal_Injuries, na.rm = TRUE), .groups = 'drop')
ggplot(aviation_means, aes(x = Flight_Phase, y = Average_Fatal_Injuries, fill = as.factor(Number_of_Engines))) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Average Total Fatal Injuries by Flight Phase and Number of Engines",
x = "Flight Phase",
y = "Average Total Fatal Injuries") +
scale_fill_brewer(palette = "Pastel1") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
The bars represent the average total fatal injuries for each flight
phase, with different colors representing different numbers of engines.
Some flight phases, such as Cruise and
Unknown show notably higher average
fatalities, especially for aircraft with a higher number of engines.
The variance in average fatal injuries across different flight phases
suggests that some phases may inherently carry different levels of risk.
There is an observable trend where certain phases like
Landing and
Takeoff have a spread of averages across
the different engine numbers, which might indicate an influence of
engine count on injury outcomes during these phases.
Plotting diagnostics for the multiple Linear Regression Model:
The below diagnostics plots will help check the assumption of linearity and homoscedasticity. If the points in this plot are randomly dispersed around the horizontal axis, then these assumptions are met.
par(mfrow = c(2, 2))
plot(lm_multiple)
vif(lm_multiple)
## GVIF Df GVIF^(1/(2*Df))
## Aircraft_Category 1.261325 4 1.029446
## Flight_Phase 1.117560 11 1.005065
## Number_of_Engines 1.167533 1 1.080524
For the diagnostic plots above, I can observe the following which will make determine whether the assumptions underlying the model are truly met or not:
Interpretation of the VIF Results:
1, indicating that multicollinearity is
likely not a concern for this model.Aircraft_Category
and Flight_Phase are low, showing that
they don’t have multicollinearity issues.Number_of_Engines has a slightly
higher VIF but still well below the threshold of concern.From the observation above, there seem to be issues with both the linearity and homoscedasticity assumptions of the linear regression model, because there are potential outliers that could be disproportionately affecting the model. On the other hand, multicollinearity does not seem to be an issue in the model based on VIF results.
To arrive at the model binary outcomes, such as incident occurrence,
and using logistic regression, I will expand and compare different
models using AIC, BIC, and deviance as criteria. To start with, I will
define a binary outcome variable based on whether there were any a fatal
incident and then build logistic regression models to predict the
likelihood of a fatal incident. I will be using the
Number_of_Engines and also the
Aircraft_Category which can serve as a
good categorical predictor.
To build the logistic models, the binary outcome
Fatal_Incident is defined as
1 if there was at least one fatality and
0 otherwise. The initial logistic model glm_fatalities
uses Number_of_Engines and
Aircraft_Category to predict the
probability of a fatal incident.
aviation_accident$Fatal_Incident <- ifelse(aviation_accident$Total_Fatal_Injuries > 0, 1, 0)
# Building the logistic regression model
glm_fatalities <- glm(Fatal_Incident ~ Number_of_Engines + Aircraft_Category, family = binomial, data = aviation_accident)
summary(glm_fatalities)
##
## Call:
## glm(formula = Fatal_Incident ~ Number_of_Engines + Aircraft_Category,
## family = binomial, data = aviation_accident)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.858104 0.196421 -9.460 <2e-16 ***
## Number_of_Engines 0.297386 0.146581 2.029 0.0425 *
## Aircraft_CategoryBalloon -13.707964 460.237152 -0.030 0.9762
## Aircraft_CategoryGlider 0.795273 0.551828 1.441 0.1495
## Aircraft_CategoryGyrocraft 1.560718 1.416606 1.102 0.2706
## Aircraft_CategoryHelicopter 0.003891 0.253175 0.015 0.9877
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1295.6 on 1361 degrees of freedom
## Residual deviance: 1285.9 on 1356 degrees of freedom
## (32 observations deleted due to missingness)
## AIC: 1297.9
##
## Number of Fisher Scoring iterations: 14
# Expanding model and comparing using AIC and BIC
glm_expanded <- update(glm_fatalities, . ~ . + Flight_Phase)
AIC(glm_fatalities, glm_expanded)
## df AIC
## glm_fatalities 6 1297.903
## glm_expanded 17 1075.510
BIC(glm_fatalities, glm_expanded)
## df BIC
## glm_fatalities 6 1329.203
## glm_expanded 17 1164.194
The output presents the results of fitting a logistic regression
model to predict a binary outcome
Fatal_Incident based on the number of
engines on an aircraft and the aircraft’s category.
Logistic Regression Model Summary:
Aircraft_Category, the
log odds of a fatal incident are low.(p = 0.0425),
indicating a relationship between the number of engines and the
occurrence of fatal incidents.Aircraft_Category have varying effects,
but none are statistically significant, with large standard errors.Model Comparison (AIC and BIC):
(1075.510) compared to the
initial model (1297.903),
suggesting that including additional predictors significantly improves
the model.(1164.194) compared to the
initial model
(1329.203).General Interpretation of the GLM:
Number_of_Engines suggests that aircraft
with more engines are associated with a higher likelihood of fatal
incidents in the data.I will now analyze time series, using variables such as the frequency of crash and how often does a particular aircraft type crashes.
To perform a time series analysis on this dataset, I will first need to make sure that the data is in the correct format. Since I want to aggregate the data quarterly, I will convert the Event_Date into Date format and extract the quarters from it:
aviation_accident$Date <- as.Date(aviation_accident$Event_Date, format = "%d/%m/%Y")
# Aggregate data by quarter
quarterly_incidents <- aviation_accident %>%
count(Quarter = floor_date(Date, "quarter"))
ts_data <- ts(quarterly_incidents$n, start = c(year(min(quarterly_incidents$Quarter)), quarter(min(quarterly_incidents$Quarter))), frequency = 4)
auto_arima_model <- auto.arima(ts_data)
summary(auto_arima_model)
## Series: ts_data
## ARIMA(0,1,0)
##
## sigma^2 = 74018: log likelihood = -49.17
## AIC=100.35 AICc=101.15 BIC=100.3
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 83.37512 254.4914 96.37512 5.290773 32.19946 0.2781389 -0.2251853
Acf(ts_data)
Pacf(ts_data)
The output and the provided ACF and PACF plots correspond to the time
series analysis of the ts_data using an
ARIMA model, and here is what I can refer from these three output:
ARIMA Model Summary:
74018, which seems quite
large, indicating that there is considerable volatility in the time
series data.-49.17, which measures how well the model fits the data; in
isolation, it doesn’t convey much, but it is used to calculate AIC and
BIC.83.37512, suggesting a positive bias in
predictions, and Root Mean Square Error at
254.4914, which is relatively high,
indicating that the model’s predictions can be off by quite a bit on
average.ACF and PACF Plots:
For the last part of my analysis, I will examine how sampling variability impacts conclusions using Monte Carlo simulations to understand the impact of sampling on inferential statistics.
To examine the impact of sampling variability on conclusions about total fatalities using the Monte Carlo simulations, I can do it this way:
aviation_accident$Date <- as.Date(aviation_accident$Event_Date, format = "%d/%m/%Y")
annual_fatalities <- aviation_accident %>%
group_by(Year = format(Date, "%Y")) %>%
summarise(Total_Fatal_Injuries = sum(Total_Fatal_Injuries, na.rm = TRUE))
set.seed(123)
simulations <- replicate(100, {
sample_data <- sample_n(aviation_accident, size = nrow(aviation_accident) / 2, replace = FALSE)
mean(sample_data$Total_Fatal_Injuries, na.rm = TRUE)
})
hist(simulations, breaks = 30, main = "Distribution of Sample Means from Monte Carlo Simulations")
summary(simulations)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3367 0.4009 0.4444 0.4552 0.5107 0.5754
The histogram and the summary statistics illustrate the distribution of sample means of total fatal injuries calculated from 100 Monte Carlo simulations, which I will explain further below:
Distribution Shape: The histogram displays a roughly symmetric distribution of sample means centered around 0.45. The roughly bell-shaped distribution suggests that the sample means vary around the true mean in a way that’s consistent with the Central Limit Theorem.
Central Tendency:
0.4444, which is a measure of the central
tendency that’s robust to outliers and skewness.0.4552, which is close to the median,
indicating that the distribution of the sample means might be
symmetric.Variability:
0.3367 to
0.5754. This range shows the extent of
variability in the sample means due to random sampling.(0.4009) to the third quartile
(0.5107), contains the middle
50% of the sample means. This spread
indicates the concentration of the distribution around the median.Comparisons from the Various Outputs:
So, it is safe to say that the results provide a practical demonstration of the concept of sampling variability and underline the importance of considering confidence intervals when making inferences based on sample data.
By leveraging the insights, observations, and recommendations from several statistical analyses and recognizing the inherent uncertainties, the aviation industry can continue to improve its safety protocols, thereby reducing the risk of accidents and enhancing passenger safety.
The National Transportation Safety Board is an independent federal agency charged with investigating civil aviation accident in the United States and significant events in the other modes of transportation.
The dataset was spooled from the official website on https://www.ntsb.gov/_layouts/ntsb.aviation/index.aspx and converted to CSV format which was then used for this comprehensive analysis.
04 / 21 / 2024
THANK YOU!!!.