Project Overview:

Driven by my passion for aviation and background in Aeronautical Engineering with a minor in Sheet Metal Fabrication, I’ve always been captivated by the capabilities of our amazing flying machines from a tender age; from the tiny Bumble Bee I to the gigantic Antonov An-225 (Mriya). This fascination, particularly with the aerodynamics, avionics, and maneuvers, is why I chose the National Transportation Safety Board (NTSB) aviation accident dataset for my final project course.

Dataset Source:

The NTSB aviation accident database is a comprehensive resource containing information on civil aviation accidents and selected incidents within the United States, its territories and possessions, and those that occurred on international waters. Records date back to 1962 and are continually updated, which gives me the assurance that the dataset is clean, original, well-maintained, and well documented.

The NTSB typically releases a preliminary report online within a few days of an accident and as the investigation progresses and more information is gathered, the preliminary report is replaced with a final report detailing the accident and its probable cause. The dataset is comprehensive enough and it is rich in both numerical and categorical data, offering a broad spectrum of analytic opportunities.

Purpose of the Project:

Apart from been fascinated by the technical part of the aircraft, it is also very important to care about its safety, right? Yes, I am also more concerned to know what went wrong? How to learn from what went wrong?, and how to improve the quality and safety of traveling by airplanes.

This is quite a comprehensive analysis that went over and beyond in unraveling a bunch of insights, I will be reflecting on all the topics that have been covered in this course from Week 1 to this moment, and I ensure that every step is carefully documented and explained properly for easy understanding.

These are the purpose of selecting this dataset and conducting an extensive analysis to give the right answers to my WHAT, HOW, WHY and WHAT NEXT?

Analysis Procedures:

Next is loading the dataset for this project, as well as the necessary libraries for the analysis.

suppressMessages({
  suppressWarnings({
    library(tidyverse)
    library(ggthemes)
    library(dplyr)
    library(lubridate)
    library(tseries)
    library(ggrepel)
    library(readr)
    library(patchwork)
    library(forecast)
    library(broom)
    library(lindia)
    library(car)
    library(caret)
    library(tibble)
    library(MASS)
    
    aviation_accident <- read_csv("global_aviation_accident.csv")  
   })
})

Correlations and interactions:

The objective here is to analyze trends, correlations, and interactions between key variables.

summary_statistics <- summary(dplyr::select(aviation_accident, Total_Fatal_Injuries, Total_Serious_Injuries, Total_Minor_Injuries, Total_Uninjured))
print(summary_statistics)
##  Total_Fatal_Injuries Total_Serious_Injuries Total_Minor_Injuries
##  Min.   : 0.0000      Min.   : 0.0000        Min.   : 0.0000     
##  1st Qu.: 0.0000      1st Qu.: 0.0000        1st Qu.: 0.0000     
##  Median : 0.0000      Median : 0.0000        Median : 0.0000     
##  Mean   : 0.4551      Mean   : 0.1808        Mean   : 0.2925     
##  3rd Qu.: 0.0000      3rd Qu.: 0.0000        3rd Qu.: 0.0000     
##  Max.   :78.0000      Max.   :10.0000        Max.   :25.0000     
##  NA's   :14           NA's   :17             NA's   :16          
##  Total_Uninjured 
##  Min.   :  0.00  
##  1st Qu.:  0.00  
##  Median :  1.00  
##  Mean   :  2.54  
##  3rd Qu.:  2.00  
##  Max.   :182.00  
##  NA's   :3

We can see the distribution of various injury severities and uninjured counts and I can deduce the following:

  1. Most Common Outcomes: The majority of the cases involve no injuries, as indicated by the medians and first quartiles for fatal, serious, and minor injuries all being zero. This suggests that most incidents are non-fatal and non-serious.

  2. Frequency of Uninjured Individuals: Most incidents also seem to have at least one uninjured individual, shown by a median of 1 uninjured person per incident.

  3. Variability in Severity: The maximum values suggest that while most incidents are minor or result in no injuries, there are extreme cases with high numbers of fatal (up to 78 fatalities), serious (up to 10), minor (up to 25), and uninjured (up to 182) individuals.

  4. Data Completeness: There are missing data points across all categories, indicating some gaps in the dataset that could affect the robustness of analyses derived from it.

ggplot(aviation_accident %>% filter(!is.na(Total_Uninjured)), aes(x = Total_Uninjured)) +
  geom_histogram(bins = 30, fill = "blue", color = "black") +
  labs(title = "Histogram of Total Uninjured", x = "Total Uninjured", y = "Frequency")

Th histogram gives me a glimpse on the distribution of the Total Uninjured count, it is heavily skewed to the left, indicating that the majority of the cases have zero or a very low count of uninjured individuals. This suggests that in most of the incidents, there are few or no uninjured people, aligning with what could be inferred from my previous statistics.

aviation_accident %>% 
  filter(!is.na(Total_Fatal_Injuries) & !is.na(Total_Serious_Injuries)) %>%
  ggplot(aes(x = Total_Fatal_Injuries, y = Total_Serious_Injuries)) +
  geom_point(alpha = 0.5) +
  labs(title = "Scatter Plot of Total Serious Injuries vs. Total Fatal Injuries", x = "Total Fatal Injuries", y = "Total Serious Injuries")

Here, I compared Total Serious Injuries with the Total Fatal Injuries for each incident, and most of the data points are clustered near the origin, which suggests that most incidents result in few or no serious or fatal injuries. There are outliers with higher numbers of fatal injuries and, correspondingly, serious injuries, but these are exceptional cases. There is a general trend that incidents with more fatal injuries also tend to have more serious injuries, which is expected in severe accidents.

Probability of Incidents:

Moving on with the analysis, I will compute the probability of incidents by aircraft manufacturers and try to detect anomalies in the number of fatalities.

prob_incidents <- aviation_accident %>%
  group_by(Aircraft_Manufacturer) %>%
  summarise(Incident_Count = n(),
            Total_Incidents = nrow(aviation_accident),
            Probability = Incident_Count / Total_Incidents)
print(prob_incidents)
## # A tibble: 150 × 4
##    Aircraft_Manufacturer Incident_Count Total_Incidents Probability
##    <chr>                          <int>           <int>       <dbl>
##  1 Aero Commander                     3            1394    0.00215 
##  2 Aeronca                            8            1394    0.00574 
##  3 Aerospatiale                       7            1394    0.00502 
##  4 Aerotek-pitts                      1            1394    0.000717
##  5 Air Tractor                        6            1394    0.00430 
##  6 Airmate                            1            1394    0.000717
##  7 American Yankee                    1            1394    0.000717
##  8 Arctic                             1            1394    0.000717
##  9 Avian Balloon                      1            1394    0.000717
## 10 Avres Corp.                        1            1394    0.000717
## # ℹ 140 more rows
fatal_anomalies <- aviation_accident %>%
  filter(Total_Fatal_Injuries > quantile(Total_Fatal_Injuries, 0.95, na.rm = TRUE)) %>%
  arrange(desc(Total_Fatal_Injuries))
print(fatal_anomalies)
## # A tibble: 43 × 16
##    Investigation_Type Event_Date Location            Country     Injury_Severity
##    <chr>              <chr>      <chr>               <chr>       <chr>          
##  1 Accident           13/01/1982 WASHINGTON, DC      United Sta… Fatal(78)      
##  2 Accident           03/01/1982 ASHLAND, VA         United Sta… Fatal(8)       
##  3 Accident           07/02/1982 W. OF HOMESTEAD, FL United Sta… Fatal(8)       
##  4 Accident           07/02/1982 W. OF HOMESTEAD, FL United Sta… Fatal(8)       
##  5 Accident           16/02/1982 SPRINGFIELD, KY     United Sta… Fatal(8)       
##  6 Accident           27/05/1982 CHANDLER, AZ        United Sta… Fatal(8)       
##  7 Accident           24/01/1982 LAREDO, TX          United Sta… Fatal(7)       
##  8 Accident           05/05/1982 CHARLOTTE, TX       United Sta… Fatal(7)       
##  9 Accident           28/02/1982 VERDI, NV           United Sta… Fatal(6)       
## 10 Accident           13/03/1982 GLENDALE, AZ        United Sta… Fatal(6)       
## # ℹ 33 more rows
## # ℹ 11 more variables: Aircraft_Damage <chr>, Aircraft_Category <chr>,
## #   Aircraft_Manufacturer <chr>, Number_of_Engines <dbl>, Engine_Type <chr>,
## #   Flight_Purpose <chr>, Total_Fatal_Injuries <dbl>,
## #   Total_Serious_Injuries <dbl>, Total_Minor_Injuries <dbl>,
## #   Total_Uninjured <dbl>, Flight_Phase <chr>

My interpretation of the coefficients is explained below:

The catastrophic events involving high fatality counts are anomalies within the dataset, while the overall probability of an incident varies by manufacturer but remains a relatively low percentage.

The Distribution Characteristics:

I will examine the distribution characteristics of key metrics by sampling data and creating theoretical and empirical distributions. To do this, I can start by sampling the Total_Uninjured data and then calculate the mean of these samples multiple times to create a distribution of sample means.

According to the CLT, this distribution of sample means should be approximately normally distributed, regardless of the shape of the original data distribution, provided the sample size is sufficiently large (commonly n > 30) and the number of samples is large.

What I aim to achieve here is to generate 1000 sample means from the Total_Uninjured column, each based on a sample of 100 observations (with replacement of course), and then plot a histogram of these sample means.

sample_means <- replicate(1000, mean(sample(aviation_accident$Total_Uninjured, 100, replace = TRUE)))
hist(sample_means, breaks = 30, main = "Sampling Distribution of Total Uninjured", xlab = "Sample Means", col = "green")

This provides us with the sampling distribution of the mean number of uninjured individuals per accident, based on the 1000 samples. The key observations from the histogram are:

Confidence Intervals and Correlations:

I will proceed to building confidence intervals and conduct correlation analyses between some variables, but with the limited variables in this dataset, I will stick with using Number_of_Engines and Total_Fatal_Injuries , since they both are count variables, and it’s reasonable to look into whether there’s an association between them. I will generate bootstrap confidence intervals for the mean number of engines, Chi-Squared approximation test with simulation, and use a Fisher’s Exact Test for count data:

bootstrap_means <- replicate(1000, {
  sample_indices <- sample(1:nrow(aviation_accident), size = 100, replace = TRUE)
  mean_engines <- mean(aviation_accident$Number_of_Engines[sample_indices], na.rm = TRUE)
  return(mean_engines)
})

ci_number_of_engines <- quantile(bootstrap_means, probs = c(0.025, 0.975))

table_total_fatalities_engines <- table(aviation_accident$Total_Fatal_Injuries, aviation_accident$Number_of_Engines)

chisq_test_result <- chisq.test(table_total_fatalities_engines, simulate.p.value = TRUE, B = 2000)

list(
  Confidence_Interval = ci_number_of_engines,
  Chi_Squared_Test_Result = chisq_test_result
)
## $Confidence_Interval
##  2.5% 97.5% 
##  1.08  1.26 
## 
## $Chi_Squared_Test_Result
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  table_total_fatalities_engines
## X-squared = 83.479, df = NA, p-value = 0.06047

There are a lot of data to analyse here but the key points are:

Lest I forget, I need to emphasize that while there is a relationship between the number of engines and the number of fatalities in aviation accidents, the evidence is not strong enough to be statistically conclusive at the traditional 5% level. In the aviation world statistics, statistics indicate that crashes involving multi-engine commercial aircraft have a higher survival rate compared to single-engine aircraft.

Now, Hypothesis Testing:

To test my hypotheses, I will use Neyman-Pearson and Fisher’s significance testing frameworks to conduct these tests below:

aviation_accident$Year <- format(as.Date(aviation_accident$Event_Date, format = "%d/%m/%Y"), "%Y")

# Hypothesis 1: (Average total fatal injuries increased over time)
lm_results <- lm(Total_Fatal_Injuries ~ Year, data = aviation_accident)
summary(lm_results)
## 
## Call:
## lm(formula = Total_Fatal_Injuries ~ Year, data = aviation_accident)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.445 -0.445 -0.445 -0.445 77.555 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.000e+00  2.342e+00   0.854    0.393
## Year1962     2.000e+00  3.312e+00   0.604    0.546
## Year1974     1.000e+00  3.312e+00   0.302    0.763
## Year1977    -8.995e-13  3.312e+00   0.000    1.000
## Year1979    -1.000e+00  3.312e+00  -0.302    0.763
## Year1981     2.000e+00  3.312e+00   0.604    0.546
## Year1982    -1.555e+00  2.343e+00  -0.664    0.507
## 
## Residual standard error: 2.342 on 1373 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.004864,   Adjusted R-squared:  0.0005148 
## F-statistic: 1.118 on 6 and 1373 DF,  p-value: 0.3491
# Hypothesis 2: (No difference in total fatal injuries between Aircraft Categories)
kruskal.test(Total_Fatal_Injuries ~ Aircraft_Category, data = aviation_accident)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Total_Fatal_Injuries by Aircraft_Category
## Kruskal-Wallis chi-squared = 3.6093, df = 4, p-value = 0.4615

To visualize what I explained above, I will create a scatter plot for the linear model with a regression line, and also a boxplot for the fatal injuries by aircraft category to show similar medians with overlapping interquartile range:

Scatter Plot with Regression Line for Total Fatal Injuries Over Time:

aviation_accident_clean <- aviation_accident %>%
  filter(!is.na(Total_Fatal_Injuries), !is.na(Year), is.finite(Total_Fatal_Injuries))

# Scatter plot with regression line
ggplot(aviation_accident_clean, aes(x = as.numeric(Year), y = Total_Fatal_Injuries)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x, color = "blue") + 
  theme_minimal() +
  labs(title = "Trend of Total Fatal Injuries Over Time",
       x = "Year",
       y = "Total Fatal Injuries")

My interpretation of the scatter plot goes thus:

Boxplots for Total Fatal Injuries by Aircraft Category:

aviation_accident_clean <- aviation_accident %>%
  filter(!is.na(Total_Fatal_Injuries), !is.na(Aircraft_Category), is.finite(Total_Fatal_Injuries))

ggplot(aviation_accident_clean, aes(x = Aircraft_Category, y = Total_Fatal_Injuries)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Distribution of Total Fatal Injuries by Aircraft Category",
       x = "Aircraft Category",
       y = "Total Fatal Injuries") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

The boxplot visualization is quite straightforward and we can clearly see the following:

Regression, Regression, Regression, but Linear:

This will require me to investigate how categorical variables, by performing ANOVA, extending the linear regression model by incorporating multiple variables and checking for multicollinearity.

For example, how categorical variables such as Aircraft_Category influence continuous variables like Total_Fatal_Injuries. Upon doing this, I will extend the linear regression model by incorporating multiple variables and checking for multicollinearity by include variables like Number_of_Engines or and Flight_Phase to ensure that I get to the root of my analysis.

So, to start, I will encode categorical variables using dummy variables and then start with a simple linear model and extend it with multiple variables:

aviation_accident$Aircraft_Category <- as.factor(aviation_accident$Aircraft_Category)
aviation_accident$Flight_Phase <- as.factor(aviation_accident$Flight_Phase)

# Simple linear regression with a categorical variable
lm_single <- lm(Total_Fatal_Injuries ~ Aircraft_Category, data = aviation_accident)
summary(lm_single)
## 
## Call:
## lm(formula = Total_Fatal_Injuries ~ Aircraft_Category, data = aviation_accident)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5000 -0.4018 -0.4018 -0.3193  7.5982 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  0.40182    0.03027  13.273   <2e-16 ***
## Aircraft_CategoryBalloon    -0.40182    0.33466  -1.201    0.230    
## Aircraft_CategoryGlider     -0.13866    0.24368  -0.569    0.569    
## Aircraft_CategoryGyrocraft   0.09818    0.74587   0.132    0.895    
## Aircraft_CategoryHelicopter -0.08249    0.10125  -0.815    0.415    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.054 on 1357 degrees of freedom
##   (32 observations deleted due to missingness)
## Multiple R-squared:  0.001731,   Adjusted R-squared:  -0.001211 
## F-statistic: 0.5884 on 4 and 1357 DF,  p-value: 0.6711
# Multiple linear regression with additional categorical variables
lm_multiple <- lm(Total_Fatal_Injuries ~ Aircraft_Category + Flight_Phase + Number_of_Engines, data = aviation_accident)
summary(lm_multiple)
## 
## Call:
## lm(formula = Total_Fatal_Injuries ~ Aircraft_Category + Flight_Phase + 
##     Number_of_Engines, data = aviation_accident)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8499 -0.4533 -0.2287  0.0069  7.5103 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  0.19222    0.11451   1.679 0.093455 .  
## Aircraft_CategoryBalloon     0.10962    0.32402   0.338 0.735181    
## Aircraft_CategoryGlider      0.11199    0.24049   0.466 0.641519    
## Aircraft_CategoryGyrocraft   0.06888    0.70188   0.098 0.921841    
## Aircraft_CategoryHelicopter -0.20196    0.09913  -2.037 0.041805 *  
## Flight_PhaseClimb            0.38682    0.17852   2.167 0.030420 *  
## Flight_PhaseCruise           0.41586    0.10343   4.021 6.12e-05 ***
## Flight_PhaseDescent         -0.17104    0.17316  -0.988 0.323448    
## Flight_PhaseGo-around        0.56121    0.26848   2.090 0.036779 *  
## Flight_PhaseLanding         -0.46011    0.09542  -4.822 1.58e-06 ***
## Flight_PhaseManeuvering      0.16013    0.12226   1.310 0.190491    
## Flight_PhaseOther            0.18714    0.45097   0.415 0.678224    
## Flight_PhaseStanding        -0.36095    0.27087  -1.333 0.182913    
## Flight_PhaseTakeoff         -0.22454    0.10253  -2.190 0.028695 *  
## Flight_PhaseTaxi            -0.47692    0.14583  -3.270 0.001101 ** 
## Flight_PhaseUnknown          1.13565    0.30967   3.667 0.000255 ***
## Number_of_Engines            0.26103    0.06217   4.199 2.86e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9907 on 1345 degrees of freedom
##   (32 observations deleted due to missingness)
## Multiple R-squared:  0.1257, Adjusted R-squared:  0.1153 
## F-statistic: 12.09 on 16 and 1345 DF,  p-value: < 2.2e-16

Simple Linear Regression Model (Total Fatal Injuries ~ Aircraft Category)

  • Model Interpretation: The model aimed to predict total fatal injuries based on aircraft category.
  • Coefficients: Most categories do not have a statistically significant effect on total fatal injuries compared to the baseline category (Intercept), except none have p-values < 0.05, implying no significant impact on fatal injuries by changing aircraft categories.
  • Model Fit: With an extremely low R-squared value (0.001731), the model explains virtually none of the variance in total fatal injuries, and the R-squared being negative indicates that the model is unsuitable for making predictions about total fatal injuries based on aircraft category alone.
  • F-statistic and p-value: The overall model is not statistically significant (p-value = 0.6711), reinforcing that aircraft category alone is not a good predictor of total fatal injuries.

Multiple Linear Regression Model (Total Fatal Injuries ~ Aircraft Category + Flight Phase + Number of Engines)

  • Model Interpretation: This extended model incorporates aircraft category, flight phase, and number of engines.
  • Significant Predictors:
    • Flight Phase: Certain phases like Cruise (p < 0.0001) and Landing (p < 1.58e-06) show significant effects, indicating that these phases may be associated with differing levels of fatal injuries.
    • Number of Engines: The positive and significant effect (p < 0.0001), suggest that the number of engines is a significant predictor of fatal injuries, potentially due to the complexity or type of aircraft operations involved.
  • Model Fit: Improved significantly from the simple model, with an R-squared of 0.1257, indicates that about 12.57% of the variance in total fatal injuries can be explained by the model, although still not a high value, but much better compared to the simple model.
  • F-statistic and p-value: The model is statistically significant (p < 2.2e-16), suggesting that the predictors collectively influence total fatal injuries significantly.

To check for multicollinearity which is quite crucial when using multiple predictors in a linear regression, I will use the Variance Inflation Factor to dig deeper into this analysis:

vif(lm_multiple)
##                       GVIF Df GVIF^(1/(2*Df))
## Aircraft_Category 1.261325  4        1.029446
## Flight_Phase      1.117560 11        1.005065
## Number_of_Engines 1.167533  1        1.080524

The Variance Inflation Factor helps us to identify the presence of multicollinearity in regression analysis, I can deduce the following from the above co-coefficients:

  • Aircraft_Category:
    • The Generalized VIF is slightly above 1, which is expected since it is a categorical variable with multiple levels (degrees of freedom, Df = 4).
    • The scaled VIF (GVIF^(1/(2*Df))) is close to 1, at 1.029446, suggesting that multicollinearity is not a significant issue for this predictor.
  • Flight_Phase:
    • This is a low multicollinearity because the GVIF is close to 1.
    • Despite having many levels (Df = 11), the scaled VIF is very close to 1 (1.005065), indicating that it’s unlikely this predictor is causing multicollinearity concerns.
  • Number_of_Engines:
    • With a GVIF slightly above 1 and a scaled VIF of 1.080524, it indicates some multicollinearity but not enough to be generally concerning. A common rule of thumb is that a VIF above 5 or 10 indicates a problematic amount of multicollinearity, and this value is well below those thresholds.

Visualizing the data results:

I will be using the Violin Plot to show each flight phase, grouped by the number of engines, which will also show the distributions of total fatal injuries.

These visualizations will help us in understanding the dynamics of how flight phase and aircraft technical characteristics interact to affect safety outcomes as measured by total fatal injuries.

aviation_accident_clean <- aviation_accident %>%
  filter(is.finite(Total_Fatal_Injuries)) %>%
  group_by(Flight_Phase, Number_of_Engines) %>%
  filter(n() > 1) %>% 
  ungroup()

ggplot(aviation_accident_clean, aes(x = Flight_Phase, y = Total_Fatal_Injuries, fill = as.factor(Number_of_Engines))) +
  geom_violin() +
  labs(title = "Distribution of Total Fatal Injuries by Flight Phase and Number of Engines",
       x = "Flight Phase",
       y = "Total Fatal Injuries") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

The visuals from the Violin Plot for the total fatal injuries by flight phase and numbers can be summarized as below:

  • Each violin represents the kernel density estimation of the distribution; but, it appears that the violins are not visible, possibly because of a very narrow distribution.
  • The points are clustered near the bottom of the plot, suggesting that the number of fatal injuries is typically low across all flight phases and engine numbers.
  • There is no clear pattern that shows a significant variance in fatal injuries with the number of engines or flight phase, as most of the data points are zero, which implies low fatal injury counts.
aviation_means <- aviation_accident %>%
  group_by(Flight_Phase, Number_of_Engines) %>%
  summarise(Average_Fatal_Injuries = mean(Total_Fatal_Injuries, na.rm = TRUE), .groups = 'drop')

ggplot(aviation_means, aes(x = Flight_Phase, y = Average_Fatal_Injuries, fill = as.factor(Number_of_Engines))) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Average Total Fatal Injuries by Flight Phase and Number of Engines",
       x = "Flight Phase",
       y = "Average Total Fatal Injuries") +
  scale_fill_brewer(palette = "Pastel1") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

The bars represent the average total fatal injuries for each flight phase, with different colors representing different numbers of engines. Some flight phases, such as Cruise and Unknown show notably higher average fatalities, especially for aircraft with a higher number of engines.

The variance in average fatal injuries across different flight phases suggests that some phases may inherently carry different levels of risk. There is an observable trend where certain phases like Landing and Takeoff have a spread of averages across the different engine numbers, which might indicate an influence of engine count on injury outcomes during these phases.

Plotting diagnostics for the multiple Linear Regression Model:

The below diagnostics plots will help check the assumption of linearity and homoscedasticity. If the points in this plot are randomly dispersed around the horizontal axis, then these assumptions are met.

par(mfrow = c(2, 2)) 
plot(lm_multiple) 

vif(lm_multiple)
##                       GVIF Df GVIF^(1/(2*Df))
## Aircraft_Category 1.261325  4        1.029446
## Flight_Phase      1.117560 11        1.005065
## Number_of_Engines 1.167533  1        1.080524

For the diagnostic plots above, I can observe the following which will make determine whether the assumptions underlying the model are truly met or not:

  • Residuals vs Fitted Plot: The red line is flat, this indicates non-linearity,the spread of residuals is roughly constant across fitted values for homoscedasticity.
  • Normal Q-Q Plot: Points following the dashed line suggests normality, but any severe deviations from the line, especially at the tails, indicate departures from normality.
  • Scale-Location Plot: The horizontal line with randomly spread points indicates homoscedasticity, and funnel shape suggests non-constant variance, i.e. heteroscedasticity.
  • Residuals vs Leverage Plot: The points outside the Cook’s distance lines are considered influential, and the presence of such points suggests that those observations have a large influence on the model’s predictions.

Interpretation of the VIF Results:

  • The VIF results are good with all values near 1, indicating that multicollinearity is likely not a concern for this model.
  • The scaled VIF for Aircraft_Category and Flight_Phase are low, showing that they don’t have multicollinearity issues.
  • Number_of_Engines has a slightly higher VIF but still well below the threshold of concern.

From the observation above, there seem to be issues with both the linearity and homoscedasticity assumptions of the linear regression model, because there are potential outliers that could be disproportionately affecting the model. On the other hand, multicollinearity does not seem to be an issue in the model based on VIF results.

Generalized Linear Models:

To arrive at the model binary outcomes, such as incident occurrence, and using logistic regression, I will expand and compare different models using AIC, BIC, and deviance as criteria. To start with, I will define a binary outcome variable based on whether there were any a fatal incident and then build logistic regression models to predict the likelihood of a fatal incident. I will be using the Number_of_Engines and also the Aircraft_Category which can serve as a good categorical predictor.

To build the logistic models, the binary outcome Fatal_Incident is defined as 1 if there was at least one fatality and 0 otherwise. The initial logistic model glm_fatalities uses Number_of_Engines and Aircraft_Category to predict the probability of a fatal incident.

aviation_accident$Fatal_Incident <- ifelse(aviation_accident$Total_Fatal_Injuries > 0, 1, 0)

# Building the logistic regression model
glm_fatalities <- glm(Fatal_Incident ~ Number_of_Engines + Aircraft_Category, family = binomial, data = aviation_accident)
summary(glm_fatalities)
## 
## Call:
## glm(formula = Fatal_Incident ~ Number_of_Engines + Aircraft_Category, 
##     family = binomial, data = aviation_accident)
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -1.858104   0.196421  -9.460   <2e-16 ***
## Number_of_Engines             0.297386   0.146581   2.029   0.0425 *  
## Aircraft_CategoryBalloon    -13.707964 460.237152  -0.030   0.9762    
## Aircraft_CategoryGlider       0.795273   0.551828   1.441   0.1495    
## Aircraft_CategoryGyrocraft    1.560718   1.416606   1.102   0.2706    
## Aircraft_CategoryHelicopter   0.003891   0.253175   0.015   0.9877    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1295.6  on 1361  degrees of freedom
## Residual deviance: 1285.9  on 1356  degrees of freedom
##   (32 observations deleted due to missingness)
## AIC: 1297.9
## 
## Number of Fisher Scoring iterations: 14
# Expanding model and comparing using AIC and BIC
glm_expanded <- update(glm_fatalities, . ~ . + Flight_Phase)
AIC(glm_fatalities, glm_expanded)
##                df      AIC
## glm_fatalities  6 1297.903
## glm_expanded   17 1075.510
BIC(glm_fatalities, glm_expanded)
##                df      BIC
## glm_fatalities  6 1329.203
## glm_expanded   17 1164.194

The output presents the results of fitting a logistic regression model to predict a binary outcome Fatal_Incident based on the number of engines on an aircraft and the aircraft’s category.

Logistic Regression Model Summary:

Model Comparison (AIC and BIC):

General Interpretation of the GLM:

Time Series Modeling:

I will now analyze time series, using variables such as the frequency of crash and how often does a particular aircraft type crashes.

To perform a time series analysis on this dataset, I will first need to make sure that the data is in the correct format. Since I want to aggregate the data quarterly, I will convert the Event_Date into Date format and extract the quarters from it:

aviation_accident$Date <- as.Date(aviation_accident$Event_Date, format = "%d/%m/%Y")

# Aggregate data by quarter
quarterly_incidents <- aviation_accident %>%
  count(Quarter = floor_date(Date, "quarter"))

ts_data <- ts(quarterly_incidents$n, start = c(year(min(quarterly_incidents$Quarter)), quarter(min(quarterly_incidents$Quarter))), frequency = 4)

auto_arima_model <- auto.arima(ts_data)
summary(auto_arima_model)
## Series: ts_data 
## ARIMA(0,1,0) 
## 
## sigma^2 = 74018:  log likelihood = -49.17
## AIC=100.35   AICc=101.15   BIC=100.3
## 
## Training set error measures:
##                    ME     RMSE      MAE      MPE     MAPE      MASE       ACF1
## Training set 83.37512 254.4914 96.37512 5.290773 32.19946 0.2781389 -0.2251853
Acf(ts_data)

Pacf(ts_data)

The output and the provided ACF and PACF plots correspond to the time series analysis of the ts_data using an ARIMA model, and here is what I can refer from these three output:

ARIMA Model Summary:

ACF and PACF Plots:

Sampling and Drawing Conclusions:

For the last part of my analysis, I will examine how sampling variability impacts conclusions using Monte Carlo simulations to understand the impact of sampling on inferential statistics.

To examine the impact of sampling variability on conclusions about total fatalities using the Monte Carlo simulations, I can do it this way:

aviation_accident$Date <- as.Date(aviation_accident$Event_Date, format = "%d/%m/%Y")

annual_fatalities <- aviation_accident %>%
  group_by(Year = format(Date, "%Y")) %>%
  summarise(Total_Fatal_Injuries = sum(Total_Fatal_Injuries, na.rm = TRUE))

set.seed(123) 
simulations <- replicate(100, {
  sample_data <- sample_n(aviation_accident, size = nrow(aviation_accident) / 2, replace = FALSE)
  mean(sample_data$Total_Fatal_Injuries, na.rm = TRUE)
})

hist(simulations, breaks = 30, main = "Distribution of Sample Means from Monte Carlo Simulations")

summary(simulations)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3367  0.4009  0.4444  0.4552  0.5107  0.5754

The histogram and the summary statistics illustrate the distribution of sample means of total fatal injuries calculated from 100 Monte Carlo simulations, which I will explain further below:

Comparisons from the Various Outputs:

So, it is safe to say that the results provide a practical demonstration of the concept of sampling variability and underline the importance of considering confidence intervals when making inferences based on sample data.

General Observations:

  1. Sampling Variability:
    • Monte Carlo simulations highlighted the variability in estimating the mean total fatal injuries, underscoring the natural variability present in sampled data.

Conclusions:

Recommendations:

By leveraging the insights, observations, and recommendations from several statistical analyses and recognizing the inherent uncertainties, the aviation industry can continue to improve its safety protocols, thereby reducing the risk of accidents and enhancing passenger safety.

Data Source:

The National Transportation Safety Board is an independent federal agency charged with investigating civil aviation accident in the United States and significant events in the other modes of transportation.

The dataset was spooled from the official website on https://www.ntsb.gov/_layouts/ntsb.aviation/index.aspx and converted to CSV format which was then used for this comprehensive analysis.

Oluwatosin Agbaakin - 2001292765

Final Project for Statistics for Data Science

04 / 21 / 2024

THANK YOU!!!.