Statistics for Data Science

Project Overview:

Driven by my passion for aviation and background in Aeronautical Engineering with a minor in Sheet Metal Fabrication, I’ve always been captivated by the capabilities of our amazing flying machines from a tender age; from the tiny Bumble Bee I to the gigantic Antonov An-225 (Mriya). This fascination, particularly with the aerodynamics, avionics, and maneuvers, is why I chose the National Transportation Safety Board (NTSB) aviation accident dataset for my final project course.

Dataset Source:

The NTSB aviation accident database is a comprehensive resource containing information on civil aviation accidents and selected incidents within the United States, its territories and possessions, and those that occurred on international waters. Records date back to 1962 and are continually updated, which gives me the assurance that the dataset is clean, original, well-maintained, and well documented.

The NTSB typically releases a preliminary report online within a few days of an accident and as the investigation progresses and more information is gathered, the preliminary report is replaced with a final report detailing the accident and its probable cause. The dataset is comprehensive enough and it is rich in both numerical and categorical data, offering a broad spectrum of analytic opportunities.

Purpose of the Project:

Apart from been fascinated by the technical part of the aircraft, it is also very important to care about its safety, right? Yes, I am also more concerned to know what went wrong? How to learn from what went wrong?, and how to improve the quality and safety of traveling by airplanes.

This is quite a comprehensive analysis that went over and beyond in unraveling a bunch of insights, I will be reflecting on all the topics that have been covered in this course from Week 1 to this moment, and I ensure that every step is carefully documented and explained properly for easy understanding.

These are the purpose of selecting this dataset and conducting an extensive analysis to give the right answers to my WHAT, HOW, WHY and WHAT NEXT?

Analysis Procedures:

Next is loading the dataset for this project, as well as the necessary libraries for the analysis.

suppressMessages({
  suppressWarnings({
    library(tidyverse)
    library(ggthemes)
    library(dplyr)
    library(lubridate)
    library(tseries)
    library(ggrepel)
    library(readr)
    library(patchwork)
    library(forecast)
    library(broom)
    library(lindia)
    library(car)
    library(caret)
    library(tibble)
    library(MASS)
    
    aviation_accident <- read_csv("global_aviation_accident.csv")  
   })
})

Correlations and interactions:

The objective here is to analyze trends, correlations, and interactions between key variables.

summary_statistics <- summary(dplyr::select(aviation_accident, Total_Fatal_Injuries, Total_Serious_Injuries, Total_Minor_Injuries, Total_Uninjured))
print(summary_statistics)

##  Total_Fatal_Injuries Total_Serious_Injuries Total_Minor_Injuries
##  Min.   : 0.0000      Min.   : 0.0000        Min.   : 0.0000     
##  1st Qu.: 0.0000      1st Qu.: 0.0000        1st Qu.: 0.0000     
##  Median : 0.0000      Median : 0.0000        Median : 0.0000     
##  Mean   : 0.4551      Mean   : 0.1808        Mean   : 0.2925     
##  3rd Qu.: 0.0000      3rd Qu.: 0.0000        3rd Qu.: 0.0000     
##  Max.   :78.0000      Max.   :10.0000        Max.   :25.0000     
##  NA's   :14           NA's   :17             NA's   :16          
##  Total_Uninjured 
##  Min.   :  0.00  
##  1st Qu.:  0.00  
##  Median :  1.00  
##  Mean   :  2.54  
##  3rd Qu.:  2.00  
##  Max.   :182.00  
##  NA's   :3

We can see the distribution of various injury severities and uninjured counts and I can deduce the following:

Most Common Outcomes: The majority of the cases involve no injuries, as indicated by the medians and first quartiles for fatal, serious, and minor injuries all being zero. This suggests that most incidents are non-fatal and non-serious.
Frequency of Uninjured Individuals: Most incidents also seem to have at least one uninjured individual, shown by a median of 1 uninjured person per incident.
Variability in Severity: The maximum values suggest that while most incidents are minor or result in no injuries, there are extreme cases with high numbers of fatal (up to 78 fatalities), serious (up to 10), minor (up to 25), and uninjured (up to 182) individuals.
Data Completeness: There are missing data points across all categories, indicating some gaps in the dataset that could affect the robustness of analyses derived from it.

ggplot(aviation_accident %>% filter(!is.na(Total_Uninjured)), aes(x = Total_Uninjured)) +
  geom_histogram(bins = 30, fill = "blue", color = "black") +
  labs(title = "Histogram of Total Uninjured", x = "Total Uninjured", y = "Frequency")

Th histogram gives me a glimpse on the distribution of the Total Uninjured count, it is heavily skewed to the left, indicating that the majority of the cases have zero or a very low count of uninjured individuals. This suggests that in most of the incidents, there are few or no uninjured people, aligning with what could be inferred from my previous statistics.

aviation_accident %>% 
  filter(!is.na(Total_Fatal_Injuries) & !is.na(Total_Serious_Injuries)) %>%
  ggplot(aes(x = Total_Fatal_Injuries, y = Total_Serious_Injuries)) +
  geom_point(alpha = 0.5) +
  labs(title = "Scatter Plot of Total Serious Injuries vs. Total Fatal Injuries", x = "Total Fatal Injuries", y = "Total Serious Injuries")

Here, I compared Total Serious Injuries with the Total Fatal Injuries for each incident, and most of the data points are clustered near the origin, which suggests that most incidents result in few or no serious or fatal injuries. There are outliers with higher numbers of fatal injuries and, correspondingly, serious injuries, but these are exceptional cases. There is a general trend that incidents with more fatal injuries also tend to have more serious injuries, which is expected in severe accidents.

Probability of Incidents:

Moving on with the analysis, I will compute the probability of incidents by aircraft manufacturers and try to detect anomalies in the number of fatalities.

prob_incidents <- aviation_accident %>%
  group_by(Aircraft_Manufacturer) %>%
  summarise(Incident_Count = n(),
            Total_Incidents = nrow(aviation_accident),
            Probability = Incident_Count / Total_Incidents)
print(prob_incidents)

## # A tibble: 150 × 4
##    Aircraft_Manufacturer Incident_Count Total_Incidents Probability
##    <chr>                          <int>           <int>       <dbl>
##  1 Aero Commander                     3            1394    0.00215 
##  2 Aeronca                            8            1394    0.00574 
##  3 Aerospatiale                       7            1394    0.00502 
##  4 Aerotek-pitts                      1            1394    0.000717
##  5 Air Tractor                        6            1394    0.00430 
##  6 Airmate                            1            1394    0.000717
##  7 American Yankee                    1            1394    0.000717
##  8 Arctic                             1            1394    0.000717
##  9 Avian Balloon                      1            1394    0.000717
## 10 Avres Corp.                        1            1394    0.000717
## # ℹ 140 more rows

fatal_anomalies <- aviation_accident %>%
  filter(Total_Fatal_Injuries > quantile(Total_Fatal_Injuries, 0.95, na.rm = TRUE)) %>%
  arrange(desc(Total_Fatal_Injuries))
print(fatal_anomalies)

## # A tibble: 43 × 16
##    Investigation_Type Event_Date Location            Country     Injury_Severity
##    <chr>              <chr>      <chr>               <chr>       <chr>          
##  1 Accident           13/01/1982 WASHINGTON, DC      United Sta… Fatal(78)      
##  2 Accident           03/01/1982 ASHLAND, VA         United Sta… Fatal(8)       
##  3 Accident           07/02/1982 W. OF HOMESTEAD, FL United Sta… Fatal(8)       
##  4 Accident           07/02/1982 W. OF HOMESTEAD, FL United Sta… Fatal(8)       
##  5 Accident           16/02/1982 SPRINGFIELD, KY     United Sta… Fatal(8)       
##  6 Accident           27/05/1982 CHANDLER, AZ        United Sta… Fatal(8)       
##  7 Accident           24/01/1982 LAREDO, TX          United Sta… Fatal(7)       
##  8 Accident           05/05/1982 CHARLOTTE, TX       United Sta… Fatal(7)       
##  9 Accident           28/02/1982 VERDI, NV           United Sta… Fatal(6)       
## 10 Accident           13/03/1982 GLENDALE, AZ        United Sta… Fatal(6)       
## # ℹ 33 more rows
## # ℹ 11 more variables: Aircraft_Damage <chr>, Aircraft_Category <chr>,
## #   Aircraft_Manufacturer <chr>, Number_of_Engines <dbl>, Engine_Type <chr>,
## #   Flight_Purpose <chr>, Total_Fatal_Injuries <dbl>,
## #   Total_Serious_Injuries <dbl>, Total_Minor_Injuries <dbl>,
## #   Total_Uninjured <dbl>, Flight_Phase <chr>

My interpretation of the coefficients is explained below:

Incident Probability by Aircraft Manufacturer:
- The number of incidents Incident_Count associated with different aircraft manufacturers in relation to the total number of incidents in the dataset Total_Incidents is 1394.
- The Probability column represents the likelihood of an incident involving aircraft from each manufacturer. For instance, Aero Commander has a probability of about 0.215%, and Aeronca has a probability of about 0.574%.
- These probabilities are relatively low, suggesting that incidents are rare events for these manufacturers within the context of the total data considered.
Anomaly Detection for Fatalities:
- The second set show the records with the highest number of fatalities, which exceed the 95th percentile of the Total_Fatal_Injuries distribution.
- The data includes the Investigation_Type, Event_Date, Location, Country, Injury_Severity , and Aircraft_Damage for these anomalies.
- All these incidents resulted in high fatality counts and resulted in the aircraft being destroyed.
- This is a list of particularly severe accidents, likely the most catastrophic events in the dataset.

The catastrophic events involving high fatality counts are anomalies within the dataset, while the overall probability of an incident varies by manufacturer but remains a relatively low percentage.

The Distribution Characteristics:

I will examine the distribution characteristics of key metrics by sampling data and creating theoretical and empirical distributions. To do this, I can start by sampling the Total_Uninjured data and then calculate the mean of these samples multiple times to create a distribution of sample means.

According to the CLT, this distribution of sample means should be approximately normally distributed, regardless of the shape of the original data distribution, provided the sample size is sufficiently large (commonly n > 30) and the number of samples is large.

What I aim to achieve here is to generate 1000 sample means from the Total_Uninjured column, each based on a sample of 100 observations (with replacement of course), and then plot a histogram of these sample means.

sample_means <- replicate(1000, mean(sample(aviation_accident$Total_Uninjured, 100, replace = TRUE)))
hist(sample_means, breaks = 30, main = "Sampling Distribution of Total Uninjured", xlab = "Sample Means", col = "green")

This provides us with the sampling distribution of the mean number of uninjured individuals per accident, based on the 1000 samples. The key observations from the histogram are:

Shape of Distribution: The distribution appears to be roughly bell-shaped, which is consistent with the Central Limit Theorem, which states that the sampling distribution of the sample means will tend to be normally distributed if the sample size is large enough, regardless of the distribution of the underlying data.
Central Tendency: The sample means are mostly concentrated around the 2 to 3 range, which suggests that the average number of uninjured individuals per accident falls within this range.
Spread of Distribution: The range of sample means extends from 1 to around 7, which reflects the variability in the number of uninjured individuals across different accidents.
Skewness: The distribution shows a slight right skew, with a longer tail extending towards the higher number of uninjured individuals, which suggests that there are a few samples with a mean significantly higher than the mode.
Outliers: There seems to be a gradual decline with no distinct outliers in this histogram.
Implication for CLT: Given that the histogram’s shape aligns well with a normal distribution, it supports the CLT’s prediction that with enough sample size, the sampling distribution of the mean will be normal. This allows us to make statistical inferences about the population mean, even if the population distribution is not normal.

Confidence Intervals and Correlations:

I will proceed to building confidence intervals and conduct correlation analyses between some variables, but with the limited variables in this dataset, I will stick with using Number_of_Engines and Total_Fatal_Injuries , since they both are count variables, and it’s reasonable to look into whether there’s an association between them. I will generate bootstrap confidence intervals for the mean number of engines, Chi-Squared approximation test with simulation, and use a Fisher’s Exact Test for count data:

bootstrap_means <- replicate(1000, {
  sample_indices <- sample(1:nrow(aviation_accident), size = 100, replace = TRUE)
  mean_engines <- mean(aviation_accident$Number_of_Engines[sample_indices], na.rm = TRUE)
  return(mean_engines)
})

ci_number_of_engines <- quantile(bootstrap_means, probs = c(0.025, 0.975))

table_total_fatalities_engines <- table(aviation_accident$Total_Fatal_Injuries, aviation_accident$Number_of_Engines)

chisq_test_result <- chisq.test(table_total_fatalities_engines, simulate.p.value = TRUE, B = 2000)

list(
  Confidence_Interval = ci_number_of_engines,
  Chi_Squared_Test_Result = chisq_test_result
)

## $Confidence_Interval
##  2.5% 97.5% 
##  1.08  1.26 
## 
## $Chi_Squared_Test_Result
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  table_total_fatalities_engines
## X-squared = 83.479, df = NA, p-value = 0.06047

There are a lot of data to analyse here but the key points are:

Confidence Interval for Number of Engines:
- The 95% confidence interval for the mean number of engines is between 1.08 and 1.26, which suggests that we can be 95% confident that the true mean number of engines in the population from which this sample was drawn falls within this range.
- The interval does not include values far from 1, indicating that most aircraft involved in incidents have around one engine, on average.
Chi-Squared Test for Independence between Total Fatal Injuries and Number of Engines:
- The 83.479 for the Chi-Squared test is a measure of the difference between the observed frequencies and the frequencies expected if there was no association between the two variables.
- The p-value of 0.05947 is just above the common alpha level of 0.05 for statistical significance. This p-value, however, indicates a marginal relationship between the number of engines and the fatal injuries; it is close to being significant, suggesting a trend or association, but not strong enough to conclusively reject the null hypothesis of no association at the 5% significance level.
- Practically, this might mean there could be some relationship where the number of engines might influence the severity of an accident in terms of fatalities.

Lest I forget, I need to emphasize that while there is a relationship between the number of engines and the number of fatalities in aviation accidents, the evidence is not strong enough to be statistically conclusive at the traditional 5% level. In the aviation world statistics, statistics indicate that crashes involving multi-engine commercial aircraft have a higher survival rate compared to single-engine aircraft.

Now, Hypothesis Testing:

To test my hypotheses, I will use Neyman-Pearson and Fisher’s significance testing frameworks to conduct these tests below:

aviation_accident$Year <- format(as.Date(aviation_accident$Event_Date, format = "%d/%m/%Y"), "%Y")

# Hypothesis 1: (Average total fatal injuries increased over time)
lm_results <- lm(Total_Fatal_Injuries ~ Year, data = aviation_accident)
summary(lm_results)

## 
## Call:
## lm(formula = Total_Fatal_Injuries ~ Year, data = aviation_accident)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.445 -0.445 -0.445 -0.445 77.555 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.000e+00  2.342e+00   0.854    0.393
## Year1962     2.000e+00  3.312e+00   0.604    0.546
## Year1974     1.000e+00  3.312e+00   0.302    0.763
## Year1977    -8.995e-13  3.312e+00   0.000    1.000
## Year1979    -1.000e+00  3.312e+00  -0.302    0.763
## Year1981     2.000e+00  3.312e+00   0.604    0.546
## Year1982    -1.555e+00  2.343e+00  -0.664    0.507
## 
## Residual standard error: 2.342 on 1373 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.004864,   Adjusted R-squared:  0.0005148 
## F-statistic: 1.118 on 6 and 1373 DF,  p-value: 0.3491

# Hypothesis 2: (No difference in total fatal injuries between Aircraft Categories)
kruskal.test(Total_Fatal_Injuries ~ Aircraft_Category, data = aviation_accident)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Total_Fatal_Injuries by Aircraft_Category
## Kruskal-Wallis chi-squared = 3.6093, df = 4, p-value = 0.4615

Linear Model of Total Fatal Injuries Over Time:
- The coefficient estimates show little change in fatal injuries over the years, with no year showing a statistically significant difference from the baseline.
- The p-values for the years are well above the common significance threshold of 0.05, indicating that there is no statistically significant trend in fatal injuries over the years.
- The very low R-squared values indicate that the model explains little of the variance in the number of total fatal injuries. This suggests that the year alone is not a good predictor of fatal injuries.
- The F-statistic and its p-value (0.3491) indicate that there is not enough evidence to reject the null hypothesis of no relationship between Year and Total_Fatal_Injuries.
Kruskal-Wallis Test for Differences in Fatal Injuries Across Aircraft Categories:
- The Kruskal-Wallis chi-squared is relatively low, and the p-value (0.4615) is above the 0.05 significance level. This suggests there is no statistically significant difference in the number of total fatal injuries across different aircraft categories within my dataset, well NTSB’s dataset :D.

To visualize what I explained above, I will create a scatter plot for the linear model with a regression line, and also a boxplot for the fatal injuries by aircraft category to show similar medians with overlapping interquartile range:

Scatter Plot with Regression Line for Total Fatal Injuries Over Time:

aviation_accident_clean <- aviation_accident %>%
  filter(!is.na(Total_Fatal_Injuries), !is.na(Year), is.finite(Total_Fatal_Injuries))

# Scatter plot with regression line
ggplot(aviation_accident_clean, aes(x = as.numeric(Year), y = Total_Fatal_Injuries)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x, color = "blue") + 
  theme_minimal() +
  labs(title = "Trend of Total Fatal Injuries Over Time",
       x = "Year",
       y = "Total Fatal Injuries")

My interpretation of the scatter plot goes thus:

The line chart shows us that the total number of fatal injuries per year has generally remained low over time, with most years showing zero or a few fatalities.
The shaded area which I assume represents confidence intervals or perhaps the prediction intervals indicates that the variation in the number of fatal injuries has stayed relatively consistent.
There is a noticeable outlier in recent years, which shows a significant increase in fatal injuries. .

Boxplots for Total Fatal Injuries by Aircraft Category:

aviation_accident_clean <- aviation_accident %>%
  filter(!is.na(Total_Fatal_Injuries), !is.na(Aircraft_Category), is.finite(Total_Fatal_Injuries))

ggplot(aviation_accident_clean, aes(x = Aircraft_Category, y = Total_Fatal_Injuries)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Distribution of Total Fatal Injuries by Aircraft Category",
       x = "Aircraft Category",
       y = "Total Fatal Injuries") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

The boxplot visualization is quite straightforward and we can clearly see the following:

The distribution for each aircraft category indicates that the median total fatal injuries are low across all categories, with many categories having a median of zero.
The boxplot for Helicopter shows a wider interquartile range and several outliers, suggesting more variability in fatal injuries compared to other categories.
Aircraft categories like Balloon and Gyrocraft also show outliers, indicating occasional incidents with higher numbers of fatalities.

Regression, Regression, Regression, but Linear:

This will require me to investigate how categorical variables, by performing ANOVA, extending the linear regression model by incorporating multiple variables and checking for multicollinearity.

For example, how categorical variables such as Aircraft_Category influence continuous variables like Total_Fatal_Injuries. Upon doing this, I will extend the linear regression model by incorporating multiple variables and checking for multicollinearity by include variables like Number_of_Engines or and Flight_Phase to ensure that I get to the root of my analysis.

So, to start, I will encode categorical variables using dummy variables and then start with a simple linear model and extend it with multiple variables:

aviation_accident$Aircraft_Category <- as.factor(aviation_accident$Aircraft_Category)
aviation_accident$Flight_Phase <- as.factor(aviation_accident$Flight_Phase)

# Simple linear regression with a categorical variable
lm_single <- lm(Total_Fatal_Injuries ~ Aircraft_Category, data = aviation_accident)
summary(lm_single)

## 
## Call:
## lm(formula = Total_Fatal_Injuries ~ Aircraft_Category, data = aviation_accident)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5000 -0.4018 -0.4018 -0.3193  7.5982 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  0.40182    0.03027  13.273   <2e-16 ***
## Aircraft_CategoryBalloon    -0.40182    0.33466  -1.201    0.230    
## Aircraft_CategoryGlider     -0.13866    0.24368  -0.569    0.569    
## Aircraft_CategoryGyrocraft   0.09818    0.74587   0.132    0.895    
## Aircraft_CategoryHelicopter -0.08249    0.10125  -0.815    0.415    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.054 on 1357 degrees of freedom
##   (32 observations deleted due to missingness)
## Multiple R-squared:  0.001731,   Adjusted R-squared:  -0.001211 
## F-statistic: 0.5884 on 4 and 1357 DF,  p-value: 0.6711

# Multiple linear regression with additional categorical variables
lm_multiple <- lm(Total_Fatal_Injuries ~ Aircraft_Category + Flight_Phase + Number_of_Engines, data = aviation_accident)
summary(lm_multiple)

## 
## Call:
## lm(formula = Total_Fatal_Injuries ~ Aircraft_Category + Flight_Phase + 
##     Number_of_Engines, data = aviation_accident)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8499 -0.4533 -0.2287  0.0069  7.5103 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  0.19222    0.11451   1.679 0.093455 .  
## Aircraft_CategoryBalloon     0.10962    0.32402   0.338 0.735181    
## Aircraft_CategoryGlider      0.11199    0.24049   0.466 0.641519    
## Aircraft_CategoryGyrocraft   0.06888    0.70188   0.098 0.921841    
## Aircraft_CategoryHelicopter -0.20196    0.09913  -2.037 0.041805 *  
## Flight_PhaseClimb            0.38682    0.17852   2.167 0.030420 *  
## Flight_PhaseCruise           0.41586    0.10343   4.021 6.12e-05 ***
## Flight_PhaseDescent         -0.17104    0.17316  -0.988 0.323448    
## Flight_PhaseGo-around        0.56121    0.26848   2.090 0.036779 *  
## Flight_PhaseLanding         -0.46011    0.09542  -4.822 1.58e-06 ***
## Flight_PhaseManeuvering      0.16013    0.12226   1.310 0.190491    
## Flight_PhaseOther            0.18714    0.45097   0.415 0.678224    
## Flight_PhaseStanding        -0.36095    0.27087  -1.333 0.182913    
## Flight_PhaseTakeoff         -0.22454    0.10253  -2.190 0.028695 *  
## Flight_PhaseTaxi            -0.47692    0.14583  -3.270 0.001101 ** 
## Flight_PhaseUnknown          1.13565    0.30967   3.667 0.000255 ***
## Number_of_Engines            0.26103    0.06217   4.199 2.86e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9907 on 1345 degrees of freedom
##   (32 observations deleted due to missingness)
## Multiple R-squared:  0.1257, Adjusted R-squared:  0.1153 
## F-statistic: 12.09 on 16 and 1345 DF,  p-value: < 2.2e-16

Simple Linear Regression Model (Total Fatal Injuries ~ Aircraft Category)

Model Interpretation: The model aimed to predict total fatal injuries based on aircraft category.
Coefficients: Most categories do not have a statistically significant effect on total fatal injuries compared to the baseline category (Intercept), except none have p-values < 0.05, implying no significant impact on fatal injuries by changing aircraft categories.
Model Fit: With an extremely low R-squared value (0.001731), the model explains virtually none of the variance in total fatal injuries, and the R-squared being negative indicates that the model is unsuitable for making predictions about total fatal injuries based on aircraft category alone.
F-statistic and p-value: The overall model is not statistically significant (p-value = 0.6711), reinforcing that aircraft category alone is not a good predictor of total fatal injuries.

Multiple Linear Regression Model (Total Fatal Injuries ~ Aircraft Category + Flight Phase + Number of Engines)

Model Interpretation: This extended model incorporates aircraft category, flight phase, and number of engines.
Significant Predictors:
- Flight Phase: Certain phases like Cruise (p < 0.0001) and Landing (p < 1.58e-06) show significant effects, indicating that these phases may be associated with differing levels of fatal injuries.
- Number of Engines: The positive and significant effect (p < 0.0001), suggest that the number of engines is a significant predictor of fatal injuries, potentially due to the complexity or type of aircraft operations involved.
Model Fit: Improved significantly from the simple model, with an R-squared of 0.1257, indicates that about 12.57% of the variance in total fatal injuries can be explained by the model, although still not a high value, but much better compared to the simple model.
F-statistic and p-value: The model is statistically significant (p < 2.2e-16), suggesting that the predictors collectively influence total fatal injuries significantly.

To check for multicollinearity which is quite crucial when using multiple predictors in a linear regression, I will use the Variance Inflation Factor to dig deeper into this analysis:

vif(lm_multiple)

##                       GVIF Df GVIF^(1/(2*Df))
## Aircraft_Category 1.261325  4        1.029446
## Flight_Phase      1.117560 11        1.005065
## Number_of_Engines 1.167533  1        1.080524

The Variance Inflation Factor helps us to identify the presence of multicollinearity in regression analysis, I can deduce the following from the above co-coefficients:

Aircraft_Category:
- The Generalized VIF is slightly above 1, which is expected since it is a categorical variable with multiple levels (degrees of freedom, Df = 4).
- The scaled VIF (GVIF^(1/(2*Df))) is close to 1, at 1.029446, suggesting that multicollinearity is not a significant issue for this predictor.
Flight_Phase:
- This is a low multicollinearity because the GVIF is close to 1.
- Despite having many levels (Df = 11), the scaled VIF is very close to 1 (1.005065), indicating that it’s unlikely this predictor is causing multicollinearity concerns.
Number_of_Engines:
- With a GVIF slightly above 1 and a scaled VIF of 1.080524, it indicates some multicollinearity but not enough to be generally concerning. A common rule of thumb is that a VIF above 5 or 10 indicates a problematic amount of multicollinearity, and this value is well below those thresholds.

Visualizing the data results:

I will be using the Violin Plot to show each flight phase, grouped by the number of engines, which will also show the distributions of total fatal injuries.

These visualizations will help us in understanding the dynamics of how flight phase and aircraft technical characteristics interact to affect safety outcomes as measured by total fatal injuries.

aviation_accident_clean <- aviation_accident %>%
  filter(is.finite(Total_Fatal_Injuries)) %>%
  group_by(Flight_Phase, Number_of_Engines) %>%
  filter(n() > 1) %>% 
  ungroup()

ggplot(aviation_accident_clean, aes(x = Flight_Phase, y = Total_Fatal_Injuries, fill = as.factor(Number_of_Engines))) +
  geom_violin() +
  labs(title = "Distribution of Total Fatal Injuries by Flight Phase and Number of Engines",
       x = "Flight Phase",
       y = "Total Fatal Injuries") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

The visuals from the Violin Plot for the total fatal injuries by flight phase and numbers can be summarized as below:

Each violin represents the kernel density estimation of the distribution; but, it appears that the violins are not visible, possibly because of a very narrow distribution.
The points are clustered near the bottom of the plot, suggesting that the number of fatal injuries is typically low across all flight phases and engine numbers.
There is no clear pattern that shows a significant variance in fatal injuries with the number of engines or flight phase, as most of the data points are zero, which implies low fatal injury counts.

aviation_means <- aviation_accident %>%
  group_by(Flight_Phase, Number_of_Engines) %>%
  summarise(Average_Fatal_Injuries = mean(Total_Fatal_Injuries, na.rm = TRUE), .groups = 'drop')

ggplot(aviation_means, aes(x = Flight_Phase, y = Average_Fatal_Injuries, fill = as.factor(Number_of_Engines))) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Average Total Fatal Injuries by Flight Phase and Number of Engines",
       x = "Flight Phase",
       y = "Average Total Fatal Injuries") +
  scale_fill_brewer(palette = "Pastel1") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

The bars represent the average total fatal injuries for each flight phase, with different colors representing different numbers of engines. Some flight phases, such as Cruise and Unknown show notably higher average fatalities, especially for aircraft with a higher number of engines.

The variance in average fatal injuries across different flight phases suggests that some phases may inherently carry different levels of risk. There is an observable trend where certain phases like Landing and Takeoff have a spread of averages across the different engine numbers, which might indicate an influence of engine count on injury outcomes during these phases.

Plotting diagnostics for the multiple Linear Regression Model:

The below diagnostics plots will help check the assumption of linearity and homoscedasticity. If the points in this plot are randomly dispersed around the horizontal axis, then these assumptions are met.

par(mfrow = c(2, 2)) 
plot(lm_multiple)

vif(lm_multiple)

##                       GVIF Df GVIF^(1/(2*Df))
## Aircraft_Category 1.261325  4        1.029446
## Flight_Phase      1.117560 11        1.005065
## Number_of_Engines 1.167533  1        1.080524

For the diagnostic plots above, I can observe the following which will make determine whether the assumptions underlying the model are truly met or not:

Residuals vs Fitted Plot: The red line is flat, this indicates non-linearity,the spread of residuals is roughly constant across fitted values for homoscedasticity.
Normal Q-Q Plot: Points following the dashed line suggests normality, but any severe deviations from the line, especially at the tails, indicate departures from normality.
Scale-Location Plot: The horizontal line with randomly spread points indicates homoscedasticity, and funnel shape suggests non-constant variance, i.e. heteroscedasticity.
Residuals vs Leverage Plot: The points outside the Cook’s distance lines are considered influential, and the presence of such points suggests that those observations have a large influence on the model’s predictions.

Interpretation of the VIF Results:

The VIF results are good with all values near 1, indicating that multicollinearity is likely not a concern for this model.
The scaled VIF for Aircraft_Category and Flight_Phase are low, showing that they don’t have multicollinearity issues.
Number_of_Engines has a slightly higher VIF but still well below the threshold of concern.

From the observation above, there seem to be issues with both the linearity and homoscedasticity assumptions of the linear regression model, because there are potential outliers that could be disproportionately affecting the model. On the other hand, multicollinearity does not seem to be an issue in the model based on VIF results.

Generalized Linear Models:

To arrive at the model binary outcomes, such as incident occurrence, and using logistic regression, I will expand and compare different models using AIC, BIC, and deviance as criteria. To start with, I will define a binary outcome variable based on whether there were any a fatal incident and then build logistic regression models to predict the likelihood of a fatal incident. I will be using the Number_of_Engines and also the Aircraft_Category which can serve as a good categorical predictor.

To build the logistic models, the binary outcome Fatal_Incident is defined as 1 if there was at least one fatality and 0 otherwise. The initial logistic model glm_fatalities uses Number_of_Engines and Aircraft_Category to predict the probability of a fatal incident.

aviation_accident$Fatal_Incident <- ifelse(aviation_accident$Total_Fatal_Injuries > 0, 1, 0)

# Building the logistic regression model
glm_fatalities <- glm(Fatal_Incident ~ Number_of_Engines + Aircraft_Category, family = binomial, data = aviation_accident)
summary(glm_fatalities)

## 
## Call:
## glm(formula = Fatal_Incident ~ Number_of_Engines + Aircraft_Category, 
##     family = binomial, data = aviation_accident)
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -1.858104   0.196421  -9.460   <2e-16 ***
## Number_of_Engines             0.297386   0.146581   2.029   0.0425 *  
## Aircraft_CategoryBalloon    -13.707964 460.237152  -0.030   0.9762    
## Aircraft_CategoryGlider       0.795273   0.551828   1.441   0.1495    
## Aircraft_CategoryGyrocraft    1.560718   1.416606   1.102   0.2706    
## Aircraft_CategoryHelicopter   0.003891   0.253175   0.015   0.9877    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1295.6  on 1361  degrees of freedom
## Residual deviance: 1285.9  on 1356  degrees of freedom
##   (32 observations deleted due to missingness)
## AIC: 1297.9
## 
## Number of Fisher Scoring iterations: 14

# Expanding model and comparing using AIC and BIC
glm_expanded <- update(glm_fatalities, . ~ . + Flight_Phase)
AIC(glm_fatalities, glm_expanded)

##                df      AIC
## glm_fatalities  6 1297.903
## glm_expanded   17 1075.510

BIC(glm_fatalities, glm_expanded)

##                df      BIC
## glm_fatalities  6 1329.203
## glm_expanded   17 1164.194

The output presents the results of fitting a logistic regression model to predict a binary outcome Fatal_Incident based on the number of engines on an aircraft and the aircraft’s category.

Logistic Regression Model Summary:

Intercept: The model’s intercept is significantly negative (Estimate = -1.858104, p < 0.0001), suggesting that when the number of engines is zero (though not practically possible) and the reference level for Aircraft_Category, the log odds of a fatal incident are low.
Number_of_Engines: For every additional engine, the log odds of having a fatal incident increase by 0.297386. This predictor is significant (p = 0.0425), indicating a relationship between the number of engines and the occurrence of fatal incidents.
Aircraft_Category: The different levels of Aircraft_Category have varying effects, but none are statistically significant, with large standard errors.
Model Fit: The null deviance and residual deviance are measures of goodness of fit. The closer these are, the better the model fits. Here, the small decrease from null to residual deviance suggests that the predictors do not explain much of the variation in fatal incidents.
AIC: The AIC decreased substantially in the expanded model, suggesting better explanatory power.
Number of Fisher Scoring iterations: Indicates the number of iterations taken to converge on a solution, a higher number indicates a more complex model.

Model Comparison (AIC and BIC):

AIC: The expanded model has a much lower AIC (1075.510) compared to the initial model (1297.903), suggesting that including additional predictors significantly improves the model.
BIC (Bayesian Information Criterion): Similar to AIC, but includes a penalty for the number of parameters in the model to avoid overfitting. A lower BIC indicates a better model, and again, the expanded model shows a better fit with a lower BIC (1164.194) compared to the initial model (1329.203).

General Interpretation of the GLM:

The significant positive coefficient for Number_of_Engines suggests that aircraft with more engines are associated with a higher likelihood of fatal incidents in the data.
The aircraft categories, with their large standard errors and insignificant p-values, may not be reliable predictors in this model, potentially due to limited data or variability within those categories.
The improved AIC and BIC in the expanded model suggest that including more variables provides a better fit for predicting fatal incidents.
The comparison of AIC and BIC suggests that expanding the model was beneficial.

Time Series Modeling:

I will now analyze time series, using variables such as the frequency of crash and how often does a particular aircraft type crashes.

To perform a time series analysis on this dataset, I will first need to make sure that the data is in the correct format. Since I want to aggregate the data quarterly, I will convert the Event_Date into Date format and extract the quarters from it:

aviation_accident$Date <- as.Date(aviation_accident$Event_Date, format = "%d/%m/%Y")

# Aggregate data by quarter
quarterly_incidents <- aviation_accident %>%
  count(Quarter = floor_date(Date, "quarter"))

ts_data <- ts(quarterly_incidents$n, start = c(year(min(quarterly_incidents$Quarter)), quarter(min(quarterly_incidents$Quarter))), frequency = 4)

auto_arima_model <- auto.arima(ts_data)
summary(auto_arima_model)

## Series: ts_data 
## ARIMA(0,1,0) 
## 
## sigma^2 = 74018:  log likelihood = -49.17
## AIC=100.35   AICc=101.15   BIC=100.3
## 
## Training set error measures:
##                    ME     RMSE      MAE      MPE     MAPE      MASE       ACF1
## Training set 83.37512 254.4914 96.37512 5.290773 32.19946 0.2781389 -0.2251853

Acf(ts_data)

Pacf(ts_data)

The output and the provided ACF and PACF plots correspond to the time series analysis of the ts_data using an ARIMA model, and here is what I can refer from these three output:

ARIMA Model Summary:

sigma^2: This is the variance of the errors, estimated at 74018, which seems quite large, indicating that there is considerable volatility in the time series data.
log likelihood: The log likelihood of -49.17, which measures how well the model fits the data; in isolation, it doesn’t convey much, but it is used to calculate AIC and BIC.
AIC and BIC: The AIC and BIC are both measures for model selection where lower values indicate a better-fitting model. These values can be used to compare this model against other models.
Error measures: With Mean Error at 83.37512, suggesting a positive bias in predictions, and Root Mean Square Error at 254.4914, which is relatively high, indicating that the model’s predictions can be off by quite a bit on average.

ACF and PACF Plots:

ACF Plot: This shows the autocorrelation of the series with its lagged values. The lack of significant spikes outside the confidence intervals at lags other than 0 indicates no clear autocorrelation in the differenced series.
PACF Plot: The partial autocorrelation which is the correlation of the series with its lagged values not accounted for by prior lags. Like the ACF plot, the lack of significant spikes indicates no strong autoregressive behavior in the data.

Sampling and Drawing Conclusions:

For the last part of my analysis, I will examine how sampling variability impacts conclusions using Monte Carlo simulations to understand the impact of sampling on inferential statistics.

To examine the impact of sampling variability on conclusions about total fatalities using the Monte Carlo simulations, I can do it this way:

aviation_accident$Date <- as.Date(aviation_accident$Event_Date, format = "%d/%m/%Y")

annual_fatalities <- aviation_accident %>%
  group_by(Year = format(Date, "%Y")) %>%
  summarise(Total_Fatal_Injuries = sum(Total_Fatal_Injuries, na.rm = TRUE))

set.seed(123) 
simulations <- replicate(100, {
  sample_data <- sample_n(aviation_accident, size = nrow(aviation_accident) / 2, replace = FALSE)
  mean(sample_data$Total_Fatal_Injuries, na.rm = TRUE)
})

hist(simulations, breaks = 30, main = "Distribution of Sample Means from Monte Carlo Simulations")

summary(simulations)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3367  0.4009  0.4444  0.4552  0.5107  0.5754

The histogram and the summary statistics illustrate the distribution of sample means of total fatal injuries calculated from 100 Monte Carlo simulations, which I will explain further below:

Distribution Shape: The histogram displays a roughly symmetric distribution of sample means centered around 0.45. The roughly bell-shaped distribution suggests that the sample means vary around the true mean in a way that’s consistent with the Central Limit Theorem.
Central Tendency:
- Median: The median of the simulations is 0.4444, which is a measure of the central tendency that’s robust to outliers and skewness.
- Mean: The mean of the simulations is 0.4552, which is close to the median, indicating that the distribution of the sample means might be symmetric.
Variability:
- Minimum and Maximum: The range of the simulated sample means is from 0.3367 to 0.5754. This range shows the extent of variability in the sample means due to random sampling.
- Interquartile Range (IQR): The IQR, from the first quartile (0.4009) to the third quartile (0.5107), contains the middle 50% of the sample means. This spread indicates the concentration of the distribution around the median.

Comparisons from the Various Outputs:

The spread of the sample means indicates that there is sampling variability.
The fact that the mean and median are very close together and the distribution appears symmetric suggests that the sampling distribution of the mean may be approximating a normal distribution, in line with theoretical expectations.
There are no extreme outliers in the simulation results, which implies that individual random samples are likely to provide a reasonable estimate of the population mean, with the provided variability range.

So, it is safe to say that the results provide a practical demonstration of the concept of sampling variability and underline the importance of considering confidence intervals when making inferences based on sample data.

General Observations:

Injury and Fatality Data:
- Most incidents resulted in no or minor injuries, with a median of 1 uninjured person per incident.
- Fatalities and serious injuries are less common, but when they occur, they can be significant.
Aircraft and Flight Phases:
- The number of engines on an aircraft was associated with the likelihood of a fatal incident, with more engines showing a higher likelihood.
- Certain flight phases like cruising and landing had higher average fatalities, suggesting these phases are more prone to severe incidents.
Modeling and Predictive Insights:
- Linear regression models indicated that the year alone is not a strong predictor of fatal injuries.
- Logistic regression analysis did not find strong evidence that aircraft categories influence the occurrence of fatal incidents significantly.
- Time series analysis did not show clear seasonal or cyclic patterns in the data, indicating that incidents may not follow predictable temporal trends.

Sampling Variability:
- Monte Carlo simulations highlighted the variability in estimating the mean total fatal injuries, underscoring the natural variability present in sampled data.

Conclusions:

Generally, it is common knowledge that aviation safety have improved overtime, and this analysis truly sometimes that ‘common knowledge’ which shows us that truly, aviation safety have improved drastically,with no upward trend in fatal injuries, although the data suggests that certain types of aircraft and flight phases may require more attention due to higher associated risks.
My analysis suggests that while the type of aircraft which I can label as number of engines is significant, the category is not a predictor of accidents, implying that other factors, such as operational ones, might be more critical.
Fatal aviation incidents are relatively infrequent but can vary significantly when they do occur, with some categories more prone to higher numbers of fatalities.
The recent spike in fatalities may suggest a deviation from the norm and should be a focus for safety reviews or policy responses.
Helicopters and certain other categories might warrant targeted safety measures due to the observed variability in fatal injury incidents.

Recommendations:

Focused Safety Measures:
- Safety measures should particularly target identified high-risk phases of flight, like cruising and landing.
- Aircraft with more engines might require additional safety protocols or maintenance checks.
Data-Driven Safety Enhancements:
- Further investigation into the specific conditions and factors that lead to higher fatality rates during different flight phases.
- Authorities should implement data monitoring systems to continually update risk models and identify new patterns as they emerge.
Ongoing Research and Development:
- Regulatory bodies needs to invest heavily in research to understand the causative factors behind the statistical observations, especially for aircraft with more engines.
- Exploring the use of predictive analytics and machine learning to preemptively identify risk factors and mitigate them should be employed by enforcement and regulatory bodies.
Sampling Considerations in Policy Making:
- The use of statistical confidence intervals and hypothesis testing when using sampled data to make policy decisions is important to ensure that natural sampling variability is accounted for.

By leveraging the insights, observations, and recommendations from several statistical analyses and recognizing the inherent uncertainties, the aviation industry can continue to improve its safety protocols, thereby reducing the risk of accidents and enhancing passenger safety.

Data Source:

The National Transportation Safety Board is an independent federal agency charged with investigating civil aviation accident in the United States and significant events in the other modes of transportation.

The dataset was spooled from the official website on https://www.ntsb.gov/_layouts/ntsb.aviation/index.aspx and converted to CSV format which was then used for this comprehensive analysis.

Oluwatosin Agbaakin - 2001292765

Final Project for Statistics for Data Science

04 / 21 / 2024

THANK YOU!!!.

Statistics for Data Science - Final Project

Aviation Accident Analysis

Oluwatosin Agbaakin

2024-04-13