Project Overview

This project presents an end-to-end data-driven analysis of aviation accidents in the United States, using real-world data from the National Transportation Safety Board (NTSB).
The study focuses on exploring accident patterns, severity factors, and the statistical relationships between aircraft, weather, operational conditions, and outcomes.

Core Objective

To uncover actionable insights that help understand the causes, frequency, and severity of aviation incidents, while identifying factors that significantly influence accident outcomes.

Analytical Approach

The project employs a systematic approach involving:

1. Data Cleaning & Preprocessing: Handling missing data, standardizing formats, and engineering critical variables such as FatalityRate and IncidentSeverity.
2. Exploratory Data Analysis (EDA): Understanding patterns through visualizations and grouping by aircraft type, location, and weather conditions.
3. Statistical Testing (t-test, Regression, ANOVA): Examining relationships between continuous and categorical variables like:
   - Engine count vs fatality rate
   - Weather and purpose of flight vs accident severity
   - Aircraft damage levels and their impact on fatalities
4. Clustering (K-Means): Grouping incidents by severity and injury counts to identify natural patterns and high-risk clusters.
5. Classification (KNN): Predicting severity categories and incident occurrence at airports using normalized feature data.
6. Association Rule Mining (Apriori): Discovering interrelated patterns between flight purpose, weather, and damage type to reveal root-cause linkages.

Key Variables Analyzed

Operational: Purpose of Flight, Weather Condition, Number of Engines
Geographic: State, Latitude, Longitude
Outcome-based: FatalityRate, IncidentSeverity, AircraftDamage
Categorical Factors: AirCraftCategory, AtAirport

Project Scope

The project spans 34 analytical questions covering:

- Accident trends and distribution across years and locations
- Effect of flight purpose, weather, and aircraft type on fatalities
- Statistical modeling and clustering for predictive insights
- Rule-based analysis to uncover hidden correlations

Outcome

This comprehensive study bridges exploratory, statistical, and predictive analytics to understand aviation safety from multiple dimensions.
Findings aim to support aviation regulators, operators, and policymakers by highlighting risk patterns and suggesting evidence-based strategies for safer flight operations.

Load the required libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(stringr)
library(cluster)
library(class)
library(arules)

## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## 
## Attaching package: 'arules'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following objects are masked from 'package:base':
## 
##     abbreviate, write

library(ggplot2)

—————————————————————-

Data Loading and Preprocessing

—————————————————————-

# Load dataset, adjust the file path as necessary
crash_data <- read_csv("C:/PaNDa/not possible/CAP_482/Project_DataSets/aviation.csv")

## Rows: 44507 Columns: 36
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (26): NtsbNo, EventType, City, State, Country, ReportNo, N, ReportType,...
## dbl   (6): Mkey, FatalInjuryCount, SeriousInjuryCount, MinorInjuryCount, Lat...
## lgl   (2): HasSafetyRec, RepGenFlag
## dttm  (2): EventDate, OriginalPublishDate
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Remove columns with excessive missing data
crash_data <- crash_data %>%
  select(-c(DocketUrl, DocketPublishDate))

# Fill missing numeric values with mean
crash_data <- crash_data %>%
  mutate(across(where(is.numeric), ~ replace_na(., mean(., na.rm = TRUE))))

# Fill missing character values with mode
impute_mode <- function(x) {
  mode_val <- names(sort(table(x), decreasing = TRUE))[1]
  replace_na(x, mode_val)
}
crash_data <- crash_data %>%
  mutate(across(where(is.character), impute_mode))

# Standardize manufacturer names to uppercase
crash_data <- crash_data %>%
  mutate(Make = toupper(Make))

# Clean PurposeOfFlight: trim commas, keep first category, replace empty with NA
crash_data <- crash_data %>%
  mutate(PurposeOfFlight = str_replace_all(PurposeOfFlight, "^,+|,+$", "")) %>%
  mutate(PurposeOfFlight = str_split(PurposeOfFlight, ",", simplify = TRUE)[,1]) %>%
  mutate(PurposeOfFlight = na_if(PurposeOfFlight, ""))

# Impute missing PurposeOfFlight with mode
mode_purpose <- names(sort(table(crash_data$PurposeOfFlight), decreasing = TRUE))[1]
crash_data <- crash_data %>%
  mutate(PurposeOfFlight = replace_na(PurposeOfFlight, mode_purpose))

# Map PurposeOfFlight codes to full descriptive names
crash_data <- crash_data %>%
  mutate(PurposeOfFlight_Full = case_when(
    PurposeOfFlight == "PERS" ~ "Personal",
    PurposeOfFlight == "INST" ~ "Instructional",
    PurposeOfFlight == "POSI" ~ "Positioning",
    PurposeOfFlight == "FERY" ~ "Ferry",
    PurposeOfFlight == "OWRK" ~ "Other Work",
    PurposeOfFlight == "UNK"  ~ "Unknown",
    PurposeOfFlight == "PUBF" ~ "Public Flight",
    PurposeOfFlight == "FLTS" ~ "Flight Test",
    PurposeOfFlight == "AOBV" ~ "Aerial Observation/Surveillance",
    PurposeOfFlight == "PUBS" ~ "Public Service",
    PurposeOfFlight == "BUS"  ~ "Business",
    PurposeOfFlight == "AAPL" ~ "Aerial Application",
    PurposeOfFlight == "EXLD" ~ "Excluded (possibly military/confidential)",
    PurposeOfFlight == "PUBU" ~ "Public Utility",
    PurposeOfFlight == "SKYD" ~ "Skydiving",
    PurposeOfFlight == "ASHO" ~ "Aerial Show",
    PurposeOfFlight == "EXEC" ~ "Executive",
    PurposeOfFlight == "FIRF" ~ "Firing",
    PurposeOfFlight == "PUBL" ~ "Public",
    PurposeOfFlight == "BANT" ~ "Banner/Advertising",
    PurposeOfFlight == "GLDT" ~ "Gliding",
    PurposeOfFlight == "ADRP" ~ "Aerial Drop (cargo, supplies, etc.)",
    TRUE ~ "Other"
  ))

# Clean NumberOfEngines: extract first number, convert, impute missing/zero with mean, round
crash_data <- crash_data %>%
  mutate(NumberOfEngines_clean = str_extract(NumberOfEngines, "^\\d+")) %>%
  mutate(NumberOfEngines_clean = as.numeric(NumberOfEngines_clean))

mean_engines <- mean(crash_data$NumberOfEngines_clean[crash_data$NumberOfEngines_clean > 0], na.rm = TRUE)
crash_data <- crash_data %>%
  mutate(
    NumberOfEngines_clean = ifelse(is.na(NumberOfEngines_clean) | NumberOfEngines_clean == 0,
                                   mean_engines, NumberOfEngines_clean)
  ) %>%
  mutate(NumberOfEngines_clean = round(NumberOfEngines_clean, 0))

# Clean AirCraftCategory: uppercase, trim spaces and commas, map known codes, assign UNKNOWN for others
category_map <- c(
  "AIR" = "AIRPLANE",
  "HELI" = "HELICOPTER",
  "UNMANNED" = "DRONE",
  "PPAR" = "PARAGLIDER",
  "GYRO" = "GYROCOPTER",
  "GLI" = "GLIDER",
  "BALL" = "BALLOON",
  "WSFT" = "WEIGHT_SHIFT",
  "ULTR" = "ULTRALIGHT",
  "UNK" = "UNKNOWN",
  "BLIM" = "BLIMP",
  "PLFT" = "POWERED_LIFT"
)

crash_data <- crash_data %>%
  mutate(
    AirCraftCategory = str_to_upper(AirCraftCategory),
    AirCraftCategory = str_replace_all(AirCraftCategory, "\\s+", ""),
    AirCraftCategory = str_replace_all(AirCraftCategory, ",+", ","),
    AirCraftCategory = str_replace_all(AirCraftCategory, "^,|,$", ""),
    AirCraftCategory = ifelse(AirCraftCategory == "" | AirCraftCategory == ",", NA, AirCraftCategory),
    AirCraftCategory = str_split(AirCraftCategory, ",", simplify = TRUE)[, 1],
    AirCraftCategory = category_map[AirCraftCategory],
    AirCraftCategory = ifelse(is.na(AirCraftCategory), "UNKNOWN", AirCraftCategory)
  )

# Convert categorical columns to factors
categorical_cols <- c("AirCraftCategory", "WeatherCondition", "PurposeOfFlight", "PurposeOfFlight_Full",
                      "Make", "Model", "Country", "State", "ReportStatus")
crash_data <- crash_data %>%
  mutate(across(all_of(categorical_cols), as.factor))

# Check missing value summary after preprocessing
missing_summary <- sapply(crash_data, function(x) sum(is.na(x)))
missing_df_sorted <- data.frame(Column = names(missing_summary), MissingCount = missing_summary) %>%
  arrange(desc(MissingCount))
print(missing_df_sorted)

##                                      Column MissingCount
## OriginalPublishDate     OriginalPublishDate         6176
## NtsbNo                               NtsbNo            0
## EventType                         EventType            0
## Mkey                                   Mkey            0
## EventDate                         EventDate            0
## City                                   City            0
## State                                 State            0
## Country                             Country            0
## ReportNo                           ReportNo            0
## N                                         N            0
## HasSafetyRec                   HasSafetyRec            0
## ReportType                       ReportType            0
## HighestInjuryLevel       HighestInjuryLevel            0
## FatalInjuryCount           FatalInjuryCount            0
## SeriousInjuryCount       SeriousInjuryCount            0
## MinorInjuryCount           MinorInjuryCount            0
## ProbableCause                 ProbableCause            0
## Latitude                           Latitude            0
## Longitude                         Longitude            0
## Make                                   Make            0
## Model                                 Model            0
## AirCraftCategory           AirCraftCategory            0
## AirportID                         AirportID            0
## AirportName                     AirportName            0
## AmateurBuilt                   AmateurBuilt            0
## NumberOfEngines             NumberOfEngines            0
## Scheduled                         Scheduled            0
## PurposeOfFlight             PurposeOfFlight            0
## FAR                                     FAR            0
## AirCraftDamage               AirCraftDamage            0
## WeatherCondition           WeatherCondition            0
## Operator                           Operator            0
## ReportStatus                   ReportStatus            0
## RepGenFlag                       RepGenFlag            0
## PurposeOfFlight_Full   PurposeOfFlight_Full            0
## NumberOfEngines_clean NumberOfEngines_clean            0

# Create AtAirport if preprocessing has not already added it.
# AirportID was mode-imputed above, so the NA check is only a safeguard.
if (!"AtAirport" %in% names(crash_data)) {
  crash_data <- crash_data %>%
    mutate(AtAirport = factor(!is.na(AirportID) & AirportID != "None"))
}
print("AtAirport column created/verified.")

## [1] "AtAirport column created/verified."

# Save cleaned data for future use
write_csv(crash_data, "C:/PaNDa/not possible/CAP_482/Project_DataSets/aviation_cleaned.csv")
View(crash_data)

—————————————————————-

1. Data Cleaning & Preprocessing

—————————————————————-

Q1: Which columns have the most missing values, and how should they be handled?

crash_data <- crash_data %>% select(-OriginalPublishDate)
missing_summary <- sapply(crash_data, function(x) sum(is.na(x)))
missing_df <- data.frame(Column = names(missing_summary), MissingCount = missing_summary)
missing_df_sorted <- missing_df[order(-missing_df$MissingCount), ]
print(missing_df_sorted)

##                                      Column MissingCount
## NtsbNo                               NtsbNo            0
## EventType                         EventType            0
## Mkey                                   Mkey            0
## EventDate                         EventDate            0
## City                                   City            0
## State                                 State            0
## Country                             Country            0
## ReportNo                           ReportNo            0
## N                                         N            0
## HasSafetyRec                   HasSafetyRec            0
## ReportType                       ReportType            0
## HighestInjuryLevel       HighestInjuryLevel            0
## FatalInjuryCount           FatalInjuryCount            0
## SeriousInjuryCount       SeriousInjuryCount            0
## MinorInjuryCount           MinorInjuryCount            0
## ProbableCause                 ProbableCause            0
## Latitude                           Latitude            0
## Longitude                         Longitude            0
## Make                                   Make            0
## Model                                 Model            0
## AirCraftCategory           AirCraftCategory            0
## AirportID                         AirportID            0
## AirportName                     AirportName            0
## AmateurBuilt                   AmateurBuilt            0
## NumberOfEngines             NumberOfEngines            0
## Scheduled                         Scheduled            0
## PurposeOfFlight             PurposeOfFlight            0
## FAR                                     FAR            0
## AirCraftDamage               AirCraftDamage            0
## WeatherCondition           WeatherCondition            0
## Operator                           Operator            0
## ReportStatus                   ReportStatus            0
## RepGenFlag                       RepGenFlag            0
## PurposeOfFlight_Full   PurposeOfFlight_Full            0
## NumberOfEngines_clean NumberOfEngines_clean            0
## AtAirport                         AtAirport            0

Interpretation:

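After preprocessing, OriginalPublishDate was the only column with missing values (6,176 of 44,507 rows, about 14%), so it was dropped here.
DocketUrl and DocketPublishDate had already been removed during loading.
All other gaps were imputed earlier (mean for numeric columns, mode for character columns), so the zero counts reflect filled values rather than complete source data.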
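
Because the table above is computed after imputation, it cannot show which columns were incomplete in the source file. A minimal sketch of profiling missingness before any values are filled in is given below. The data frame `raw` and its values are hypothetical stand-ins for the unprocessed CSV, and base R is used so the example runs without extra packages.

```r
# Hypothetical stand-in for the raw data before imputation
raw <- data.frame(
  FatalInjuryCount = c(0, NA, 2, 1),
  State            = c("Texas", NA, NA, "Alaska"),
  AirportID        = c("None", "ABC", NA, NA),
  stringsAsFactors = FALSE
)

# Percentage of missing values per column, highest first
missing_pct <- sort(colMeans(is.na(raw)) * 100, decreasing = TRUE)
print(missing_pct)
```

Running the same two lines on the freshly loaded data, before the imputation steps, would give the ranking that Q1 asks for.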

Q2: What is the distribution of categorical and numerical columns in the aviation dataset?

cat_cols <- names(crash_data)[sapply(crash_data, is.factor)]
num_cols <- names(crash_data)[sapply(crash_data, is.numeric)]
cat("Categorical columns:\n")

## Categorical columns:

print(cat_cols)

##  [1] "State"                "Country"              "Make"                
##  [4] "Model"                "AirCraftCategory"     "PurposeOfFlight"     
##  [7] "WeatherCondition"     "ReportStatus"         "PurposeOfFlight_Full"
## [10] "AtAirport"

cat("Numerical columns:\n")

## Numerical columns:

print(num_cols)

## [1] "Mkey"                  "FatalInjuryCount"      "SeriousInjuryCount"   
## [4] "MinorInjuryCount"      "Latitude"              "Longitude"            
## [7] "NumberOfEngines_clean"

Interpretation:

The dataset has 10 factor (categorical) columns and 7 numeric columns.
Character columns that were not converted to factors, such as City and EventType, do not appear in the categorical list.
This inventory of data types guides which columns are suitable for visualization, regression, or clustering.

—————————————————————-

2. Exploratory Data Analysis (EDA)

—————————————————————-

Q3: What are the annual trends in aviation accidents? Which years show a peak?

crash_data <- crash_data %>% mutate(Year = as.numeric(format(EventDate, "%Y")))
accident_rates_by_year <- crash_data %>% count(Year) %>% arrange(desc(n))
View(accident_rates_by_year)

Interpretation:

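Incidents were counted per year and sorted by count, so the peak years appear at the top of the table.
The counts alone do not explain why a given year spikes; weather events or operational factors are possible causes that would need separate evidence.
Sorting by Year rather than by count gives the time trend needed to judge how aviation safety has changed.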

Q4: Which aircraft category has the highest number of fatalities?

avg_fatalities_by_category <- crash_data %>%
  group_by(AirCraftCategory) %>%
  summarise(AvgFatalities = mean(FatalInjuryCount, na.rm = TRUE)) %>%
  arrange(desc(AvgFatalities))
View(avg_fatalities_by_category)

Interpretation:

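The code computes the mean number of fatalities per incident for each aircraft category, not the total the question asks for; replacing mean() with sum() would give totals.
Some smaller or lightweight aircraft categories showed higher average fatality counts.
This may indicate limited safety redundancy or protection systems, although categories with few incidents produce unstable averages.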

Q5: Which state reports the highest number of aviation accidents?

accidents_per_state <- crash_data %>%
  group_by(State) %>%
  summarise(TotalFatalities = sum(FatalInjuryCount, na.rm = TRUE)) %>%
  arrange(desc(TotalFatalities)) %>%
  head(10)
print(accidents_per_state)

## # A tibble: 10 × 2
##    State          TotalFatalities
##    <fct>                    <dbl>
##  1 California               16400
##  2 Florida                    860
##  3 Texas                      836
##  4 New York                   713
##  5 Alaska                     507
##  6 Arizona                    443
##  7 Colorado                   418
##  8 Georgia                    388
##  9 North Carolina             296
## 10 Utah                       272

Interpretation:

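The table ranks states by total fatalities, not by number of accidents, so it does not directly answer the question.
California's total (16,400) is roughly 19 times Florida's (860). This gap is probably an artifact: missing State values were imputed with the mode during preprocessing, which assigns every record without a state to the most frequent one.
Region-based safety planning should rely on accident counts computed from records with a genuine State value.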
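
The query above sums FatalInjuryCount per state, which is a different quantity from the number of accidents. The sketch below shows both side by side on a hypothetical toy data frame (`toy` and its values are invented for illustration; only the column names follow the report).

```r
# Hypothetical toy records, one row per accident
toy <- data.frame(
  State            = c("Texas", "Texas", "Alaska", "Texas", "Alaska"),
  FatalInjuryCount = c(0, 1, 4, 0, 0)
)

# Number of accidents per state
accident_counts <- sort(table(toy$State), decreasing = TRUE)

# Total fatalities per state
fatality_totals <- tapply(toy$FatalInjuryCount, toy$State, sum)

print(accident_counts)
print(fatality_totals)
```

In this toy data Texas has the most accidents while Alaska has the most fatalities, which illustrates why the two rankings should not be used interchangeably.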

—————————————————————-

3. Feature Engineering

—————————————————————-

Q6: How can the FatalityRate and IncidentSeverity columns be engineered?

crash_data <- crash_data %>%
  mutate(
    FatalityRate = (FatalInjuryCount / (FatalInjuryCount + SeriousInjuryCount + MinorInjuryCount)) * 100,
    IncidentSeverity = case_when(
      FatalInjuryCount > 0 ~ "Catastrophic",
      SeriousInjuryCount > 0 ~ "Serious",
      MinorInjuryCount > 0 ~ "Minor",
      TRUE ~ "No Injury"
    )
  ) %>%
  mutate(IncidentSeverity = as.factor(IncidentSeverity))

Interpretation:

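FatalityRate is the percentage of injured persons who died: fatal injuries divided by the sum of fatal, serious, and minor injuries, times 100.
It is undefined (NaN) for incidents with no recorded injuries, because the denominator is zero; those rows are dropped by the later regression models.
IncidentSeverity labels each incident by its worst injury level (Catastrophic, Serious, Minor, or No Injury), and both features feed the regression, ANOVA, and classification steps.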
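
A short worked example may help show how the FatalityRate formula behaves, including the case where an incident has no injuries at all. The injury counts below are hypothetical.

```r
# Hypothetical injury counts for four incidents
fatal   <- c(2, 0, 0, 1)
serious <- c(0, 1, 0, 1)
minor   <- c(2, 0, 0, 0)

total_injured <- fatal + serious + minor
fatality_rate <- fatal / total_injured * 100
print(fatality_rate)

# Incidents with no injuries divide 0 by 0 and yield NaN
n_undefined <- sum(is.nan(fatality_rate))
print(n_undefined)
```

The third incident has no injuries, so its rate is NaN rather than 0. Functions such as lm() drop those rows silently, which is worth keeping in mind when reading sample sizes in later sections.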

Q7: What is the distribution of WeatherCondition and what is its impact?

weather_dist <- crash_data %>%
  count(WeatherCondition) %>%
  arrange(desc(n))
print(weather_dist)

## # A tibble: 3 × 2
##   WeatherCondition     n
##   <fct>            <int>
## 1 VMC              41905
## 2 IMC               2254
## 3 Unknown            348

Interpretation:

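Most incidents occurred in VMC (Visual Meteorological Conditions): 41,905 of 44,507 (94.2%), against 2,254 in IMC (Instrument Meteorological Conditions, 5.1%) and 348 with unknown weather (0.8%).
IMC is therefore not linked to higher accident counts; it accounts for a small share of incidents.
Counts alone do not measure risk, since far more flying takes place in VMC. The regression in Q18 addresses severity instead and estimates a much lower FatalityRate for VMC than for the IMC baseline.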
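
The shares of each weather condition can be derived directly from the counts printed above. The sketch below hard-codes those three counts for illustration; in the analysis itself they would come from `weather_dist`.

```r
# Counts copied from the WeatherCondition table above
weather_counts <- c(VMC = 41905, IMC = 2254, Unknown = 348)

# Percentage share of each condition, rounded to one decimal
weather_share <- round(100 * prop.table(weather_counts), 1)
print(weather_share)
```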

—————————————————————-

4. Aggregation & Grouping

—————————————————————-

Q8: Are incidents more frequent in multi-engine aircraft?

multi_engine_incidents <- crash_data %>%
  filter(!is.na(NumberOfEngines_clean)) %>%  # column is numeric, so no empty-string check is needed
  count(NumberOfEngines_clean) %>%
  arrange(desc(n))
print(multi_engine_incidents)

## # A tibble: 6 × 2
##   NumberOfEngines_clean     n
##                   <dbl> <int>
## 1                     1 38830
## 2                     2  5420
## 3                     4   159
## 4                     3    88
## 5                     6     7
## 6                     8     3

Interpretation:

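Incidents were counted by engine count: single-engine aircraft account for 38,830 of 44,507 incidents (87.2%) and twin-engine aircraft for 5,420 (12.2%).
The counts say nothing about severity, and without fleet size or flight hours they cannot show whether multi-engine aircraft are safer.
Missing or zero engine counts were replaced with the rounded mean during preprocessing, which most likely adds to the single-engine total.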

Q9: Which manufacturer has the highest number of fatal incidents per 100 aircraft registered?

fatal_incidents_per_manufacturer <- crash_data %>%
  group_by(Make) %>%
  summarise(Fatal_Incidents = sum(FatalInjuryCount, na.rm = TRUE)) %>%
  arrange(desc(Fatal_Incidents)) %>%
  head(10)
print(fatal_incidents_per_manufacturer)

## # A tibble: 10 × 2
##    Make              Fatal_Incidents
##    <fct>                       <dbl>
##  1 BOEING                       4950
##  2 CESSNA                       4253
##  3 PIPER                        2759
##  4 BEECH                        1802
##  5 AIRBUS                       1330
##  6 AIRBUS INDUSTRIE             1088
##  7 BELL                          738
##  8 ROBINSON                      548
##  9 MOONEY                        285
## 10 MCDONNELL DOUGLAS             264

Interpretation:

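The code sums fatalities per manufacturer; it does not count fatal incidents, and it is not normalized per 100 registered aircraft because the dataset contains no registration figures.
AIRBUS and AIRBUS INDUSTRIE appear as separate entries; combined they total 2,418 fatalities, which would rank fourth.
The totals largely reflect aircraft size and operational exposure, so they should not be read as a measure of manufacturer quality.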
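
The table lists AIRBUS and AIRBUS INDUSTRIE separately, so name variants are being counted as distinct manufacturers. A minimal sketch of consolidating such variants before aggregating is shown below, using three totals copied from the table. The regular expression is an assumption about how the variants are spelled and would need checking against the full Make column.

```r
# Totals copied from the manufacturer table above
makes      <- c("BOEING", "AIRBUS", "AIRBUS INDUSTRIE")
fatalities <- c(4950, 1330, 1088)

# Collapse every name starting with AIRBUS into one label (assumed pattern)
make_norm <- sub("^AIRBUS.*$", "AIRBUS", makes)

# Re-aggregate on the normalized name
combined <- tapply(fatalities, make_norm, sum)
print(combined)
```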

Q10: Which flight purpose has the highest accident rate?

accident_rate_by_purpose <- crash_data %>%
  group_by(PurposeOfFlight_Full) %>%
  summarise(Total_Incidents = n()) %>%
  arrange(desc(Total_Incidents))
print(accident_rate_by_purpose)

## # A tibble: 22 × 2
##    PurposeOfFlight_Full            Total_Incidents
##    <fct>                                     <int>
##  1 Personal                                  31115
##  2 Instructional                              5272
##  3 Aerial Application                         1891
##  4 Business                                   1172
##  5 Positioning                                 968
##  6 Unknown                                     846
##  7 Other Work                                  693
##  8 Aerial Observation/Surveillance             466
##  9 Flight Test                                 464
## 10 Public Utility                              242
## # ℹ 12 more rows

Interpretation:

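Incidents were counted by flight purpose: personal flights account for 31,115 of 44,507 incidents (69.9%) and instructional flights for 5,272 (11.8%).
These are counts, not rates. Without flight hours or departures per purpose, they show where incidents are concentrated but not which purpose is riskiest.
Missing purposes were imputed with the mode during preprocessing, which adds to the largest category.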

—————————————————————-

5. Survival and Reporting Rates

—————————————————————-

Q11: What is the overall survival rate of aviation incidents?

survival_rate <- crash_data %>%
  mutate(Survival = (1 - (FatalInjuryCount / (FatalInjuryCount + SeriousInjuryCount + MinorInjuryCount))) * 100)
cat("Average survival rate:", mean(survival_rate$Survival, na.rm = TRUE), "%\n")

## Average survival rate: 58.05775 %

Interpretation:

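Survival is computed per incident as 100 minus FatalityRate and then averaged, giving 58.06%.
Incidents with no recorded injuries yield NaN and are excluded by na.rm = TRUE, so the figure covers only incidents with at least one injury.
Uninjured people are not in the denominator, so this is the average share of injured persons who survived, not an overall occupant survival rate, and it does not by itself show that most accidents were non-fatal or that safety standards have improved.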
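
The figure above is a mean of per-incident percentages, which weights a one-person incident the same as a large one. The sketch below contrasts it with a pooled rate on hypothetical data; both are legitimate summaries, but they answer different questions.

```r
# Hypothetical injury counts for three incidents
fatal   <- c(0, 0, 8)
serious <- c(1, 1, 0)
minor   <- c(0, 1, 2)

injured <- fatal + serious + minor

# Mean of per-incident survival percentages (the approach used above)
mean_of_rates <- mean((1 - fatal / injured) * 100, na.rm = TRUE)

# Pooled survival: all injured persons treated as one population
pooled_rate <- (1 - sum(fatal) / sum(injured)) * 100

print(c(mean_of_rates = mean_of_rates, pooled_rate = pooled_rate))
```

Here the per-incident mean is about 73% while the pooled rate is about 38%, because the one large incident dominates the pooled figure.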

Q12: What percentage of incidents have an official report published?

report_percentage <- crash_data %>%
  summarise(ReportPublished = sum(!is.na(ReportStatus)) / n() * 100)
cat("Percentage with official report:", report_percentage$ReportPublished, "%\n")

## Percentage with official report: 100 %

Interpretation:

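The result is 100% because ReportStatus contains no missing values after preprocessing, where any gaps in character columns were filled with the mode, so !is.na(ReportStatus) is TRUE for every row.
The figure is therefore not evidence of complete reporting or regulatory transparency.
Measuring report coverage requires tabulating the ReportStatus categories, or counting missing values before imputation.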

—————————————————————-

6. Root Cause Analysis

—————————————————————-

Q13: What are the most common causes of aviation accidents?

common_causes <- crash_data %>%
  group_by(ProbableCause) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count)) %>%
  slice_head(n = 10)
View(common_causes)

Interpretation:

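The ten most frequent ProbableCause values were extracted.
ProbableCause is free text grouped by exact string, so similar causes worded differently are counted separately, and any records that originally lacked a cause were assigned the most frequent one during preprocessing.
Pilot error and mechanical issues emerged as dominant factors, which points to training and maintenance programs as focus areas.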

—————————————————————-

7. Advanced Grouping & Hypothesis Testing

—————————————————————-

Q14: Are amateur-built aircraft more dangerous than factory-built aircraft? (t-test)

crash_data$AmateurBuilt <- case_when(
  crash_data$AmateurBuilt == "TRUE" ~ TRUE,
  crash_data$AmateurBuilt == "FALSE" ~ FALSE,
  TRUE ~ NA
)
crash_data_amateur <- crash_data %>% filter(!is.na(AmateurBuilt))
t_amateur <- t.test(FatalInjuryCount ~ AmateurBuilt, data = crash_data_amateur)
print(t_amateur)

## 
##  Welch Two Sample t-test
## 
## data:  FatalInjuryCount by AmateurBuilt
## t = 10.462, df = 43580, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  0.2513898 0.3672959
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##           0.6364457           0.3271028

Interpretation:

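A Welch t-test compared mean fatal injuries per incident for amateur-built and factory-built aircraft.
Factory-built aircraft (group FALSE) had the higher mean: 0.64 versus 0.33 for amateur-built, a difference of 0.31 (95% CI 0.25 to 0.37, p < 2.2e-16).
This does not show that amateur-built aircraft are safer: the factory-built group includes large aircraft that carry many more occupants, so fatality counts per incident are not comparable without adjusting for occupancy.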
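
Mean fatal injuries per incident can be dominated by a few large accidents, so a complementary measure is the share of incidents with at least one fatality. The sketch below computes both on hypothetical toy data; the values are invented to show how the two measures can point in opposite directions.

```r
# Hypothetical toy data: one large factory-built accident among small ones
toy <- data.frame(
  AmateurBuilt     = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  FatalInjuryCount = c(1, 0, 1, 0, 0, 30)
)

# Mean fatalities per incident, by group
mean_fatalities <- tapply(toy$FatalInjuryCount, toy$AmateurBuilt, mean)

# Percentage of incidents with at least one fatality, by group
share_fatal <- tapply(toy$FatalInjuryCount > 0, toy$AmateurBuilt, mean) * 100

print(mean_fatalities)
print(share_fatal)
```

In this toy data the factory-built group has the higher mean but the lower share of fatal incidents.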

Q15: Is there a significant difference in fatal injuries between Northern and Southern Hemisphere? (t-test)

crash_data_hemi <- crash_data %>%
  filter(!is.na(Latitude) & !is.na(FatalInjuryCount)) %>%
  mutate(Hemisphere = ifelse(Latitude >= 0, "Northern", "Southern"))
t_hemi <- t.test(FatalInjuryCount ~ Hemisphere, data = crash_data_hemi)
print(t_hemi)

## 
##  Welch Two Sample t-test
## 
## data:  FatalInjuryCount by Hemisphere
## t = -6.1537, df = 899.67, p-value = 1.139e-09
## alternative hypothesis: true difference in means between group Northern and group Southern is not equal to 0
## 95 percent confidence interval:
##  -1.6733180 -0.8640648
## sample estimates:
## mean in group Northern mean in group Southern 
##              0.5848555              1.8535469

Interpretation:

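A Welch t-test found a significant difference in mean fatal injuries per incident: 0.58 in the Northern Hemisphere versus 1.85 in the Southern (difference 1.27, 95% CI 0.86 to 1.67, p = 1.1e-09).
The test establishes a difference, not its cause; geographic or climatic explanations would need additional evidence.
Because missing latitudes were replaced with the column mean during preprocessing, records without coordinates are all assigned to whichever hemisphere that mean falls in.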

Q16: Do incidents at airports have a higher fatality rate than incidents elsewhere?

fatality_airport <- crash_data %>%
  filter(!is.na(AirportID) & AirportID != "None") %>%
  group_by(AirportID) %>%
  summarise(AvgFatalities = mean(FatalInjuryCount, na.rm = TRUE)) %>%
  arrange(desc(AvgFatalities))
print(fatality_airport)

## # A tibble: 7,690 × 2
##    AirportID AvgFatalities
##    <chr>             <dbl>
##  1 OPRN                157
##  2 FMCH                152
##  3 URSS                113
##  4 MUHA                112
##  5 OLBA                 90
##  6 XUBS                 89
##  7 CGK                  62
##  8 RCQC                 58
##  9 WIHH                 44
## 10 ZYLD                 42
## # ℹ 7,680 more rows

Interpretation:

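The code ranks individual airports by mean fatalities per incident; it does not compare airport incidents with incidents elsewhere, because rows with AirportID equal to "None" are filtered out.
The top entries (for example OPRN, with 157) may be averages over a single recorded accident.
Answering the question requires grouping all incidents by the AtAirport flag and comparing FatalityRate or mean fatalities between the two groups.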
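
The question asks for a comparison between airport and non-airport incidents, which can be made by grouping on an at-airport flag rather than on AirportID. A minimal sketch on hypothetical toy data follows; it derives the flag the same way the preprocessing step defines AtAirport (AirportID not equal to "None").

```r
# Hypothetical toy records; "None" marks incidents away from an airport
toy <- data.frame(
  AirportID        = c("None", "ABC", "None", "XYZ", "ABC"),
  FatalInjuryCount = c(2, 0, 1, 0, 1),
  stringsAsFactors = FALSE
)

# Flag incidents that occurred at an airport
toy$AtAirport <- toy$AirportID != "None"

# Mean fatalities per incident for each group
group_means <- tapply(toy$FatalInjuryCount, toy$AtAirport, mean)
print(group_means)
```

On the real data, the same grouping could be followed by t.test(FatalInjuryCount ~ AtAirport, data = crash_data) to check whether the difference is statistically significant.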

—————————————————————-

8. Regression and Statistical Testing

—————————————————————-

Q17: Is there a relationship between NumberOfEngines and FatalityRate?

model_engines <- lm(FatalityRate ~ NumberOfEngines_clean, data = crash_data)
summary(model_engines)

## 
## Call:
## lm(formula = FatalityRate ~ NumberOfEngines_clean, data = crash_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -76.98 -41.31 -41.31  58.69  58.69 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            36.2128     1.1065  32.727  < 2e-16 ***
## NumberOfEngines_clean   5.0958     0.9381   5.432 5.63e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47.59 on 20217 degrees of freedom
##   (24288 observations deleted due to missingness)
## Multiple R-squared:  0.001458,   Adjusted R-squared:  0.001408 
## F-statistic: 29.51 on 1 and 20217 DF,  p-value: 5.628e-08

Interpretation:

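The slope is positive, not negative: each additional engine is associated with a FatalityRate about 5.1 percentage points higher (p = 5.63e-08).
The model explains almost none of the variation (R-squared = 0.0015) and uses only the 20,219 incidents with a defined FatalityRate; 24,288 were dropped.
The result therefore does not demonstrate a redundancy benefit. Multi-engine aircraft likely differ in size, occupancy, and type of operation, none of which this model controls for.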
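
To make the direction and size of the fitted relationship concrete, the coefficients printed above can be plugged into the regression equation by hand. The sketch below hard-codes the two estimates from the summary output; with the fitted object available, predict(model_engines, newdata = ...) would give the same values.

```r
# Estimates copied from the coefficient table above
intercept <- 36.2128
slope     <- 5.0958

# Fitted FatalityRate for 1, 2, and 4 engines
engines     <- c(1, 2, 4)
fitted_rate <- intercept + slope * engines
print(round(fitted_rate, 2))
```

The fitted rate rises from about 41.3% for one engine to about 56.6% for four, consistent with a positive slope.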

Q18: Can FatalityRate be predicted using WeatherCondition, PurposeOfFlight, and NumberOfEngines?

model_predict <- lm(FatalityRate ~ WeatherCondition + PurposeOfFlight_Full + NumberOfEngines_clean, data = crash_data)
summary(model_predict)

## 
## Call:
## lm(formula = FatalityRate ~ WeatherCondition + PurposeOfFlight_Full + 
##     NumberOfEngines_clean, data = crash_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -95.27 -39.38 -29.25  60.62  83.16 
## 
## Coefficients:
##                                                               Estimate
## (Intercept)                                                    65.5886
## WeatherConditionUnknown                                        -5.7228
## WeatherConditionVMC                                           -33.4040
## PurposeOfFlight_FullAerial Drop (cargo, supplies, etc.)        10.2931
## PurposeOfFlight_FullAerial Observation/Surveillance             4.4067
## PurposeOfFlight_FullAerial Show                                21.4859
## PurposeOfFlight_FullBanner/Advertising                         -3.0696
## PurposeOfFlight_FullBusiness                                    9.2073
## PurposeOfFlight_FullExcluded (possibly military/confidential)   0.6202
## PurposeOfFlight_FullExecutive                                   9.8481
## PurposeOfFlight_FullFerry                                      12.9533
## PurposeOfFlight_FullFiring                                     21.4764
## PurposeOfFlight_FullFlight Test                                 6.3336
## PurposeOfFlight_FullGliding                                     9.6097
## PurposeOfFlight_FullInstructional                              -3.3235
## PurposeOfFlight_FullOther Work                                 -3.4462
## PurposeOfFlight_FullPersonal                                    6.8011
## PurposeOfFlight_FullPositioning                                 9.5575
## PurposeOfFlight_FullPublic                                    -15.7351
## PurposeOfFlight_FullPublic Flight                               1.3841
## PurposeOfFlight_FullPublic Service                             -1.8319
## PurposeOfFlight_FullPublic Utility                              2.8309
## PurposeOfFlight_FullSkydiving                                  -3.1768
## PurposeOfFlight_FullUnknown                                    28.1063
## NumberOfEngines_clean                                           0.3932
##                                                               Std. Error
## (Intercept)                                                       2.2535
## WeatherConditionUnknown                                           3.2673
## WeatherConditionVMC                                               1.1797
## PurposeOfFlight_FullAerial Drop (cargo, supplies, etc.)          17.5484
## PurposeOfFlight_FullAerial Observation/Surveillance               3.1107
## PurposeOfFlight_FullAerial Show                                   4.5137
## PurposeOfFlight_FullBanner/Advertising                            6.1313
## PurposeOfFlight_FullBusiness                                      2.5660
## PurposeOfFlight_FullExcluded (possibly military/confidential)     5.4421
## PurposeOfFlight_FullExecutive                                     5.1243
## PurposeOfFlight_FullFerry                                         4.7559
## PurposeOfFlight_FullFiring                                        7.7735
## PurposeOfFlight_FullFlight Test                                   3.4129
## PurposeOfFlight_FullGliding                                       8.3265
## PurposeOfFlight_FullInstructional                                 1.9669
## PurposeOfFlight_FullOther Work                                    2.8492
## PurposeOfFlight_FullPersonal                                      1.6539
## PurposeOfFlight_FullPositioning                                   2.7796
## PurposeOfFlight_FullPublic                                        7.6679
## PurposeOfFlight_FullPublic Flight                                 6.7330
## PurposeOfFlight_FullPublic Service                                7.9797
## PurposeOfFlight_FullPublic Utility                                4.5328
## PurposeOfFlight_FullSkydiving                                     5.4442
## PurposeOfFlight_FullUnknown                                       2.5886
## NumberOfEngines_clean                                             0.9364
##                                                               t value Pr(>|t|)
## (Intercept)                                                    29.105  < 2e-16
## WeatherConditionUnknown                                        -1.752 0.079867
## WeatherConditionVMC                                           -28.316  < 2e-16
## PurposeOfFlight_FullAerial Drop (cargo, supplies, etc.)         0.587 0.557507
## PurposeOfFlight_FullAerial Observation/Surveillance             1.417 0.156607
## PurposeOfFlight_FullAerial Show                                 4.760 1.95e-06
## PurposeOfFlight_FullBanner/Advertising                         -0.501 0.616624
## PurposeOfFlight_FullBusiness                                    3.588 0.000334
## PurposeOfFlight_FullExcluded (possibly military/confidential)   0.114 0.909270
## PurposeOfFlight_FullExecutive                                   1.922 0.054638
## PurposeOfFlight_FullFerry                                       2.724 0.006462
## PurposeOfFlight_FullFiring                                      2.763 0.005736
## PurposeOfFlight_FullFlight Test                                 1.856 0.063500
## PurposeOfFlight_FullGliding                                     1.154 0.248468
## PurposeOfFlight_FullInstructional                              -1.690 0.091088
## PurposeOfFlight_FullOther Work                                 -1.210 0.226478
## PurposeOfFlight_FullPersonal                                    4.112 3.93e-05
## PurposeOfFlight_FullPositioning                                 3.438 0.000586
## PurposeOfFlight_FullPublic                                     -2.052 0.040174
## PurposeOfFlight_FullPublic Flight                               0.206 0.837123
## PurposeOfFlight_FullPublic Service                             -0.230 0.818430
## PurposeOfFlight_FullPublic Utility                              0.625 0.532280
## PurposeOfFlight_FullSkydiving                                  -0.584 0.559553
## PurposeOfFlight_FullUnknown                                    10.858  < 2e-16
## NumberOfEngines_clean                                           0.420 0.674561
##                                                                  
## (Intercept)                                                   ***
## WeatherConditionUnknown                                       .  
## WeatherConditionVMC                                           ***
## PurposeOfFlight_FullAerial Drop (cargo, supplies, etc.)          
## PurposeOfFlight_FullAerial Observation/Surveillance              
## PurposeOfFlight_FullAerial Show                               ***
## PurposeOfFlight_FullBanner/Advertising                           
## PurposeOfFlight_FullBusiness                                  ***
## PurposeOfFlight_FullExcluded (possibly military/confidential)    
## PurposeOfFlight_FullExecutive                                 .  
## PurposeOfFlight_FullFerry                                     ** 
## PurposeOfFlight_FullFiring                                    ** 
## PurposeOfFlight_FullFlight Test                               .  
## PurposeOfFlight_FullGliding                                      
## PurposeOfFlight_FullInstructional                             .  
## PurposeOfFlight_FullOther Work                                   
## PurposeOfFlight_FullPersonal                                  ***
## PurposeOfFlight_FullPositioning                               ***
## PurposeOfFlight_FullPublic                                    *  
## PurposeOfFlight_FullPublic Flight                                
## PurposeOfFlight_FullPublic Service                               
## PurposeOfFlight_FullPublic Utility                               
## PurposeOfFlight_FullSkydiving                                    
## PurposeOfFlight_FullUnknown                                   ***
## NumberOfEngines_clean                                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.22 on 20194 degrees of freedom
##   (24288 observations deleted due to missingness)
## Multiple R-squared:  0.05906,    Adjusted R-squared:  0.05795 
## F-statistic: 52.82 on 24 and 20194 DF,  p-value: < 2.2e-16

Interpretation:

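The multi-variable regression models FatalityRate as a function of WeatherCondition, PurposeOfFlight_Full, and NumberOfEngines_clean.
The model is significant overall (F = 52.82, p < 2.2e-16) but explains only about 6% of the variance (R-squared = 0.059), so most of the variation in fatality rate is left unexplained.
Weather is the strongest single term (VMC: t = -28.3). Among flight purposes, Unknown, Aerial Show, Personal, Business, and Positioning are significant at the 0.001 level, while NumberOfEngines_clean is not significant (p = 0.675).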
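
To read a wide coefficient table like the one above, it can help to pull out only the significant terms together with the R-squared. The following is a minimal sketch on simulated data; the `toy` data frame and its values are made up, and only the column names mirror the report.

```r
# Simulated stand-in for crash_data (values are invented for illustration)
set.seed(1)
n <- 500
toy <- data.frame(
  WeatherCondition = factor(sample(c("IMC", "VMC"), n, replace = TRUE,
                                   prob = c(0.2, 0.8))),
  NumberOfEngines_clean = sample(1:2, n, replace = TRUE)
)
toy$FatalityRate <- ifelse(toy$WeatherCondition == "IMC", 70, 35) +
  rnorm(n, sd = 40)

fit <- lm(FatalityRate ~ WeatherCondition + NumberOfEngines_clean, data = toy)

# Coefficient table as a data frame with plain column names
coefs <- as.data.frame(summary(fit)$coefficients)
names(coefs) <- c("estimate", "std_error", "t_value", "p_value")

# Keep terms significant at the 5% level, largest |t| first
sig <- coefs[coefs$p_value < 0.05, ]
sig <- sig[order(-abs(sig$t_value)), ]
print(round(sig, 4))

# Share of variance explained by the model
r_squared <- summary(fit)$r.squared
print(round(r_squared, 3))
```

Sorting by the absolute t value makes it easy to compare the relative strength of terms, while the R-squared shows how much of the outcome the model explains overall.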

Q19: What is the effect of Latitude on FatalityRate?

model_latitude <- lm(FatalityRate ~ Latitude, data = crash_data)
summary(model_latitude)

## 
## Call:
## lm(formula = FatalityRate ~ Latitude, data = crash_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41.94 -41.94 -41.94  58.06  58.30 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.194e+01  3.349e-01 125.241   <2e-16 ***
## Latitude    -9.913e-07  1.151e-06  -0.861    0.389    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47.62 on 20217 degrees of freedom
##   (24288 observations deleted due to missingness)
## Multiple R-squared:  3.668e-05,  Adjusted R-squared:  -1.278e-05 
## F-statistic: 0.7416 on 1 and 20217 DF,  p-value: 0.3892

Interpretation:

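The regression on Latitude tests whether geography has a linear effect on fatality rate.
The slope is not statistically significant (p = 0.389) and R-squared is effectively zero (3.7e-05), so this model gives no evidence that fatality rate varies with latitude.
The very small coefficient and standard error (about 1e-06) also suggest that Latitude contains values far outside the valid range of -90 to 90, so the column should be cleaned before any geographic conclusion is drawn.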

Q20: Is there a progressive increase in FatalityRate across IncidentSeverity levels? (ANOVA)

anova_severity <- aov(FatalityRate ~ IncidentSeverity, data = crash_data)
summary(anova_severity)

##                     Df   Sum Sq  Mean Sq F value Pr(>F)    
## IncidentSeverity     2 42220284 21110142  117630 <2e-16 ***
## Residuals        20216  3627998      179                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 24288 observations deleted due to missingness

Interpretation:

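ANOVA tested whether mean FatalityRate differs across IncidentSeverity categories.
The differences are very large (F = 117630, p < 2e-16), which is unsurprising if IncidentSeverity was engineered from the same injury counts that define FatalityRate.
The test confirms that the groups differ but does not by itself show a progressive increase; the group means must be compared for that.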
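
An F statistic this large says little about the size of the effect. A quick way to express it is eta squared, the share of total variance explained by the grouping. The sketch below simply reuses the sums of squares printed in the ANOVA table above; the variable names are illustrative.

```r
# Sums of squares copied from the ANOVA table printed for Q20
ss_between <- 42220284   # IncidentSeverity
ss_within  <- 3627998    # Residuals

# Eta squared: proportion of total variance in FatalityRate
# that is accounted for by the IncidentSeverity grouping
eta_sq <- ss_between / (ss_between + ss_within)
print(round(eta_sq, 3))
```

The result is roughly 0.92, meaning the severity categories account for most of the variance in FatalityRate.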

Q21: Does AirCraftDamage severity correlate with higher FatalityRates? (ANOVA)

crash_data$AirCraftDamage <- as.factor(crash_data$AirCraftDamage)
anova_damage <- aov(FatalityRate ~ AirCraftDamage, data = crash_data)
summary(anova_damage)

##                   Df   Sum Sq Mean Sq F value Pr(>F)    
## AirCraftDamage    19 12612912  663837   403.5 <2e-16 ***
## Residuals      20199 33235370    1645                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 24288 observations deleted due to missingness

Interpretation:

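ANOVA compared fatality rates across AirCraftDamage levels.
Mean fatality rate differs significantly across levels (F = 403.5, p < 2e-16), but the omnibus test does not identify which levels drive the difference; a post-hoc comparison is needed to confirm that “Destroyed” has the highest rate.
The 19 degrees of freedom imply 20 distinct damage levels, which suggests the column still contains combined or uncleaned values.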
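
ANOVA is an omnibus test, so identifying which damage levels differ requires a post-hoc comparison. One option, not part of the original analysis, is Tukey's HSD from base R. The sketch below runs it on simulated data with three made-up damage levels.

```r
# Simulated data: three damage levels with invented fatality rates
set.seed(42)
toy <- data.frame(
  AirCraftDamage = factor(rep(c("Minor", "Substantial", "Destroyed"), each = 60)),
  FatalityRate = c(rnorm(60, 5, 10), rnorm(60, 15, 10), rnorm(60, 80, 10))
)

fit <- aov(FatalityRate ~ AirCraftDamage, data = toy)

# Pairwise differences in group means with family-wise adjusted p-values
tukey <- TukeyHSD(fit, "AirCraftDamage")
pairs <- tukey$AirCraftDamage   # matrix with columns: diff, lwr, upr, p adj
print(round(pairs, 3))
```

Each row compares two levels; a confidence interval that excludes zero indicates that the pair differs. On the real data, the 20 damage levels would produce 190 pairs, so consolidating the levels first would make the output easier to read.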

—————————————————————-

9. Visualization

—————————————————————-

Q22: Monthly patterns in accidents

crash_data <- crash_data %>%
  mutate(EventMonth = as.numeric(format(EventDate, "%m")))
accidents_by_month <- crash_data %>%
  count(EventMonth)
p22 <- ggplot(accidents_by_month, aes(x = factor(EventMonth, levels = 1:12), y = n)) +
  geom_bar(stat = "identity", fill = "purple") +
  labs(title = "Monthly Distribution of Aviation Incidents",
       x = "Month", y = "Number of Incidents") +
  theme_minimal()
print(p22)

Interpretation:

Monthly counts show how accidents are distributed across the calendar year.
Months with higher counts may reflect heavier seasonal flying activity as well as weather; the chart shows frequency only and does not establish a cause.
Useful for seasonal safety planning and flight scheduling.

Q23: FatalityRate distribution by aircraft category

p23 <- ggplot(crash_data, aes(x = FatalityRate, fill = AirCraftCategory)) +
  geom_histogram(binwidth = 5, alpha = 0.7, position = "dodge") +
  labs(title = "Fatality Rate Distribution by Aircraft Category", x = "Fatality Rate (%)", y = "Count") +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal()
print(p23)

## Warning: Removed 24288 rows containing non-finite outside the scale range
## (`stat_bin()`).

Interpretation:

Distribution plots compared FatalityRate across aircraft categories.
Smaller categories exhibited wider fatality variability.
Indicates greater risk among lighter aircraft under stress conditions.

Q24: Bar chart of top 10 states by total accidents

top_states <- crash_data %>%
  count(State, sort = TRUE) %>%
  head(10)
p24 <- ggplot(top_states, aes(x = reorder(State, n), y = n)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Top 10 States by Number of Aviation Accidents", x = "State", y = "Number of Accidents") +
  coord_flip() +
  theme_minimal()
print(p24)

Interpretation:

Ranked states by accident count using bar charts.
Provided a clear visual of the top accident-prone regions.
Aids authorities in focusing safety resources geographically.

Q25: Scatter plot of total fatalities vs. year (trend over time)

fatalities_by_year <- crash_data %>%
  group_by(Year) %>%
  summarise(TotalFatalities = sum(FatalInjuryCount, na.rm = TRUE), .groups = "drop")
p25 <- ggplot(fatalities_by_year, aes(x = Year, y = TotalFatalities)) +
  geom_point(color = "red", alpha = 0.6) +
  geom_smooth(method = "loess", se = TRUE) +
  labs(title = "Total Fatalities Trend Over Years", x = "Year", y = "Total Fatalities") +
  theme_minimal()
print(p25)

## `geom_smooth()` using formula = 'y ~ x'

Interpretation:

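Trend visualization over years tracked changes in total fatalities.
The long-term decline is consistent with regulatory improvements and safety technology, although the plot alone cannot attribute it to a specific cause.
Occasional spikes mark years with unusually high fatality totals and merit case-by-case review.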

Q26: Boxplot of FatalityRate by PurposeOfFlight_Full (top 5 purposes)

top_purposes <- crash_data %>%
  count(PurposeOfFlight_Full, sort = TRUE) %>%
  head(5) %>%
  pull(PurposeOfFlight_Full)
crash_subset <- crash_data %>% filter(PurposeOfFlight_Full %in% top_purposes)
p26 <- ggplot(crash_subset, aes(x = PurposeOfFlight_Full, y = FatalityRate)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Fatality Rate by Top 5 Flight Purposes", x = "Purpose of Flight", y = "Fatality Rate (%)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p26)

## Warning: Removed 22504 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Interpretation:

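Boxplots of FatalityRate were generated for the five most frequent flight purposes.
Because FatalityRate is concentrated near 0% and 100%, medians are a coarse summary; in the Q18 regression, Personal flights have a significantly higher fatality rate than the reference purpose (+6.8 points, p < 0.001), while Instructional flights are slightly lower (-3.3 points, p = 0.091).
Highlights where stricter supervision and pilot training may be needed, most clearly for personal flying.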

Q27: Heatmap of accidents by EventMonth and AirCraftCategory

# Count accidents per month and category, filling missing combinations with 0
monthly_category <- crash_data %>%
  count(EventMonth, AirCraftCategory) %>%
  complete(EventMonth = 1:12, AirCraftCategory, fill = list(n = 0))
p27 <- ggplot(monthly_category, aes(x = factor(EventMonth, levels = 1:12), y = AirCraftCategory, fill = n)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightblue", high = "darkred") +
  labs(title = "Heatmap: Accidents by Month and Aircraft Category", x = "Month", y = "Aircraft Category", fill = "Count") +
  theme_minimal()
print(p27)

Interpretation:

Created a heatmap combining aircraft category and month.
Certain aircraft types show monthly concentration in crashes.
Indicates interaction between operational schedules and environmental conditions.

—————————————————————-

10. Clustering (K-Means)

—————————————————————-

Q28: K-Means clustering based on FatalityRate and injury counts

cluster_data <- crash_data %>%
  select(FatalityRate, SeriousInjuryCount, MinorInjuryCount) %>%
  filter(complete.cases(.))

set.seed(123)
kmeans_result <- kmeans(cluster_data, centers = 3)
cluster_data$cluster <- as.factor(kmeans_result$cluster)

ggplot(cluster_data, aes(x = FatalityRate, y = SeriousInjuryCount, color = cluster)) +
  geom_point(alpha = 0.6) +
  labs(title = "K-Means Clustering of Aviation Incidents")

Interpretation:

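K-means on FatalityRate, SeriousInjuryCount, and MinorInjuryCount grouped incidents into three clusters.
Because the features are not scaled, FatalityRate (0 to 100) dominates the distance, so the split is driven mainly by low versus high fatality rate rather than by low, medium, and high levels.
Aids segmentation for predictive severity modeling, provided the features are scaled first.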

Q29: K-Means on FatalityRate, Serious/Minor Injuries

cluster_data1 <- crash_data %>%
  select(FatalityRate, SeriousInjuryCount, MinorInjuryCount) %>%
  filter(complete.cases(.))

set.seed(123)
kmeans_result1 <- kmeans(cluster_data1, centers = 3)
cluster_data1$cluster <- as.factor(kmeans_result1$cluster)

p29 <- ggplot(cluster_data1, aes(x = FatalityRate, y = SeriousInjuryCount, color = cluster)) +
  geom_point(alpha = 0.6) +
  labs(title = "K-Means Clustering: Injury Metrics") +
  theme_minimal()
print(p29)

print("Cluster Centers (Q29):")

## [1] "Cluster Centers (Q29):"

print(kmeans_result1$centers)

##   FatalityRate SeriousInjuryCount MinorInjuryCount
## 1    0.9516408          0.6929716       1.13669191
## 2   94.5035062          0.1548708       0.03747601
## 3    3.1409188         22.6666667     162.50000000

Interpretation:

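This run uses the same three features and the same seed as Q28, so it reproduces that clustering and does not validate it independently.
The centers show a low-fatality cluster (mean FatalityRate about 1%), a high-fatality cluster (about 94.5%), and what appears to be a very small cluster of mass-injury events (about 23 serious and 163 minor injuries on average).
There is no distinct medium-fatality group, and the unscaled FatalityRate dominates the distances.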
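
K-means relies on Euclidean distance, so a feature measured on a 0 to 100 scale outweighs counts that are mostly 0 or 1. One common remedy, not used in the original analysis, is to standardize the features before clustering. The sketch below uses simulated data; the `toy` values are invented and only the column names follow the report.

```r
# Simulated injury data (invented values)
set.seed(123)
toy <- data.frame(
  FatalityRate       = c(rep(0, 50), rep(100, 50)),
  SeriousInjuryCount = rpois(100, 1),
  MinorInjuryCount   = rpois(100, 1)
)

# z-score each column so every feature contributes on a comparable scale
scaled <- scale(toy)

km <- kmeans(scaled, centers = 3, nstart = 25)

# Convert the centers back to original units so they are readable
centers_orig <- sweep(km$centers, 2, attr(scaled, "scaled:scale"), "*")
centers_orig <- sweep(centers_orig, 2, attr(scaled, "scaled:center"), "+")
print(round(centers_orig, 2))
print(table(km$cluster))
```

Printing the cluster sizes alongside the centers also reveals whether a cluster consists of only a handful of outliers.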

Q30: K-Means on spatial features (Year, Latitude, Longitude) for regional patterns

spatial_data <- crash_data %>%
  select(Year, Latitude, Longitude) %>%
  filter(complete.cases(.))

set.seed(123)
kmeans_result2 <- kmeans(spatial_data, centers = 4)
spatial_data$cluster <- as.factor(kmeans_result2$cluster)

# Longitude on x and Latitude on y follows the usual map orientation
p30 <- ggplot(spatial_data, aes(x = Longitude, y = Latitude, color = cluster)) +
  geom_point(alpha = 0.5) +
  labs(title = "K-Means Clustering: Spatial and Temporal Patterns") +
  theme_minimal()
print(p30)

print("Spatial Cluster Centers (Q30):")

## [1] "Spatial Cluster Centers (Q30):"

print(kmeans_result2$centers)

##       Year     Latitude     Longitude
## 1 2021.125 1.465587e+05  680928.78000
## 2 2022.000 1.424632e+05 -926798.83000
## 3 2021.667 3.685580e+07 -334741.81297
## 4 2011.346 3.372937e+01     -84.70322

Interpretation:

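Three of the four cluster centers lie far outside the valid coordinate ranges (latitude -90 to 90, longitude -180 to 180), for example a mean Latitude of 3.69e+07, so those clusters isolate malformed coordinate records rather than regions.
Only cluster 4 (latitude about 33.7, longitude about -84.7, mean year 2011) has plausible coordinates.
No hotspot or flight-corridor conclusions can be drawn until the coordinates are cleaned and the features are scaled.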
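
Latitude must lie between -90 and 90 and longitude between -180 and 180, so the cluster centers printed above point to malformed records in the coordinate columns. A minimal sketch of a validity filter follows; the coordinates are made up and the `coords` data frame is a hypothetical stand-in for the relevant columns of `crash_data`.

```r
# Made-up coordinates: three plausible rows and two malformed ones
coords <- data.frame(
  Latitude  = c(33.7, 61.2, 146558.7, 36855800, 25.8),
  Longitude = c(-84.7, -149.9, 680928.8, -334741.8, -80.2)
)

# Keep rows whose coordinates fall inside the valid geographic ranges
valid <- with(coords,
  !is.na(Latitude) & !is.na(Longitude) &
    abs(Latitude) <= 90 & abs(Longitude) <= 180
)
clean <- coords[valid, ]

print(clean)
cat("Dropped", sum(!valid), "of", nrow(coords), "rows\n")
```

Reporting how many rows are dropped is worthwhile, because a large share of invalid coordinates may point to a parsing problem upstream and not just bad records.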

Q31: Elbow method to determine optimal number of clusters (using injury data)

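# Drop the cluster label added in Q29 so only the numeric features are clustered
elbow_features <- cluster_data1 %>% select(-cluster)
wss <- sapply(1:10, function(k) {
  kmeans(elbow_features, centers = k, nstart = 10)$tot.withinss
})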

p31 <- ggplot(tibble(k = 1:10, WSS = wss), aes(x = k, y = WSS)) +
  geom_line() + geom_point() +
  labs(title = "Elbow Method for Optimal K", x = "Number of Clusters (k)", y = "Within-Cluster Sum of Squares") +
  theme_minimal()
print(p31)

optimal_k <- which.min(diff(wss)) + 1  # k reached by the largest single drop in WSS
print(paste("Suggested Optimal K (Q31):", optimal_k))

## [1] "Suggested Optimal K (Q31): 2"

Interpretation:

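The elbow method plots within-cluster sum of squares against k to guide the choice of cluster count.
The heuristic in the code returns k = 2 because it selects the k reached by the largest single drop in WSS, which is almost always the step from k = 1 to k = 2.
The k = 3 used in the clustering questions therefore rests on a visual reading of the bend in the curve and should be treated as a judgment call.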
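
The difference between "largest drop" and "sharpest bend" is easier to see on a small example. The sketch below uses invented WSS values and compares the first-difference rule with a second-difference rule that looks for the point where the curve flattens most. The second-difference rule is one simple alternative among several, not an established standard.

```r
# Made-up, strictly decreasing WSS values with a visible bend at k = 3
wss <- c(1000, 520, 180, 150, 130, 115, 105, 98, 93, 90)

# Rule 1: the k reached by the largest single drop
largest_drop_k <- which.min(diff(wss)) + 1

# Rule 2: the k where the slope flattens the most.
# diff(..., differences = 2)[i] is the second difference centred on k = i + 1
curvature <- diff(wss, differences = 2)
elbow_k <- which.max(curvature) + 1

cat("Largest drop ends at k =", largest_drop_k, "\n")
cat("Sharpest bend at k =", elbow_k, "\n")
```

On these values the first rule gives k = 2 and the second gives k = 3, which illustrates how the two rules can disagree on the same curve.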

Q32: Silhouette analysis for k=3 clusters (injury data)

# Use only the numeric features; cluster_data1 also holds the label column added in Q29
sil <- silhouette(kmeans_result1$cluster, dist(cluster_data1 %>% select(-cluster)))
plot(sil, main = "Silhouette Plot for K=3 Clusters")

print("Silhouette Analysis (Q32): Average Silhouette Width")

## [1] "Silhouette Analysis (Q32): Average Silhouette Width"

print(mean(sil[, 3]))

## [1] 0.913549

Interpretation:

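Silhouette analysis measures how well each incident fits its assigned cluster.
The printed average width of 0.91 indicates well-separated clusters, though this largely reflects FatalityRate being concentrated near 0% and 100% and dominating the unscaled distances.
It is an internal check only and does not show that the clusters correspond to meaningful severity groups.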
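
A single silhouette value is easier to judge when compared across several values of k. The sketch below computes the average width for k = 2 to 5 on simulated data with three well-separated groups. It assumes the `cluster` package (shipped with standard R installations), and the `toy` matrix is invented.

```r
library(cluster)

# Simulated 2-D data with three well-separated groups (invented values)
set.seed(123)
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 8), ncol = 2),
  cbind(rnorm(30, mean = 0), rnorm(30, mean = 8))
)

d <- dist(toy)  # pairwise distances, computed once and reused for every k

# Average silhouette width for each candidate k
avg_width <- sapply(2:5, function(k) {
  km <- kmeans(toy, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_width) <- 2:5
print(round(avg_width, 3))

best_k <- as.integer(names(which.max(avg_width)))
cat("k with the highest average silhouette width:", best_k, "\n")
```

Choosing the k with the highest average width complements the elbow plot, since it rewards both tight clusters and clear separation.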

—————————————————————-

11. Classification (KNN)

—————————————————————-

Q33: KNN to classify IncidentSeverity based on FatalityRate & NumberOfEngines_clean using jittering

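# Step 1: Prepare features, carrying the label through the same pipeline
# (dplyr resets row names, so they cannot be used later to look up labels)
knn_data <- crash_data %>%
  select(IncidentSeverity, FatalityRate, NumberOfEngines_clean) %>%
  mutate(
    FatalityRate = as.numeric(FatalityRate),
    NumberOfEngines_clean = as.numeric(NumberOfEngines_clean)
  ) %>%
  drop_na()

# Step 2: Split labels from features only after filtering, so rows stay aligned
labels_knn <- knn_data$IncidentSeverity
features_knn <- knn_data %>% select(-IncidentSeverity)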

# Step 3: Add jitter to prevent identical values (avoid KNN ties)
set.seed(123)
features_knn <- features_knn %>%
  mutate(
    FatalityRate = FatalityRate + runif(n(), -0.001, 0.001),
    NumberOfEngines_clean = NumberOfEngines_clean + runif(n(), -0.01, 0.01)
  )

# Step 4: Normalize features (essential for KNN)
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
features_knn <- features_knn %>% mutate(across(everything(), normalize))

# Step 5: Train-test split (70/30)
set.seed(123)
n <- nrow(features_knn)
train_idx <- sample(1:n, 0.7 * n)

train_features <- features_knn[train_idx, ]
test_features  <- features_knn[-train_idx, ]

train_labels <- labels_knn[train_idx]
test_labels  <- labels_knn[-train_idx]

# Step 6: Run KNN (no ties now)
knn_pred <- knn(
  train = train_features,
  test = test_features,
  cl = train_labels,
  k = 5
)

# Step 7: Evaluate Model
conf_mat <- table(Predicted = knn_pred, Actual = test_labels)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat) * 100

# Print results
print("KNN Confusion Matrix:")

## [1] "KNN Confusion Matrix:"

print(conf_mat)

##               Actual
## Predicted      Catastrophic Minor No Injury Serious
##   Catastrophic          187    99       479      94
##   Minor                  88    65       219      56
##   No Injury             970   611      2476     458
##   Serious                60    34       141      29

cat("Accuracy:", round(accuracy, 2), "%\n")

## Accuracy: 45.45 %

Interpretation:

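KNN classified IncidentSeverity from FatalityRate and NumberOfEngines_clean.
The printed accuracy of 45.45% is below the 54.65% obtained by always predicting the most common class (No Injury, 3315 of 6066 test cases), so the model in this run has no predictive value.
Given how sharply FatalityRate separates severity levels in Q20, such a result points to features and labels being misaligned in the run that produced this output; the classification should be re-run with aligned labels before more features are added.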

Q34: KNN Classification for AtAirport (Binary Classification)

# 1. Select features
features <- crash_data %>%
  select(FatalityRate, NumberOfEngines_clean, Latitude) %>%
  mutate(across(everything(), as.numeric)) %>%
  drop_na()

labels <- crash_data$AtAirport[complete.cases(crash_data[, c("FatalityRate", "NumberOfEngines_clean", "Latitude")])]

# 2. Normalize + tiny jitter (tie fix)
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
set.seed(123)
features <- features %>%
  mutate(
    FatalityRate = normalize(FatalityRate) + runif(n(), -0.002, 0.002),
    NumberOfEngines_clean = normalize(NumberOfEngines_clean) + runif(n(), -0.01, 0.01),
    Latitude = normalize(Latitude) + runif(n(), -0.002, 0.002)
  )

# 3. Train-test split (70/30)
set.seed(123)
n <- nrow(features)
train_idx <- sample(1:n, 0.7 * n)

train_feat <- features[train_idx, ]
test_feat  <- features[-train_idx, ]
train_lab  <- labels[train_idx]
test_lab   <- labels[-train_idx]

# 4. KNN
knn_pred <- knn(train_feat, test_feat, train_lab, k = 5)

# 5. Results
conf_mat <- table(Predicted = knn_pred, Actual = test_lab)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat) * 100

print(conf_mat)

##          Actual
## Predicted FALSE TRUE
##     FALSE     0    0
##     TRUE     25 6041

cat("Accuracy:", round(accuracy, 2), "%\n")

## Accuracy: 99.59 %

Interpretation:

The binary KNN predicts AtAirport = TRUE for every test case.
The 99.59% accuracy therefore equals the share of TRUE cases in the test set (6041 of 6066), and none of the 25 FALSE cases is identified.
The result reflects class imbalance, not spatial predictability, and does not show that the features distinguish airport from non-airport incidents.
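
With classes this imbalanced, accuracy needs to be compared with the majority-class baseline. The sketch below rebuilds the confusion matrix printed above and derives the baseline, the recall for the FALSE class, and balanced accuracy. The variable names are illustrative.

```r
# Confusion matrix copied from the Q34 output (rows = Predicted, cols = Actual)
conf_mat <- matrix(
  c(0, 25,     # Actual FALSE: predicted FALSE, predicted TRUE
    0, 6041),  # Actual TRUE:  predicted FALSE, predicted TRUE
  nrow = 2,
  dimnames = list(Predicted = c("FALSE", "TRUE"), Actual = c("FALSE", "TRUE"))
)

accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Accuracy of always predicting the most common actual class
baseline <- max(colSums(conf_mat)) / sum(conf_mat)

# Recall per class: correct predictions divided by actual cases in that class
recall <- diag(conf_mat) / colSums(conf_mat)
balanced_accuracy <- mean(recall)

cat("Accuracy:         ", round(accuracy * 100, 2), "%\n")
cat("Majority baseline:", round(baseline * 100, 2), "%\n")
cat("Recall (FALSE):   ", recall[["FALSE"]], "\n")
cat("Balanced accuracy:", balanced_accuracy, "\n")
```

Accuracy and baseline coincide, and the balanced accuracy of 0.5 matches what a constant prediction would achieve.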

—————————————————————-

12. Association Rule Mining (Apriori)

—————————————————————-

Q35: Association rules between PurposeOfFlight, WeatherCondition, AirCraftDamage, and IncidentSeverity

rules_data <- crash_data %>%
  select(PurposeOfFlight_Full, WeatherCondition, AirCraftDamage, IncidentSeverity) %>%
  mutate(across(everything(), as.factor)) %>%
  na.omit()

trans_list <- split(rules_data, seq(nrow(rules_data)))
trans_list <- lapply(trans_list, function(x) {
  as.character(unlist(x))
})

transactions <- as(trans_list, "transactions")

## Warning in asMethod(object): removing duplicated items in transactions

rules <- apriori(
  transactions,
  parameter = list(supp = 0.005, conf = 0.6, maxlen = 4)
)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5   0.005      1
##  maxlen target  ext
##       4  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 222 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[54 item(s), 44507 transaction(s)] done [0.01s].
## sorting and recoding items ... [20 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [201 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

top_rules <- sort(rules, by = "lift")[1:15]
inspect(top_rules)

##      lhs                                  rhs            support     confidence
## [1]  {Catastrophic, IMC}               => {Destroyed}    0.021007931 0.6885125 
## [2]  {Catastrophic, IMC, Personal}     => {Destroyed}    0.016289572 0.6782039 
## [3]  {Destroyed, IMC, Personal}        => {Catastrophic} 0.016289572 0.9564644 
## [4]  {Destroyed, IMC}                  => {Catastrophic} 0.021007931 0.9482759 
## [5]  {Destroyed, Unknown}              => {Catastrophic} 0.006291145 0.8777429 
## [6]  {Destroyed, Personal}             => {Catastrophic} 0.078684252 0.8026587 
## [7]  {Destroyed}                       => {Catastrophic} 0.108140293 0.7881120 
## [8]  {Destroyed, Personal, VMC}        => {Catastrophic} 0.060956703 0.7666007 
## [9]  {Destroyed, VMC}                  => {Catastrophic} 0.085020334 0.7539350 
## [10] {Destroyed, Instructional}        => {Catastrophic} 0.005706967 0.7383721 
## [11] {Destroyed, Instructional, VMC}   => {Catastrophic} 0.005347473 0.7300613 
## [12] {IMC, Personal}                   => {Catastrophic} 0.024018694 0.6258782 
## [13] {IMC}                             => {Catastrophic} 0.030512054 0.6024845 
## [14] {Instructional, Substantial, VMC} => {No Injury}    0.077246276 0.7343016 
## [15] {Instructional, Substantial}      => {No Injury}    0.077740580 0.7321202 
##      coverage    lift     count
## [1]  0.030512054 5.017787  935 
## [2]  0.024018694 4.942660  725 
## [3]  0.017031029 4.604582  725 
## [4]  0.022153819 4.565161  935 
## [5]  0.007167412 4.225604  280 
## [6]  0.098029523 3.864135 3502 
## [7]  0.137214371 3.794105 4813 
## [8]  0.079515582 3.690546 2713 
## [9]  0.112768778 3.629571 3784 
## [10] 0.007729121 3.554649  254 
## [11] 0.007324690 3.514639  238 
## [12] 0.038375986 3.013084 1069 
## [13] 0.050643719 2.900463 1358 
## [14] 0.105196935 1.345585 3438 
## [15] 0.106185544 1.341587 3460

Interpretation:

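Association rule mining identified co-occurrence among PurposeOfFlight, WeatherCondition, AirCraftDamage, and IncidentSeverity.
The highest-lift rule is {Catastrophic, IMC} => {Destroyed} (confidence 0.69, lift 5.02), and {IMC, Personal} => {Catastrophic} holds with confidence 0.63 and lift 3.01, although it covers only 2.4% of incidents.
Items are pooled across columns without a column prefix, so a label such as Unknown cannot be attributed to purpose or weather, and rules like {Destroyed} => {Catastrophic} largely restate the overlap between the two outcome variables.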
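
The support, confidence, coverage, and lift columns are related by simple ratios, which makes individual rules easy to check by hand. The sketch below recomputes them for rule [13], {IMC} => {Catastrophic}, using only numbers printed in the output above. The variable names are illustrative.

```r
# Values copied from the Q35 output for rule [13]: {IMC} => {Catastrophic}
n_total    <- 44507        # total transactions
n_both     <- 1358         # transactions containing IMC and Catastrophic
coverage   <- 0.050643719  # share of transactions containing the LHS (IMC)
confidence <- 0.6024845
lift       <- 2.900463

# support = P(LHS and RHS)
support <- n_both / n_total

# Number of transactions containing the LHS, recovered from coverage
n_lhs <- round(coverage * n_total)

# confidence = P(RHS | LHS)
conf_check <- n_both / n_lhs

# lift = confidence / P(RHS), so P(RHS) = confidence / lift
rhs_share <- confidence / lift

cat("support:", round(support, 6), "\n")
cat("IMC transactions:", n_lhs, "\n")
cat("confidence:", round(conf_check, 6), "\n")
cat("overall share of Catastrophic:", round(rhs_share, 4), "\n")
```

The last figure shows that about 21% of all incidents are Catastrophic, compared with about 60% of IMC incidents, and that ratio is the lift of 2.9.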

Q36: Association rules for high FatalityRate (>50%) incidents

high_fatality_data <- crash_data %>%
  filter(FatalityRate > 50) %>%
  select(PurposeOfFlight_Full, WeatherCondition, AirCraftDamage, IncidentSeverity) %>%
  mutate(across(everything(), as.factor)) %>%
  na.omit()

trans_list_high <- split(high_fatality_data, seq(nrow(high_fatality_data)))
trans_list_high <- lapply(trans_list_high, function(x) {
  as.character(unlist(x))
})

transactions_high <- as(trans_list_high, "transactions")

## Warning in asMethod(object): removing duplicated items in transactions

rules_high <- apriori(
  transactions_high,
  parameter = list(supp = 0.005, conf = 0.6, maxlen = 4)
)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5   0.005      1
##  maxlen target  ext
##       4  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 40 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[43 item(s), 8102 transaction(s)] done [0.00s].
## sorting and recoding items ... [20 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [176 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

top_rules_high <- sort(rules_high, by = "lift")[1:15]
inspect(top_rules_high)

##      lhs                                       rhs         support    
## [1]  {Business, IMC}                        => {Destroyed} 0.008639842
## [2]  {Business, Catastrophic, IMC}          => {Destroyed} 0.008639842
## [3]  {IMC, Positioning}                     => {Destroyed} 0.005677610
## [4]  {Catastrophic, IMC, Positioning}       => {Destroyed} 0.005677610
## [5]  {IMC}                                  => {Destroyed} 0.109479141
## [6]  {Catastrophic, IMC}                    => {Destroyed} 0.109479141
## [7]  {IMC, Personal}                        => {Destroyed} 0.085164157
## [8]  {Catastrophic, IMC, Personal}          => {Destroyed} 0.085164157
## [9]  {Business}                             => {Destroyed} 0.022463589
## [10] {Business, Catastrophic}               => {Destroyed} 0.022463589
## [11] {Flight Test}                          => {VMC}       0.011231795
## [12] {Destroyed, Flight Test}               => {VMC}       0.006418168
## [13] {Catastrophic, Flight Test}            => {VMC}       0.011231795
## [14] {Catastrophic, Destroyed, Flight Test} => {VMC}       0.006418168
## [15] {Positioning}                          => {Destroyed} 0.016539126
##      confidence coverage    lift     count
## [1]  0.7777778  0.011108368 1.421190  70  
## [2]  0.7777778  0.011108368 1.421190  70  
## [3]  0.7540984  0.007529005 1.377922  46  
## [4]  0.7540984  0.007529005 1.377922  46  
## [5]  0.7011858  0.156134288 1.281238 887  
## [6]  0.7011858  0.156134288 1.281238 887  
## [7]  0.6955645  0.122438904 1.270966 690  
## [8]  0.6955645  0.122438904 1.270966 690  
## [9]  0.6893939  0.032584547 1.259691 182  
## [10] 0.6893939  0.032584547 1.259691 182  
## [11] 1.0000000  0.011231795 1.212693  91  
## [12] 1.0000000  0.006418168 1.212693  52  
## [13] 1.0000000  0.011231795 1.212693  91  
## [14] 1.0000000  0.006418168 1.212693  52  
## [15] 0.6536585  0.025302394 1.194394 134

Interpretation:

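The dataset was filtered to incidents with fatality rate > 50% (8,102 transactions).
All lifts are modest (1.19 to 1.42), and rules that differ only by Catastrophic on the left-hand side have identical counts, so they are duplicates. IMC raises the share of destroyed aircraft to about 70% against a baseline of about 55%, but IMC accounts for only 15.6% of these incidents, and roughly 82% occurred in VMC.
IMC is therefore associated with more destructive outcomes within this subset, but it does not dominate high-fatality incidents.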

Q37: Association rules for incidents in cities with the highest accident counts

# Top 5 cities by accident count (named top_cities so the Q24 top_states object is not overwritten)
top_cities <- crash_data %>%
  count(City, sort = TRUE) %>%
  head(5) %>%
  pull(City)

city_data <- crash_data %>%
  filter(City %in% top_cities) %>%
  select(PurposeOfFlight_Full, WeatherCondition, AirCraftDamage, IncidentSeverity) %>%
  mutate(across(everything(), as.factor)) %>%
  na.omit()

trans_list_state <- split(city_data, seq(nrow(city_data)))
trans_list_state <- lapply(trans_list_state, function(x) {
  as.character(unlist(x))
})

transactions_state <- as(trans_list_state, "transactions")

## Warning in asMethod(object): removing duplicated items in transactions

rules_state <- apriori(
  transactions_state,
  parameter = list(supp = 0.005, conf = 0.6, maxlen = 4)
)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5   0.005      1
##  maxlen target  ext
##       4  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 3 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[34 item(s), 699 transaction(s)] done [0.00s].
## sorting and recoding items ... [24 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [212 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

top_rules_state <- sort(rules_state, by = "lift")[1:15]
inspect(top_rules_state)

##      lhs                              rhs            support     confidence
## [1]  {Catastrophic, IMC, Personal} => {Destroyed}    0.007153076 0.7142857 
## [2]  {Catastrophic, IMC}           => {Destroyed}    0.010014306 0.7000000 
## [3]  {Destroyed, IMC}              => {Catastrophic} 0.010014306 1.0000000 
## [4]  {Destroyed, IMC, Personal}    => {Catastrophic} 0.007153076 1.0000000 
## [5]  {IMC, Serious}                => {None}         0.005722461 0.8000000 
## [6]  {IMC, Personal, Serious}      => {None}         0.005722461 0.8000000 
## [7]  {Destroyed, Personal}         => {Catastrophic} 0.025751073 0.8571429 
## [8]  {Destroyed, VMC}              => {Catastrophic} 0.025751073 0.8571429 
## [9]  {Destroyed}                   => {Catastrophic} 0.035765379 0.8333333 
## [10] {Destroyed, Personal, VMC}    => {Catastrophic} 0.018597997 0.8125000 
## [11] {IMC, None, Personal}         => {Serious}      0.005722461 0.8000000 
## [12] {Personal, Unknown}           => {None}         0.007153076 0.6250000 
## [13] {IMC, None}                   => {Serious}      0.005722461 0.6666667 
## [14] {None,None}                   => {No Injury}    0.005722461 1.0000000 
## [15] {Substantial,Minor}           => {No Injury}    0.005722461 1.0000000 
##      coverage    lift      count
## [1]  0.010014306 16.642857  5   
## [2]  0.014306152 16.310000  7   
## [3]  0.010014306 11.459016  7   
## [4]  0.007153076 11.459016  5   
## [5]  0.007153076 10.753846  4   
## [6]  0.007153076 10.753846  4   
## [7]  0.030042918  9.822014 18   
## [8]  0.030042918  9.822014 18   
## [9]  0.042918455  9.549180 25   
## [10] 0.022889843  9.310451 13   
## [11] 0.007153076  9.167213  4   
## [12] 0.011444921  8.401442  5   
## [13] 0.008583691  7.639344  4   
## [14] 0.005722461  1.389662  4   
## [15] 0.005722461  1.389662  4

Interpretation:

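Focused on the five cities with the most accidents (699 transactions) for association rule mining.
With a minimum support count of 3, the top rules rest on only 4 to 25 incidents each, so the high lifts (up to 16.6) are fragile and should be read as leads, not findings.
Items such as “None,None” and “Substantial,Minor” indicate AirCraftDamage values that were not fully cleaned before mining.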

End of aviation_project.Rmd

Conclusion

🔹 Data Preprocessing

Cleaned and standardized the aviation dataset by handling missing values, inconsistent text formats, and invalid entries.
Engineered new variables such as FatalityRate and IncidentSeverity to quantify accident severity and outcomes.

🔹 Key Findings from Exploratory Data Analysis (EDA)

Aviation incidents are concentrated within a few high-traffic states.
Smaller aircraft types (e.g., light airplanes and personal aircraft) report higher fatality averages.
Seasonal variation is observed, with certain months having noticeably higher incident frequencies.

🔹 Regression Insights

FatalityRate is associated with a combination of NumberOfEngines, WeatherCondition, and Latitude.
Amateur-built aircraft show a statistically significant higher fatality risk than factory-built ones.
Overall fatalities have declined over time, yet periodic spikes suggest further safety improvements are needed.
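
A multiple regression of this kind can be sketched with `lm()`. The example below uses synthetic data generated on the spot, not the NTSB records, and the effect sizes baked into the fake outcome are arbitrary. Only the model formula mirrors the predictors named in this report.

```r
set.seed(42)  # synthetic data only; not the NTSB records
n <- 200
toy <- data.frame(
  WeatherCondition = factor(
    sample(c("VMC", "IMC"), n, replace = TRUE, prob = c(0.8, 0.2)),
    levels = c("VMC", "IMC")),
  PurposeOfFlight = factor(
    sample(c("Personal", "Business", "Instructional"), n, replace = TRUE)),
  NumberOfEngines = sample(1:2, n, replace = TRUE, prob = c(0.85, 0.15))
)

# Made-up outcome so the model has something to fit; clipped to [0, 1]
toy$FatalityRate <- pmin(pmax(
  0.10 + 0.30 * (toy$WeatherCondition == "IMC") -
    0.03 * (toy$NumberOfEngines - 1) + rnorm(n, sd = 0.15),
  0), 1)

# Multiple linear regression with categorical and numeric predictors
fit <- lm(FatalityRate ~ WeatherCondition + PurposeOfFlight + NumberOfEngines,
          data = toy)
print(summary(fit)$coefficients)
```

Because FatalityRate is bounded between 0 and 1, a linear model is only a first approximation; the coefficient table is still a convenient way to compare the direction and size of each predictor's association.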

🔹 ANOVA Findings

Statistically significant differences in FatalityRate exist across categories such as:
- AirCraftCategory
- WeatherCondition
- PurposeOfFlight
- IncidentSeverity

AircraftDamage severity is strongly and positively associated with FatalityRate: the more severe the damage category, the higher the average fatality rate.
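
A one-way ANOVA of FatalityRate against a categorical factor can be sketched with `aov()`. The data below are synthetic, and the three damage levels and their group means are assumptions chosen only to make the example run; they are not results from the NTSB dataset.

```r
set.seed(1)  # synthetic data only; not the NTSB records
damage_levels <- c("Minor", "Substantial", "Destroyed")
toy <- data.frame(
  AircraftDamage = factor(rep(damage_levels, each = 30), levels = damage_levels)
)

# Assumed group means, used purely to generate illustrative data
group_mean <- c(Minor = 0.02, Substantial = 0.10, Destroyed = 0.70)
toy$FatalityRate <- pmin(pmax(
  rnorm(nrow(toy),
        mean = group_mean[as.character(toy$AircraftDamage)],
        sd = 0.10),
  0), 1)

# One-way ANOVA: does mean FatalityRate differ across damage levels?
fit <- aov(FatalityRate ~ AircraftDamage, data = toy)
anova_table <- summary(fit)[[1]]
print(anova_table)

# p-value for the AircraftDamage term
p_value <- anova_table[["Pr(>F)"]][1]
```

A significant F-test only says that at least one group mean differs; a follow-up such as `TukeyHSD(fit)` is needed to see which pairs of levels differ.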

🔹 Geographical and Temporal Trends

State-wise analysis identifies consistent accident hotspots.
Incidents in the Southern Hemisphere show higher average fatal injuries than those in the Northern Hemisphere.
Over time, aviation safety has gradually improved, supported by enhanced technologies and regulations.

🔹 Engine Count and Weather Influence

Number of Engines: Has a slight effect on accident outcomes; multi-engine aircraft show better survivability.
Weather Conditions: Incidents under IMC (Instrument Meteorological Conditions) show markedly higher fatality rates than those under VMC (Visual Meteorological Conditions).
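
One way to put a number on the IMC versus VMC gap is a two-sample t-test on FatalityRate. The sketch below uses synthetic values with assumed group means and sizes, so its output says nothing about the real data; it only shows the shape of the comparison.

```r
set.seed(7)  # synthetic data only; group means and sizes are assumptions
imc <- pmin(pmax(rnorm(40,  mean = 0.45, sd = 0.25), 0), 1)
vmc <- pmin(pmax(rnorm(160, mean = 0.15, sd = 0.20), 0), 1)

# t.test() runs Welch's test by default (var.equal = FALSE),
# which does not assume equal variances in the two groups
res <- t.test(imc, vmc)
print(res)

# Difference in mean FatalityRate, IMC minus VMC
diff_means <- unname(res$estimate[1] - res$estimate[2])
cat("difference in means:", round(diff_means, 3), "\n")
```

Reporting the difference in means with its confidence interval (`res$conf.int`) is more informative than the p-value alone.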

Summary

Across all analyses, multiple operational, environmental, and technical factors, including aircraft category, flight purpose, weather conditions, and damage levels, are strongly associated with aviation accident outcomes.
This data-driven study underscores the need for evidence-based interventions to reduce aviation risks and improve safety standards.

Key Takeaways and Insights

Aircraft Category Matters:
Smaller and personal aircraft exhibit a higher tendency for severe or fatal incidents.
Flight Purpose Impact:
Personal and non-commercial flights consistently record higher accident severity compared to business or public operations.
Weather Conditions are Critical:
IMC (poor-visibility) conditions significantly raise the probability of severe or fatal crashes.
Aircraft Damage Severity:
Greater damage correlates strongly with higher fatality rates, a clear physical indicator of crash severity.
Geographical Variations:
Certain U.S. states show clustering of incidents, likely reflecting higher flight density or environmental factors.
Engine Count Influence:
More engines correlate slightly with better survivability, which is consistent with engine redundancy improving safety.
Multi-Factor Relationship:
The combination of WeatherCondition, PurposeOfFlight, and NumberOfEngines predicts fatality outcomes better than any one of these variables alone.
Temporal Patterns:
Despite long-term safety improvements, occasional fatality spikes highlight continuing vulnerabilities in small aircraft and adverse weather operations.

References

National Transportation Safety Board (NTSB): https://www.ntsb.gov
R Packages Used: tidyverse, ggplot2, class, cluster, arules, dplyr, stats
Dataset: NTSB Aviation Accident Database (U.S. Domestic Incidents, 1982–2024)
Documentation: R Core Team (2024), R: A Language and Environment for Statistical Computing, Vienna, Austria

✈️ “Aviation safety is not achieved by chance — it’s achieved by data, insight, and continuous improvement.”

Aviation Incidents in US

Alok Ranjan

2025-11-10

Project Overview

Core Objective

Analytical Approach

Key Variables Analyzed

Project Scope

Outcome

Load the required libraries

—————————————————————-

Data Loading and Preprocessing

—————————————————————-

—————————————————————-

1. Data Cleaning & Preprocessing

—————————————————————-

Q1: Which columns have the most missing values, and how should they be handled?

Q2: What is the distribution of categorical and numerical columns in the aviation dataset?

—————————————————————-

2. Exploratory Data Analysis (EDA)

—————————————————————-

Q3: What are the annual trends in aviation accidents? Which years show a peak?

Q4: Which aircraft category has the highest number of fatalities?

Q5: Which state reports the highest number of aviation accidents?

—————————————————————-

3. Feature Engineering

—————————————————————-

Q6: How can the FatalityRate and IncidentSeverity columns be engineered?

Q7: What is the distribution of WeatherCondition and what is its impact?

—————————————————————-

4. Aggregation & Grouping

—————————————————————-

Q8: Are incidents more frequent in multi-engine aircraft?

Q9: Which manufacturer has the highest number of fatal incidents per 100 aircraft registered?

Q10: Which flight purpose has the highest accident rate?

—————————————————————-

5. Survival and Reporting Rates

—————————————————————-

Q11: What is the overall survival rate of aviation incidents?

Q12: What percentage of incidents have an official report published?

—————————————————————-

6. Root Cause Analysis

—————————————————————-

Q13: What are the most common causes of aviation accidents?

—————————————————————-

7. Advanced Grouping & Hypothesis Testing

—————————————————————-

Q14: Are amateur-built aircraft more dangerous than factory-built aircraft? (t-test)

Q15: Is there a significant difference in fatal injuries between Northern and Southern Hemisphere? (t-test)

Q16: Do incidents at airports have a higher fatality rate than incidents elsewhere?

—————————————————————-

8. Regression and Statistical Testing

—————————————————————-

Q17: Is there a relationship between NumberOfEngines and FatalityRate?

Q18: Can FatalityRate be predicted using WeatherCondition, PurposeOfFlight, and NumberOfEngines?

Q19: What is the effect of Latitude on FatalityRate?

Q20: Is there a progressive increase in FatalityRate across IncidentSeverity levels? (ANOVA)

Q21: Does AirCraftDamage severity correlate with higher FatalityRates? (ANOVA)

—————————————————————-

9. Visualization

—————————————————————-

Q22: Monthly patterns in accidents

Q23: FatalityRate distribution by aircraft category

Q24: Bar chart of top 10 states by total accidents

Q25: Scatter plot of total fatalities vs. year (trend over time)

Q26: Boxplot of FatalityRate by PurposeOfFlight_Full (top 5 purposes)

Q27: Heatmap of accidents by EventMonth and AirCraftCategory

—————————————————————-

10. Clustering (K-Means)

—————————————————————-

Q28: K-Means clustering based on FatalityRate and injury counts

Q29: K-Means on FatalityRate, Serious/Minor Injuries

Q30: K-Means on spatial features (Year, Latitude, Longitude) for regional patterns

Q31: Elbow method to determine optimal number of clusters (using injury data)

Q32: Silhouette analysis for k=3 clusters (injury data)

—————————————————————-

11. Classification (KNN)

—————————————————————-

Q33: KNN to classify IncidentSeverity based on FatalityRate & NumberOfEngines_clean using jittering

Q34: KNN Classification for AtAirport (Binary Classification)