This project presents an end-to-end data-driven analysis of
aviation accidents in the United States, using real-world data
from the National Transportation Safety Board
(NTSB).
The study focuses on exploring accident patterns, severity factors, and
the statistical relationships between aircraft, weather, operational
conditions, and outcomes.
The objective is to uncover actionable insights into the causes, frequency, and severity of aviation incidents, and to identify the factors that significantly influence accident outcomes.
The project employs a systematic approach involving:
1. Data Cleaning & Preprocessing: Handling missing data, standardizing formats, and engineering critical variables such as FatalityRate and IncidentSeverity.
2. Exploratory Data Analysis (EDA): Understanding patterns through visualizations and grouping by aircraft type, location, and weather conditions.
3. Statistical Testing (t-test, Regression, ANOVA): Examining relationships between continuous and categorical variables like:
   - Engine count vs fatality rate
   - Weather and purpose of flight vs accident severity
   - Aircraft damage levels and their impact on fatalities
4. Clustering (K-Means): Grouping incidents by severity and injury counts to identify natural patterns and high-risk clusters.
5. Classification (KNN): Predicting severity categories and incident occurrence at airports using normalized feature data.
6. Association Rule Mining (Apriori): Discovering interrelated patterns between flight purpose, weather, and damage type to reveal root-cause linkages.
The project spans 34 analytical questions covering:
- Accident trends and distribution across years and locations
- Effect of flight purpose, weather, and aircraft type on
fatalities
- Statistical modeling and clustering for predictive insights
- Rule-based analysis to uncover hidden correlations
This comprehensive study bridges exploratory, statistical, and
predictive analytics to understand aviation safety from multiple
dimensions.
Findings aim to support aviation regulators, operators, and
policymakers by highlighting risk patterns and suggesting
evidence-based strategies for safer flight
operations.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(stringr)
library(cluster)
library(class)
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
##
## Attaching package: 'arules'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(ggplot2)
# Load dataset, adjust the file path as necessary
crash_data <- read_csv("C:/PaNDa/not possible/CAP_482/Project_DataSets/aviation.csv")
## Rows: 44507 Columns: 36
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (26): NtsbNo, EventType, City, State, Country, ReportNo, N, ReportType,...
## dbl (6): Mkey, FatalInjuryCount, SeriousInjuryCount, MinorInjuryCount, Lat...
## lgl (2): HasSafetyRec, RepGenFlag
## dttm (2): EventDate, OriginalPublishDate
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Remove columns with excessive missing data
crash_data <- crash_data %>%
  select(-c(DocketUrl, DocketPublishDate))
# Fill missing numeric values with the column mean
crash_data <- crash_data %>%
  mutate(across(where(is.numeric), ~ replace_na(., mean(., na.rm = TRUE))))
# Fill missing character values with the mode (most frequent non-missing value)
impute_mode <- function(x) {
  mode_val <- names(sort(table(x), decreasing = TRUE))[1]
  replace_na(x, mode_val)
}
crash_data <- crash_data %>%
  mutate(across(where(is.character), impute_mode))
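As a small illustration of the mode imputation step above, the following base-R sketch applies the same idea to a made-up vector. The vector `weather` and its values are hypothetical and are not taken from the NTSB data. It also shows a side effect worth keeping in mind: every missing entry is assigned to the category that was already dominant.

```r
# Hypothetical example vector with missing values (illustrative only)
weather <- c("VMC", "VMC", NA, "IMC", NA, "VMC")

# Most frequent non-missing value; table() ignores NA by default
mode_val <- names(sort(table(weather), decreasing = TRUE))[1]

# Replace missing entries with the mode (base-R equivalent of replace_na)
weather_filled <- ifelse(is.na(weather), mode_val, weather)

print(mode_val)
print(table(weather))         # counts before imputation
print(table(weather_filled))  # counts after imputation
```

Comparing the two tables shows how much the dominant category grows, which can matter when imputed columns are later used for grouping or counting.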
# Standardize manufacturer names to uppercase
crash_data <- crash_data %>%
mutate(Make = toupper(Make))
# Clean PurposeOfFlight: trim commas, keep first category, replace empty with NA
crash_data <- crash_data %>%
mutate(PurposeOfFlight = str_replace_all(PurposeOfFlight, "^,+|,+$", "")) %>%
mutate(PurposeOfFlight = str_split(PurposeOfFlight, ",", simplify = TRUE)[,1]) %>%
mutate(PurposeOfFlight = na_if(PurposeOfFlight, ""))
# Impute missing PurposeOfFlight with mode
mode_purpose <- names(sort(table(crash_data$PurposeOfFlight), decreasing = TRUE))[1]
crash_data <- crash_data %>%
mutate(PurposeOfFlight = replace_na(PurposeOfFlight, mode_purpose))
# Map PurposeOfFlight codes to full descriptive names
crash_data <- crash_data %>%
mutate(PurposeOfFlight_Full = case_when(
PurposeOfFlight == "PERS" ~ "Personal",
PurposeOfFlight == "INST" ~ "Instructional",
PurposeOfFlight == "POSI" ~ "Positioning",
PurposeOfFlight == "FERY" ~ "Ferry",
PurposeOfFlight == "OWRK" ~ "Other Work",
PurposeOfFlight == "UNK" ~ "Unknown",
PurposeOfFlight == "PUBF" ~ "Public Flight",
PurposeOfFlight == "FLTS" ~ "Flight Test",
PurposeOfFlight == "AOBV" ~ "Aerial Observation/Surveillance",
PurposeOfFlight == "PUBS" ~ "Public Service",
PurposeOfFlight == "BUS" ~ "Business",
PurposeOfFlight == "AAPL" ~ "Aerial Application",
PurposeOfFlight == "EXLD" ~ "Excluded (possibly military/confidential)",
PurposeOfFlight == "PUBU" ~ "Public Utility",
PurposeOfFlight == "SKYD" ~ "Skydiving",
PurposeOfFlight == "ASHO" ~ "Aerial Show",
PurposeOfFlight == "EXEC" ~ "Executive",
PurposeOfFlight == "FIRF" ~ "Firing",
PurposeOfFlight == "PUBL" ~ "Public",
PurposeOfFlight == "BANT" ~ "Banner/Advertising",
PurposeOfFlight == "GLDT" ~ "Gliding",
PurposeOfFlight == "ADRP" ~ "Aerial Drop (cargo, supplies, etc.)",
TRUE ~ "Other"
))
# Clean NumberOfEngines: extract first number, convert, impute missing/zero with mean, round
crash_data <- crash_data %>%
mutate(NumberOfEngines_clean = str_extract(NumberOfEngines, "^\\d+")) %>%
mutate(NumberOfEngines_clean = as.numeric(NumberOfEngines_clean))
mean_engines <- mean(crash_data$NumberOfEngines_clean[crash_data$NumberOfEngines_clean > 0], na.rm = TRUE)
crash_data <- crash_data %>%
mutate(
NumberOfEngines_clean = ifelse(is.na(NumberOfEngines_clean) | NumberOfEngines_clean == 0,
mean_engines, NumberOfEngines_clean)
) %>%
mutate(NumberOfEngines_clean = round(NumberOfEngines_clean, 0))
# Clean AirCraftCategory: uppercase, trim spaces and commas, map known codes, assign UNKNOWN for others
category_map <- c(
"AIR" = "AIRPLANE",
"HELI" = "HELICOPTER",
"UNMANNED" = "DRONE",
"PPAR" = "PARAGLIDER",
"GYRO" = "GYROCOPTER",
"GLI" = "GLIDER",
"BALL" = "BALLOON",
"WSFT" = "WEIGHT_SHIFT",
"ULTR" = "ULTRALIGHT",
"UNK" = "UNKNOWN",
"BLIM" = "BLIMP",
"PLFT" = "POWERED_LIFT"
)
crash_data <- crash_data %>%
mutate(
AirCraftCategory = str_to_upper(AirCraftCategory),
AirCraftCategory = str_replace_all(AirCraftCategory, "\\s+", ""),
AirCraftCategory = str_replace_all(AirCraftCategory, ",+", ","),
AirCraftCategory = str_replace_all(AirCraftCategory, "^,|,$", ""),
AirCraftCategory = ifelse(AirCraftCategory == "" | AirCraftCategory == ",", NA, AirCraftCategory),
AirCraftCategory = str_split(AirCraftCategory, ",", simplify = TRUE)[, 1],
AirCraftCategory = category_map[AirCraftCategory],
AirCraftCategory = ifelse(is.na(AirCraftCategory), "UNKNOWN", AirCraftCategory)
)
# Convert categorical columns to factors
categorical_cols <- c("AirCraftCategory", "WeatherCondition", "PurposeOfFlight", "PurposeOfFlight_Full",
"Make", "Model", "Country", "State", "ReportStatus")
crash_data <- crash_data %>%
mutate(across(all_of(categorical_cols), as.factor))
# Check missing value summary after preprocessing
missing_summary <- sapply(crash_data, function(x) sum(is.na(x)))
missing_df_sorted <- data.frame(Column = names(missing_summary), MissingCount = missing_summary) %>%
arrange(desc(MissingCount))
print(missing_df_sorted)
## Column MissingCount
## OriginalPublishDate OriginalPublishDate 6176
## NtsbNo NtsbNo 0
## EventType EventType 0
## Mkey Mkey 0
## EventDate EventDate 0
## City City 0
## State State 0
## Country Country 0
## ReportNo ReportNo 0
## N N 0
## HasSafetyRec HasSafetyRec 0
## ReportType ReportType 0
## HighestInjuryLevel HighestInjuryLevel 0
## FatalInjuryCount FatalInjuryCount 0
## SeriousInjuryCount SeriousInjuryCount 0
## MinorInjuryCount MinorInjuryCount 0
## ProbableCause ProbableCause 0
## Latitude Latitude 0
## Longitude Longitude 0
## Make Make 0
## Model Model 0
## AirCraftCategory AirCraftCategory 0
## AirportID AirportID 0
## AirportName AirportName 0
## AmateurBuilt AmateurBuilt 0
## NumberOfEngines NumberOfEngines 0
## Scheduled Scheduled 0
## PurposeOfFlight PurposeOfFlight 0
## FAR FAR 0
## AirCraftDamage AirCraftDamage 0
## WeatherCondition WeatherCondition 0
## Operator Operator 0
## ReportStatus ReportStatus 0
## RepGenFlag RepGenFlag 0
## PurposeOfFlight_Full PurposeOfFlight_Full 0
## NumberOfEngines_clean NumberOfEngines_clean 0
# Ensure AtAirport is created (add this if missing from preprocessing).
# AirportID was mode-imputed above, so it is never NA here and only the literal "None" maps to FALSE.
if (!"AtAirport" %in% names(crash_data)) {
  crash_data <- crash_data %>%
    mutate(AtAirport = factor(!is.na(AirportID) & AirportID != "None"))
}
print("AtAirport column created/verified.")
## [1] "AtAirport column created/verified."
# Save cleaned data for future use
write_csv(crash_data, "C:/PaNDa/not possible/CAP_482/Project_DataSets/aviation_cleaned.csv")
View(crash_data)
crash_data <- crash_data %>% select(-OriginalPublishDate)
missing_summary <- sapply(crash_data, function(x) sum(is.na(x)))
missing_df <- data.frame(Column = names(missing_summary), MissingCount = missing_summary)
missing_df_sorted <- missing_df[order(-missing_df$MissingCount), ]
print(missing_df_sorted)
## Column MissingCount
## NtsbNo NtsbNo 0
## EventType EventType 0
## Mkey Mkey 0
## EventDate EventDate 0
## City City 0
## State State 0
## Country Country 0
## ReportNo ReportNo 0
## N N 0
## HasSafetyRec HasSafetyRec 0
## ReportType ReportType 0
## HighestInjuryLevel HighestInjuryLevel 0
## FatalInjuryCount FatalInjuryCount 0
## SeriousInjuryCount SeriousInjuryCount 0
## MinorInjuryCount MinorInjuryCount 0
## ProbableCause ProbableCause 0
## Latitude Latitude 0
## Longitude Longitude 0
## Make Make 0
## Model Model 0
## AirCraftCategory AirCraftCategory 0
## AirportID AirportID 0
## AirportName AirportName 0
## AmateurBuilt AmateurBuilt 0
## NumberOfEngines NumberOfEngines 0
## Scheduled Scheduled 0
## PurposeOfFlight PurposeOfFlight 0
## FAR FAR 0
## AirCraftDamage AirCraftDamage 0
## WeatherCondition WeatherCondition 0
## Operator Operator 0
## ReportStatus ReportStatus 0
## RepGenFlag RepGenFlag 0
## PurposeOfFlight_Full PurposeOfFlight_Full 0
## NumberOfEngines_clean NumberOfEngines_clean 0
## AtAirport AtAirport 0
Interpretation:
The OriginalPublishDate column was removed, leaving no missing values in the dataset.
cat_cols <- names(crash_data)[sapply(crash_data, is.factor)]
num_cols <- names(crash_data)[sapply(crash_data, is.numeric)]
cat("Categorical columns:\n")
## Categorical columns:
print(cat_cols)
## [1] "State" "Country" "Make"
## [4] "Model" "AirCraftCategory" "PurposeOfFlight"
## [7] "WeatherCondition" "ReportStatus" "PurposeOfFlight_Full"
## [10] "AtAirport"
cat("Numerical columns:\n")
## Numerical columns:
print(num_cols)
## [1] "Mkey" "FatalInjuryCount" "SeriousInjuryCount"
## [4] "MinorInjuryCount" "Latitude" "Longitude"
## [7] "NumberOfEngines_clean"
Interpretation:
crash_data <- crash_data %>% mutate(Year = as.numeric(format(EventDate, "%Y")))
accident_rates_by_year <- crash_data %>% count(Year) %>% arrange(desc(n))
View(accident_rates_by_year)
Interpretation:
avg_fatalities_by_category <- crash_data %>%
group_by(AirCraftCategory) %>%
summarise(AvgFatalities = mean(FatalInjuryCount, na.rm = TRUE)) %>%
arrange(desc(AvgFatalities))
View(avg_fatalities_by_category)
Interpretation:
accidents_per_state <- crash_data %>%
group_by(State) %>%
summarise(TotalFatalities = sum(FatalInjuryCount, na.rm = TRUE)) %>%
arrange(desc(TotalFatalities)) %>%
head(10)
print(accidents_per_state)
## # A tibble: 10 × 2
## State TotalFatalities
## <fct> <dbl>
## 1 California 16400
## 2 Florida 860
## 3 Texas 836
## 4 New York 713
## 5 Alaska 507
## 6 Arizona 443
## 7 Colorado 418
## 8 Georgia 388
## 9 North Carolina 296
## 10 Utah 272
Interpretation:
crash_data <- crash_data %>%
  mutate(
    FatalityRate = (FatalInjuryCount / (FatalInjuryCount + SeriousInjuryCount + MinorInjuryCount)) * 100,
    IncidentSeverity = case_when(
      FatalInjuryCount > 0 ~ "Catastrophic",
      SeriousInjuryCount > 0 ~ "Serious",
      MinorInjuryCount > 0 ~ "Minor",
      TRUE ~ "No Injury"
    )
  ) %>%
  mutate(IncidentSeverity = as.factor(IncidentSeverity))
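To make the FatalityRate formula above concrete, the sketch below applies it to three made-up incidents. The data frame `toy` and its values are hypothetical. The third row shows what happens when an incident has no recorded injuries at all: the formula evaluates 0/0, which R returns as NaN.

```r
# Hypothetical toy incidents (illustrative values only)
toy <- data.frame(
  FatalInjuryCount   = c(2, 0, 0),
  SeriousInjuryCount = c(1, 3, 0),
  MinorInjuryCount   = c(1, 1, 0)
)

# Same formula as in the analysis: fatalities as a share of all injuries
total_injuries <- with(toy, FatalInjuryCount + SeriousInjuryCount + MinorInjuryCount)
toy$FatalityRate <- toy$FatalInjuryCount / total_injuries * 100

print(toy$FatalityRate)         # 50, 0, NaN
print(is.na(toy$FatalityRate))  # FALSE, FALSE, TRUE: NaN is treated as missing
```

Because is.na() is TRUE for NaN, functions such as lm() and aov() silently drop these rows under their default missing-value handling.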
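Interpretation:
Two variables were engineered: FatalityRate (fatalities as a percentage of total injuries) and IncidentSeverity (categorical severity). FatalityRate is undefined for incidents with no recorded injuries, which accounts for the 24288 observations dropped by the models below.
weather_dist <- crash_data %>%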
count(WeatherCondition) %>%
arrange(desc(n))
print(weather_dist)
## # A tibble: 3 × 2
## WeatherCondition n
## <fct> <int>
## 1 VMC 41905
## 2 IMC 2254
## 3 Unknown 348
Interpretation:
multi_engine_incidents <- crash_data %>%
  filter(!is.na(NumberOfEngines_clean)) %>%
  count(NumberOfEngines_clean) %>%
  arrange(desc(n))
print(multi_engine_incidents)
## # A tibble: 6 × 2
## NumberOfEngines_clean n
## <dbl> <int>
## 1 1 38830
## 2 2 5420
## 3 4 159
## 4 3 88
## 5 6 7
## 6 8 3
Interpretation:
fatal_incidents_per_manufacturer <- crash_data %>%
group_by(Make) %>%
summarise(Fatal_Incidents = sum(FatalInjuryCount, na.rm = TRUE)) %>%
arrange(desc(Fatal_Incidents)) %>%
head(10)
print(fatal_incidents_per_manufacturer)
## # A tibble: 10 × 2
## Make Fatal_Incidents
## <fct> <dbl>
## 1 BOEING 4950
## 2 CESSNA 4253
## 3 PIPER 2759
## 4 BEECH 1802
## 5 AIRBUS 1330
## 6 AIRBUS INDUSTRIE 1088
## 7 BELL 738
## 8 ROBINSON 548
## 9 MOONEY 285
## 10 MCDONNELL DOUGLAS 264
Interpretation:
accident_rate_by_purpose <- crash_data %>%
group_by(PurposeOfFlight_Full) %>%
summarise(Total_Incidents = n()) %>%
arrange(desc(Total_Incidents))
print(accident_rate_by_purpose)
## # A tibble: 22 × 2
## PurposeOfFlight_Full Total_Incidents
## <fct> <int>
## 1 Personal 31115
## 2 Instructional 5272
## 3 Aerial Application 1891
## 4 Business 1172
## 5 Positioning 968
## 6 Unknown 846
## 7 Other Work 693
## 8 Aerial Observation/Surveillance 466
## 9 Flight Test 464
## 10 Public Utility 242
## # ℹ 12 more rows
Interpretation:
survival_rate <- crash_data %>%
mutate(Survival = (1 - (FatalInjuryCount / (FatalInjuryCount + SeriousInjuryCount + MinorInjuryCount))) * 100)
cat("Average survival rate:", mean(survival_rate$Survival, na.rm = TRUE), "%\n")
## Average survival rate: 58.05775 %
Interpretation:
report_percentage <- crash_data %>%
summarise(ReportPublished = sum(!is.na(ReportStatus)) / n() * 100)
cat("Percentage with official report:", report_percentage$ReportPublished, "%\n")
## Percentage with official report: 100 %
Interpretation:
common_causes <- crash_data %>%
group_by(ProbableCause) %>%
summarise(Count = n()) %>%
arrange(desc(Count)) %>%
slice_head(n = 10)
View(common_causes)
Interpretation:
crash_data$AmateurBuilt <- case_when(
crash_data$AmateurBuilt == "TRUE" ~ TRUE,
crash_data$AmateurBuilt == "FALSE" ~ FALSE,
TRUE ~ NA
)
crash_data_amateur <- crash_data %>% filter(!is.na(AmateurBuilt))
t_amateur <- t.test(FatalInjuryCount ~ AmateurBuilt, data = crash_data_amateur)
print(t_amateur)
##
## Welch Two Sample t-test
##
## data: FatalInjuryCount by AmateurBuilt
## t = 10.462, df = 43580, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
## 0.2513898 0.3672959
## sample estimates:
## mean in group FALSE mean in group TRUE
## 0.6364457 0.3271028
Interpretation:
crash_data_hemi <- crash_data %>%
filter(!is.na(Latitude) & !is.na(FatalInjuryCount)) %>%
mutate(Hemisphere = ifelse(Latitude >= 0, "Northern", "Southern"))
t_hemi <- t.test(FatalInjuryCount ~ Hemisphere, data = crash_data_hemi)
print(t_hemi)
##
## Welch Two Sample t-test
##
## data: FatalInjuryCount by Hemisphere
## t = -6.1537, df = 899.67, p-value = 1.139e-09
## alternative hypothesis: true difference in means between group Northern and group Southern is not equal to 0
## 95 percent confidence interval:
## -1.6733180 -0.8640648
## sample estimates:
## mean in group Northern mean in group Southern
## 0.5848555 1.8535469
Interpretation:
fatality_airport <- crash_data %>%
filter(!is.na(AirportID) & AirportID != "None") %>%
group_by(AirportID) %>%
summarise(AvgFatalities = mean(FatalInjuryCount, na.rm = TRUE)) %>%
arrange(desc(AvgFatalities))
print(fatality_airport)
## # A tibble: 7,690 × 2
## AirportID AvgFatalities
## <chr> <dbl>
## 1 OPRN 157
## 2 FMCH 152
## 3 URSS 113
## 4 MUHA 112
## 5 OLBA 90
## 6 XUBS 89
## 7 CGK 62
## 8 RCQC 58
## 9 WIHH 44
## 10 ZYLD 42
## # ℹ 7,680 more rows
Interpretation:
model_engines <- lm(FatalityRate ~ NumberOfEngines_clean, data = crash_data)
summary(model_engines)
##
## Call:
## lm(formula = FatalityRate ~ NumberOfEngines_clean, data = crash_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -76.98 -41.31 -41.31 58.69 58.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.2128 1.1065 32.727 < 2e-16 ***
## NumberOfEngines_clean 5.0958 0.9381 5.432 5.63e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.59 on 20217 degrees of freedom
## (24288 observations deleted due to missingness)
## Multiple R-squared: 0.001458, Adjusted R-squared: 0.001408
## F-statistic: 29.51 on 1 and 20217 DF, p-value: 5.628e-08
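The regression output above combines a very small p-value with a very small R-squared. The following sketch, on simulated data rather than the NTSB data, shows how that combination arises with a large sample: a weak but real slope is detected with high confidence even though it explains little of the variance. The variable names, sample size, and slope are assumptions chosen for illustration.

```r
# Simulated data (hypothetical): a weak true effect and a large sample
set.seed(1)
n <- 20000
x <- rnorm(n)
y <- 0.1 * x + rnorm(n)

fit <- lm(y ~ x)
s <- summary(fit)

# Pull the slope's p-value and the model R-squared from the summary object
slope_p <- coef(s)["x", "Pr(>|t|)"]
r2 <- s$r.squared

cat("Slope p-value:", format(slope_p, digits = 3), "\n")
cat("R-squared:", round(r2, 4), "\n")
```

Reading both numbers together helps separate "the effect is probably not zero" from "the predictor is useful for prediction".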
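Interpretation:
The regression of FatalityRate on NumberOfEngines quantifies how propulsion relates to safety. Each additional engine is associated with a fatality rate about 5.1 percentage points higher (p < 0.001), but R-squared is only about 0.0015, so engine count explains almost none of the variation.
model_predict <- lm(FatalityRate ~ WeatherCondition + PurposeOfFlight_Full + NumberOfEngines_clean, data = crash_data)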
summary(model_predict)
##
## Call:
## lm(formula = FatalityRate ~ WeatherCondition + PurposeOfFlight_Full +
## NumberOfEngines_clean, data = crash_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -95.27 -39.38 -29.25 60.62 83.16
##
## Coefficients:
## Estimate
## (Intercept) 65.5886
## WeatherConditionUnknown -5.7228
## WeatherConditionVMC -33.4040
## PurposeOfFlight_FullAerial Drop (cargo, supplies, etc.) 10.2931
## PurposeOfFlight_FullAerial Observation/Surveillance 4.4067
## PurposeOfFlight_FullAerial Show 21.4859
## PurposeOfFlight_FullBanner/Advertising -3.0696
## PurposeOfFlight_FullBusiness 9.2073
## PurposeOfFlight_FullExcluded (possibly military/confidential) 0.6202
## PurposeOfFlight_FullExecutive 9.8481
## PurposeOfFlight_FullFerry 12.9533
## PurposeOfFlight_FullFiring 21.4764
## PurposeOfFlight_FullFlight Test 6.3336
## PurposeOfFlight_FullGliding 9.6097
## PurposeOfFlight_FullInstructional -3.3235
## PurposeOfFlight_FullOther Work -3.4462
## PurposeOfFlight_FullPersonal 6.8011
## PurposeOfFlight_FullPositioning 9.5575
## PurposeOfFlight_FullPublic -15.7351
## PurposeOfFlight_FullPublic Flight 1.3841
## PurposeOfFlight_FullPublic Service -1.8319
## PurposeOfFlight_FullPublic Utility 2.8309
## PurposeOfFlight_FullSkydiving -3.1768
## PurposeOfFlight_FullUnknown 28.1063
## NumberOfEngines_clean 0.3932
## Std. Error
## (Intercept) 2.2535
## WeatherConditionUnknown 3.2673
## WeatherConditionVMC 1.1797
## PurposeOfFlight_FullAerial Drop (cargo, supplies, etc.) 17.5484
## PurposeOfFlight_FullAerial Observation/Surveillance 3.1107
## PurposeOfFlight_FullAerial Show 4.5137
## PurposeOfFlight_FullBanner/Advertising 6.1313
## PurposeOfFlight_FullBusiness 2.5660
## PurposeOfFlight_FullExcluded (possibly military/confidential) 5.4421
## PurposeOfFlight_FullExecutive 5.1243
## PurposeOfFlight_FullFerry 4.7559
## PurposeOfFlight_FullFiring 7.7735
## PurposeOfFlight_FullFlight Test 3.4129
## PurposeOfFlight_FullGliding 8.3265
## PurposeOfFlight_FullInstructional 1.9669
## PurposeOfFlight_FullOther Work 2.8492
## PurposeOfFlight_FullPersonal 1.6539
## PurposeOfFlight_FullPositioning 2.7796
## PurposeOfFlight_FullPublic 7.6679
## PurposeOfFlight_FullPublic Flight 6.7330
## PurposeOfFlight_FullPublic Service 7.9797
## PurposeOfFlight_FullPublic Utility 4.5328
## PurposeOfFlight_FullSkydiving 5.4442
## PurposeOfFlight_FullUnknown 2.5886
## NumberOfEngines_clean 0.9364
## t value Pr(>|t|)
## (Intercept) 29.105 < 2e-16
## WeatherConditionUnknown -1.752 0.079867
## WeatherConditionVMC -28.316 < 2e-16
## PurposeOfFlight_FullAerial Drop (cargo, supplies, etc.) 0.587 0.557507
## PurposeOfFlight_FullAerial Observation/Surveillance 1.417 0.156607
## PurposeOfFlight_FullAerial Show 4.760 1.95e-06
## PurposeOfFlight_FullBanner/Advertising -0.501 0.616624
## PurposeOfFlight_FullBusiness 3.588 0.000334
## PurposeOfFlight_FullExcluded (possibly military/confidential) 0.114 0.909270
## PurposeOfFlight_FullExecutive 1.922 0.054638
## PurposeOfFlight_FullFerry 2.724 0.006462
## PurposeOfFlight_FullFiring 2.763 0.005736
## PurposeOfFlight_FullFlight Test 1.856 0.063500
## PurposeOfFlight_FullGliding 1.154 0.248468
## PurposeOfFlight_FullInstructional -1.690 0.091088
## PurposeOfFlight_FullOther Work -1.210 0.226478
## PurposeOfFlight_FullPersonal 4.112 3.93e-05
## PurposeOfFlight_FullPositioning 3.438 0.000586
## PurposeOfFlight_FullPublic -2.052 0.040174
## PurposeOfFlight_FullPublic Flight 0.206 0.837123
## PurposeOfFlight_FullPublic Service -0.230 0.818430
## PurposeOfFlight_FullPublic Utility 0.625 0.532280
## PurposeOfFlight_FullSkydiving -0.584 0.559553
## PurposeOfFlight_FullUnknown 10.858 < 2e-16
## NumberOfEngines_clean 0.420 0.674561
##
## (Intercept) ***
## WeatherConditionUnknown .
## WeatherConditionVMC ***
## PurposeOfFlight_FullAerial Drop (cargo, supplies, etc.)
## PurposeOfFlight_FullAerial Observation/Surveillance
## PurposeOfFlight_FullAerial Show ***
## PurposeOfFlight_FullBanner/Advertising
## PurposeOfFlight_FullBusiness ***
## PurposeOfFlight_FullExcluded (possibly military/confidential)
## PurposeOfFlight_FullExecutive .
## PurposeOfFlight_FullFerry **
## PurposeOfFlight_FullFiring **
## PurposeOfFlight_FullFlight Test .
## PurposeOfFlight_FullGliding
## PurposeOfFlight_FullInstructional .
## PurposeOfFlight_FullOther Work
## PurposeOfFlight_FullPersonal ***
## PurposeOfFlight_FullPositioning ***
## PurposeOfFlight_FullPublic *
## PurposeOfFlight_FullPublic Flight
## PurposeOfFlight_FullPublic Service
## PurposeOfFlight_FullPublic Utility
## PurposeOfFlight_FullSkydiving
## PurposeOfFlight_FullUnknown ***
## NumberOfEngines_clean
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46.22 on 20194 degrees of freedom
## (24288 observations deleted due to missingness)
## Multiple R-squared: 0.05906, Adjusted R-squared: 0.05795
## F-statistic: 52.82 on 24 and 20194 DF, p-value: < 2.2e-16
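Interpretation:
The multiple regression tests how WeatherCondition, PurposeOfFlight, and NumberOfEngines predict FatalityRate. Relative to the IMC baseline, VMC is associated with a fatality rate about 33 percentage points lower (p < 2e-16), while NumberOfEngines is no longer significant once weather and purpose are included (p = 0.67). The model explains about 6% of the variance (R-squared = 0.059).
model_latitude <- lm(FatalityRate ~ Latitude, data = crash_data)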
summary(model_latitude)
##
## Call:
## lm(formula = FatalityRate ~ Latitude, data = crash_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.94 -41.94 -41.94 58.06 58.30
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.194e+01 3.349e-01 125.241 <2e-16 ***
## Latitude -9.913e-07 1.151e-06 -0.861 0.389
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.62 on 20217 degrees of freedom
## (24288 observations deleted due to missingness)
## Multiple R-squared: 3.668e-05, Adjusted R-squared: -1.278e-05
## F-statistic: 0.7416 on 1 and 20217 DF, p-value: 0.3892
Interpretation:
anova_severity <- aov(FatalityRate ~ IncidentSeverity, data = crash_data)
summary(anova_severity)
## Df Sum Sq Mean Sq F value Pr(>F)
## IncidentSeverity 2 42220284 21110142 117630 <2e-16 ***
## Residuals 20216 3627998 179
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 24288 observations deleted due to missingness
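Interpretation:
IncidentSeverity categories differ significantly in fatality rate (F = 117630, p < 2e-16). This is expected, since IncidentSeverity is derived from the same injury counts as FatalityRate.
crash_data$AirCraftDamage <- as.factor(crash_data$AirCraftDamage)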
anova_damage <- aov(FatalityRate ~ AirCraftDamage, data = crash_data)
summary(anova_damage)
## Df Sum Sq Mean Sq F value Pr(>F)
## AirCraftDamage 19 12612912 663837 403.5 <2e-16 ***
## Residuals 20199 33235370 1645
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 24288 observations deleted due to missingness
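The ANOVA above indicates that at least one damage level differs from the others, but not which ones. A common follow-up is a post-hoc pairwise comparison such as Tukey's HSD. The sketch below runs it on simulated data. The three damage labels, group sizes, and means are hypothetical and chosen only to make the example self-contained.

```r
# Simulated data (hypothetical levels and values)
set.seed(42)
toy <- data.frame(
  damage = factor(rep(c("Minor", "Substantial", "Destroyed"), each = 50)),
  rate   = c(rnorm(50, mean = 10, sd = 5),
             rnorm(50, mean = 12, sd = 5),
             rnorm(50, mean = 60, sd = 5))
)

# One-way ANOVA followed by Tukey's honest significant difference test
fit <- aov(rate ~ damage, data = toy)
tukey <- TukeyHSD(fit)

# One row per pair of levels: difference in means, confidence bounds, adjusted p-value
print(tukey$damage)
```

Applied to the real data, the same call would be TukeyHSD(anova_damage), although with many damage levels the number of pairwise comparisons grows quickly.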
Interpretation:
FatalityRate differs significantly across AirCraftDamage levels (F = 403.5, p < 2e-16).
crash_data <- crash_data %>%
mutate(EventMonth = as.numeric(format(EventDate, "%m")))
accidents_by_month <- crash_data %>%
count(EventMonth)
p22 <- ggplot(accidents_by_month, aes(x = factor(EventMonth, levels = 1:12), y = n)) +
geom_bar(stat = "identity", fill = "purple") +
labs(title = "Monthly Distribution of Aviation Incidents",
x = "Month", y = "Number of Incidents") +
theme_minimal()
print(p22)
Interpretation:
p23 <- ggplot(crash_data, aes(x = FatalityRate, fill = AirCraftCategory)) +
geom_histogram(binwidth = 5, alpha = 0.7, position = "dodge") +
labs(title = "Fatality Rate Distribution by Aircraft Category", x = "Fatality Rate (%)", y = "Count") +
scale_fill_brewer(palette = "Set3") +
theme_minimal()
print(p23)
## Warning: Removed 24288 rows containing non-finite outside the scale range
## (`stat_bin()`).
Interpretation:
The histogram shows the distribution of FatalityRate across aircraft categories.
top_states <- crash_data %>%
count(State, sort = TRUE) %>%
head(10)
p24 <- ggplot(top_states, aes(x = reorder(State, n), y = n)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Top 10 States by Number of Aviation Accidents", x = "State", y = "Number of Accidents") +
coord_flip() +
theme_minimal()
print(p24)
Interpretation:
fatalities_by_year <- crash_data %>%
group_by(Year) %>%
summarise(TotalFatalities = sum(FatalInjuryCount, na.rm = TRUE), .groups = "drop")
p25 <- ggplot(fatalities_by_year, aes(x = Year, y = TotalFatalities)) +
geom_point(color = "red", alpha = 0.6) +
geom_smooth(method = "loess", se = TRUE) +
labs(title = "Total Fatalities Trend Over Years", x = "Year", y = "Total Fatalities") +
theme_minimal()
print(p25)
## `geom_smooth()` using formula = 'y ~ x'
Interpretation:
top_purposes <- crash_data %>%
count(PurposeOfFlight_Full, sort = TRUE) %>%
head(5) %>%
pull(PurposeOfFlight_Full)
crash_subset <- crash_data %>% filter(PurposeOfFlight_Full %in% top_purposes)
p26 <- ggplot(crash_subset, aes(x = PurposeOfFlight_Full, y = FatalityRate)) +
geom_boxplot(fill = "lightgreen") +
labs(title = "Fatality Rate by Top 5 Flight Purposes", x = "Purpose of Flight", y = "Fatality Rate (%)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p26)
## Warning: Removed 22504 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Interpretation:
Boxplots of FatalityRate by the top 5 flight purposes were generated.
monthly_category <- crash_data %>%
count(EventMonth, AirCraftCategory) %>%
complete(EventMonth = 1:12, AirCraftCategory, fill = list(n = 0)) %>%
pivot_wider(names_from = AirCraftCategory, values_from = n, values_fill = 0)
p27 <- ggplot(crash_data %>% count(EventMonth, AirCraftCategory), aes(x = factor(EventMonth, levels = 1:12), y = AirCraftCategory, fill = n)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "lightblue", high = "darkred") +
labs(title = "Heatmap: Accidents by Month and Aircraft Category", x = "Month", y = "Aircraft Category", fill = "Count") +
theme_minimal()
print(p27)
Interpretation:
cluster_data <- crash_data %>%
select(FatalityRate, SeriousInjuryCount, MinorInjuryCount) %>%
filter(complete.cases(.))
set.seed(123)
kmeans_result <- kmeans(cluster_data, centers = 3)
cluster_data$cluster <- as.factor(kmeans_result$cluster)
ggplot(cluster_data, aes(x = FatalityRate, y = SeriousInjuryCount, color = cluster)) +
geom_point(alpha = 0.6) +
labs(title = "K-Means Clustering of Aviation Incidents")
Interpretation:
cluster_data1 <- crash_data %>%
select(FatalityRate, SeriousInjuryCount, MinorInjuryCount) %>%
filter(complete.cases(.))
set.seed(123)
kmeans_result1 <- kmeans(cluster_data1, centers = 3)
cluster_data1$cluster <- as.factor(kmeans_result1$cluster)
p29 <- ggplot(cluster_data1, aes(x = FatalityRate, y = SeriousInjuryCount, color = cluster)) +
geom_point(alpha = 0.6) +
labs(title = "K-Means Clustering: Injury Metrics") +
theme_minimal()
print(p29)
print("Cluster Centers (Q29):")
## [1] "Cluster Centers (Q29):"
print(kmeans_result1$centers)
## FatalityRate SeriousInjuryCount MinorInjuryCount
## 1 0.9516408 0.6929716 1.13669191
## 2 94.5035062 0.1548708 0.03747601
## 3 3.1409188 22.6666667 162.50000000
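The cluster centers above come from unscaled features, so FatalityRate (which ranges from 0 to 100) can dominate the Euclidean distances that k-means relies on. One common option, sketched here on hypothetical toy data, is to standardize the features first and then convert the centers back to the original units for reporting. The column names and values are assumptions for illustration.

```r
# Hypothetical toy data: one wide-range feature and one narrow-range feature
set.seed(123)
toy <- data.frame(
  rate     = c(runif(50, 0, 10), runif(50, 90, 100)),
  injuries = rpois(100, lambda = 1)
)

# Standardize each column to mean 0 and standard deviation 1
scaled <- scale(toy)

# Cluster on the standardized features
km <- kmeans(scaled, centers = 2, nstart = 10)

# Convert the centers back to the original units for interpretation
centers_original <- sweep(km$centers, 2, attr(scaled, "scaled:scale"), "*")
centers_original <- sweep(centers_original, 2, attr(scaled, "scaled:center"), "+")
print(centers_original)
```

Whether scaling is appropriate depends on the question being asked. If differences in fatality rate are meant to carry more weight than differences in injury counts, leaving the features unscaled may be a deliberate choice.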
Interpretation:
spatial_data <- crash_data %>%
select(Year, Latitude, Longitude) %>%
filter(complete.cases(.))
set.seed(123)
kmeans_result2 <- kmeans(spatial_data, centers = 4)
spatial_data$cluster <- as.factor(kmeans_result2$cluster)
p30 <- ggplot(spatial_data, aes(x = Latitude, y = Longitude, color = cluster)) +
geom_point(alpha = 0.5) +
labs(title = "K-Means Clustering: Spatial and Temporal Patterns") +
theme_minimal()
print(p30)
print("Spatial Cluster Centers (Q30):")
## [1] "Spatial Cluster Centers (Q30):"
print(kmeans_result2$centers)
## Year Latitude Longitude
## 1 2021.125 1.465587e+05 680928.78000
## 2 2022.000 1.424632e+05 -926798.83000
## 3 2021.667 3.685580e+07 -334741.81297
## 4 2011.346 3.372937e+01 -84.70322
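The printed centers include latitude values such as 3.69e+07 and longitude values such as -926798.83, which fall outside the valid ranges of -90 to 90 and -180 to 180 degrees. This suggests that some coordinates in the data are not stored in decimal degrees. One possible pre-filter, sketched on a hypothetical toy data frame, keeps only rows whose coordinates are plausible before clustering.

```r
# Hypothetical toy coordinates: two plausible rows and two out-of-range rows
toy <- data.frame(
  Latitude  = c(33.7, 61.2, 36855800, -12.5),
  Longitude = c(-84.7, -149.9, -334741.8, 680928.8)
)

# Keep rows where both coordinates are within valid decimal-degree ranges
valid <- with(toy, abs(Latitude) <= 90 & abs(Longitude) <= 180)
toy_valid <- toy[valid, ]

print(toy_valid)
cat("Rows removed:", sum(!valid), "\n")
```

Filtering is only one option. If the out-of-range values follow a recognizable encoding, converting them would preserve more data, but that would need to be checked against the source documentation.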
Interpretation:
wss <- sapply(1:10, function(k) {
kmeans(cluster_data1, centers = k, nstart = 10)$tot.withinss
})
p31 <- ggplot(tibble(k = 1:10, WSS = wss), aes(x = k, y = WSS)) +
geom_line() + geom_point() +
labs(title = "Elbow Method for Optimal K", x = "Number of Clusters (k)", y = "Within-Cluster Sum of Squares") +
theme_minimal()
print(p31)
optimal_k <- which.min(diff(wss)) + 1 # Crude heuristic: the k reached by the largest single drop in WSS, not a true elbow test
print(paste("Suggested Optimal K (Q35):", optimal_k))
## [1] "Suggested Optimal K (Q35): 2"
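The heuristic above picks the k reached by the largest single drop in WSS, which is very often k = 2 because the first split usually removes the most variance. An alternative sketch is to look for the point where the curve bends most, using second differences. The WSS values below are hypothetical and chosen so that the two approaches disagree.

```r
# Hypothetical WSS values for k = 1..6 (illustrative only)
wss <- c(1000, 500, 150, 130, 120, 115)

# Heuristic used in the analysis: k reached by the largest single drop
largest_drop_k <- which.min(diff(wss)) + 1

# Alternative: k with the largest second difference (sharpest bend in the curve).
# Element i of the second difference is centered on k = i + 1.
second_diff <- diff(wss, differences = 2)
elbow_k <- which.max(second_diff) + 1

cat("Largest-drop k:", largest_drop_k, "\n")  # 2
cat("Second-difference elbow k:", elbow_k, "\n")  # 3
```

Neither rule is definitive. Both are best read alongside the elbow plot and a separate criterion such as the average silhouette width.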
Interpretation:
sil <- silhouette(kmeans_result1$cluster, dist(cluster_data1))
p36 <- plot(sil, main = "Silhouette Plot for K=3 Clusters")
print("Silhouette Analysis (Q36): Average Silhouette Width")
## [1] "Silhouette Analysis (Q36): Average Silhouette Width"
print(mean(sil[, 3]))
## [1] 0.913549
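The average silhouette width above is reported for k = 3 only. A single value is easier to judge when compared with the same statistic for neighbouring values of k. The sketch below does this on simulated data with three well-separated groups. The data and the range of k are assumptions for illustration.

```r
library(cluster)

# Simulated data (hypothetical): three well-separated groups in two dimensions
set.seed(123)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2)
)
d <- dist(toy)

# Average silhouette width for each candidate k
ks <- 2:5
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(toy, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- ks

print(round(avg_sil, 3))
best_k <- ks[which.max(avg_sil)]
cat("k with the highest average silhouette width:", best_k, "\n")
```

On the real data, dist() on roughly 20,000 rows builds a large distance matrix, so running this on a random sample of rows may be more practical.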
Interpretation:
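# Step 1: Prepare features, keeping the label in the same data frame
# so that rows removed by drop_na() take their labels with them
knn_data <- crash_data %>%
  select(FatalityRate, NumberOfEngines_clean, IncidentSeverity) %>%
  mutate(
    FatalityRate = as.numeric(FatalityRate),
    NumberOfEngines_clean = as.numeric(NumberOfEngines_clean)
  ) %>%
  drop_na()
features_knn <- knn_data %>% select(-IncidentSeverity)
# Step 2: Match labels correctly
# (drop_na() renumbers tibble rows, so indexing crash_data by rownames() would misalign labels)
labels_knn <- droplevels(knn_data$IncidentSeverity)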
# Step 3: Add jitter to prevent identical values (avoid KNN ties)
set.seed(123)
features_knn <- features_knn %>%
mutate(
FatalityRate = FatalityRate + runif(n(), -0.001, 0.001),
NumberOfEngines_clean = NumberOfEngines_clean + runif(n(), -0.01, 0.01)
)
# Step 4: Normalize features (essential for KNN)
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
features_knn <- features_knn %>% mutate(across(everything(), normalize))
# Step 5: Train-test split (70/30)
set.seed(123)
n <- nrow(features_knn)
train_idx <- sample(1:n, 0.7 * n)
train_features <- features_knn[train_idx, ]
test_features <- features_knn[-train_idx, ]
train_labels <- labels_knn[train_idx]
test_labels <- labels_knn[-train_idx]
# Step 6: Run KNN (no ties now)
knn_pred <- knn(
train = train_features,
test = test_features,
cl = train_labels,
k = 5
)
# Step 7: Evaluate Model
conf_mat <- table(Predicted = knn_pred, Actual = test_labels)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat) * 100
# Print results
print("KNN Confusion Matrix:")
## [1] "KNN Confusion Matrix:"
print(conf_mat)
## Actual
## Predicted Catastrophic Minor No Injury Serious
## Catastrophic 187 99 479 94
## Minor 88 65 219 56
## No Injury 970 611 2476 458
## Serious 60 34 141 29
cat("Accuracy:", round(accuracy, 2), "%\n")
## Accuracy: 45.45 %
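When features are filtered for missing values, the labels need to be filtered with exactly the same rows. A minimal base-R sketch of one way to do this uses a single logical mask for both. The data frame `toy` and its values are hypothetical.

```r
# Hypothetical toy data: FatalityRate is NaN where there were no injuries
toy <- data.frame(
  FatalityRate    = c(100, NaN, 0, 50, NaN),
  NumberOfEngines = c(1, 1, 2, 1, 2),
  Severity        = factor(c("Catastrophic", "No Injury", "Serious",
                             "Catastrophic", "No Injury"))
)

# One logical mask, applied to both features and labels
keep <- complete.cases(toy[, c("FatalityRate", "NumberOfEngines")])
features <- toy[keep, c("FatalityRate", "NumberOfEngines")]
labels <- droplevels(toy$Severity[keep])

print(features)
print(labels)
```

A quick sanity check after this step is that the number of feature rows equals the number of labels, and that no label appears that is impossible for the retained rows.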
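Interpretation:
KNN (k = 5) classified IncidentSeverity from FatalityRate and NumberOfEngines, reaching 45.45% accuracy. The confusion matrix shown includes a "No Injury" class, which cannot occur among rows with a defined FatalityRate. This indicates that the run which produced it paired features with labels from the wrong rows, so the accuracy figure reflects that misalignment rather than the predictive value of the features.
# 1. Select features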
features <- crash_data %>%
select(FatalityRate, NumberOfEngines_clean, Latitude) %>%
mutate(across(everything(), as.numeric)) %>%
drop_na()
labels <- crash_data$AtAirport[complete.cases(crash_data[, c("FatalityRate", "NumberOfEngines_clean", "Latitude")])]
# 2. Normalize + tiny jitter (tie fix)
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
set.seed(123)
features <- features %>%
mutate(
FatalityRate = normalize(FatalityRate) + runif(n(), -0.002, 0.002),
NumberOfEngines_clean = normalize(NumberOfEngines_clean) + runif(n(), -0.01, 0.01),
Latitude = normalize(Latitude) + runif(n(), -0.002, 0.002)
)
# 3. Train-test split (70/30)
set.seed(123)
n <- nrow(features)
train_idx <- sample(1:n, 0.7 * n)
train_feat <- features[train_idx, ]
test_feat <- features[-train_idx, ]
train_lab <- labels[train_idx]
test_lab <- labels[-train_idx]
# 4. KNN
knn_pred <- knn(train_feat, test_feat, train_lab, k = 5)
# 5. Results
conf_mat <- table(Predicted = knn_pred, Actual = test_lab)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat) * 100
print(conf_mat)
## Actual
## Predicted FALSE TRUE
## FALSE 0 0
## TRUE 25 6041
cat("Accuracy:", round(accuracy, 2), "%\n")
## Accuracy: 99.59 %
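With classes this unbalanced, balanced accuracy (the mean of per-class recall) is a useful companion to plain accuracy. A minimal sketch, re-entering the matrix printed above by hand; `cm2` is an illustrative name.

```r
# Confusion matrix copied from the printed output (rows = predicted, columns = actual)
cm2 <- matrix(
  c( 0,    0,
    25, 6041),
  nrow = 2, byrow = TRUE,
  dimnames = list(Predicted = c("FALSE", "TRUE"), Actual = c("FALSE", "TRUE"))
)

accuracy <- sum(diag(cm2)) / sum(cm2) * 100
# Recall per actual class: 0 of 25 FALSE cases and 6041 of 6041 TRUE cases
recall <- diag(cm2) / colSums(cm2) * 100
balanced_accuracy <- mean(recall)

cat("Accuracy:", round(accuracy, 2), "%\n")
cat("Balanced accuracy:", round(balanced_accuracy, 2), "%\n")
```

The matrix shows that every test case was labelled TRUE, so the model matches the majority-class baseline and never identifies a FALSE case.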
Interpretation:
rules_data <- crash_data %>%
  select(PurposeOfFlight_Full, WeatherCondition, AirCraftDamage, IncidentSeverity) %>%
  mutate(across(everything(), as.factor)) %>%
  na.omit()
# One transaction per row: a character vector of that row's values.
# Column names are dropped, so identical labels from different columns become one item.
trans_list <- split(rules_data, seq(nrow(rules_data)))
trans_list <- lapply(trans_list, function(x) {
  as.character(unlist(x))
})
transactions <- as(trans_list, "transactions")
## Warning in asMethod(object): removing duplicated items in transactions
rules <- apriori(
  transactions,
  parameter = list(supp = 0.005, conf = 0.6, maxlen = 4)
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.005 1
## maxlen target ext
## 4 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 222
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[54 item(s), 44507 transaction(s)] done [0.01s].
## sorting and recoding items ... [20 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [201 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
top_rules <- sort(rules, by = "lift")[1:15]
inspect(top_rules)
## lhs rhs support confidence
## [1] {Catastrophic, IMC} => {Destroyed} 0.021007931 0.6885125
## [2] {Catastrophic, IMC, Personal} => {Destroyed} 0.016289572 0.6782039
## [3] {Destroyed, IMC, Personal} => {Catastrophic} 0.016289572 0.9564644
## [4] {Destroyed, IMC} => {Catastrophic} 0.021007931 0.9482759
## [5] {Destroyed, Unknown} => {Catastrophic} 0.006291145 0.8777429
## [6] {Destroyed, Personal} => {Catastrophic} 0.078684252 0.8026587
## [7] {Destroyed} => {Catastrophic} 0.108140293 0.7881120
## [8] {Destroyed, Personal, VMC} => {Catastrophic} 0.060956703 0.7666007
## [9] {Destroyed, VMC} => {Catastrophic} 0.085020334 0.7539350
## [10] {Destroyed, Instructional} => {Catastrophic} 0.005706967 0.7383721
## [11] {Destroyed, Instructional, VMC} => {Catastrophic} 0.005347473 0.7300613
## [12] {IMC, Personal} => {Catastrophic} 0.024018694 0.6258782
## [13] {IMC} => {Catastrophic} 0.030512054 0.6024845
## [14] {Instructional, Substantial, VMC} => {No Injury} 0.077246276 0.7343016
## [15] {Instructional, Substantial} => {No Injury} 0.077740580 0.7321202
## coverage lift count
## [1] 0.030512054 5.017787 935
## [2] 0.024018694 4.942660 725
## [3] 0.017031029 4.604582 725
## [4] 0.022153819 4.565161 935
## [5] 0.007167412 4.225604 280
## [6] 0.098029523 3.864135 3502
## [7] 0.137214371 3.794105 4813
## [8] 0.079515582 3.690546 2713
## [9] 0.112768778 3.629571 3784
## [10] 0.007729121 3.554649 254
## [11] 0.007324690 3.514639 238
## [12] 0.038375986 3.013084 1069
## [13] 0.050643719 2.900463 1358
## [14] 0.105196935 1.345585 3438
## [15] 0.106185544 1.341587 3460
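The columns of the rule table are tied together by simple ratios: confidence = support / coverage, and lift = confidence / support(rhs). The sketch below checks this for rule [1], {Catastrophic, IMC} => {Destroyed}, using only figures printed above. The support of {Destroyed} is taken from the coverage of rule [7], whose left-hand side is {Destroyed} alone, and the variable names are illustrative.

```r
n_transactions <- 44507        # from the apriori log: "44507 transaction(s)"
support_rule   <- 0.021007931  # support of rule [1]: lhs and rhs together
coverage_lhs   <- 0.030512054  # coverage of rule [1]: support of {Catastrophic, IMC}
support_rhs    <- 0.137214371  # coverage of rule [7]: support of {Destroyed}

confidence <- support_rule / coverage_lhs   # share of lhs cases that also have the rhs
lift       <- confidence / support_rhs      # confidence relative to the rhs base rate
count      <- round(support_rule * n_transactions)

cat("confidence:", round(confidence, 7), "\n")
cat("lift:", round(lift, 6), "\n")
cat("count:", count, "\n")
```

A lift near 5 means that Destroyed appears about five times as often among Catastrophic IMC accidents as it does across all transactions.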
Interpretation:
The rules relate PurposeOfFlight, WeatherCondition, AirCraftDamage, and IncidentSeverity.
# Restrict to accidents with a fatality rate above 50
high_fatality_data <- crash_data %>%
  filter(FatalityRate > 50) %>%
  select(PurposeOfFlight_Full, WeatherCondition, AirCraftDamage, IncidentSeverity) %>%
  mutate(across(everything(), as.factor)) %>%
  na.omit()
trans_list_high <- split(high_fatality_data, seq(nrow(high_fatality_data)))
trans_list_high <- lapply(trans_list_high, function(x) {
  as.character(unlist(x))
})
transactions_high <- as(trans_list_high, "transactions")
## Warning in asMethod(object): removing duplicated items in transactions
rules_high <- apriori(
  transactions_high,
  parameter = list(supp = 0.005, conf = 0.6, maxlen = 4)
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.005 1
## maxlen target ext
## 4 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 40
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[43 item(s), 8102 transaction(s)] done [0.00s].
## sorting and recoding items ... [20 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [176 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
top_rules_high <- sort(rules_high, by = "lift")[1:15]
inspect(top_rules_high)
## lhs rhs support
## [1] {Business, IMC} => {Destroyed} 0.008639842
## [2] {Business, Catastrophic, IMC} => {Destroyed} 0.008639842
## [3] {IMC, Positioning} => {Destroyed} 0.005677610
## [4] {Catastrophic, IMC, Positioning} => {Destroyed} 0.005677610
## [5] {IMC} => {Destroyed} 0.109479141
## [6] {Catastrophic, IMC} => {Destroyed} 0.109479141
## [7] {IMC, Personal} => {Destroyed} 0.085164157
## [8] {Catastrophic, IMC, Personal} => {Destroyed} 0.085164157
## [9] {Business} => {Destroyed} 0.022463589
## [10] {Business, Catastrophic} => {Destroyed} 0.022463589
## [11] {Flight Test} => {VMC} 0.011231795
## [12] {Destroyed, Flight Test} => {VMC} 0.006418168
## [13] {Catastrophic, Flight Test} => {VMC} 0.011231795
## [14] {Catastrophic, Destroyed, Flight Test} => {VMC} 0.006418168
## [15] {Positioning} => {Destroyed} 0.016539126
## confidence coverage lift count
## [1] 0.7777778 0.011108368 1.421190 70
## [2] 0.7777778 0.011108368 1.421190 70
## [3] 0.7540984 0.007529005 1.377922 46
## [4] 0.7540984 0.007529005 1.377922 46
## [5] 0.7011858 0.156134288 1.281238 887
## [6] 0.7011858 0.156134288 1.281238 887
## [7] 0.6955645 0.122438904 1.270966 690
## [8] 0.6955645 0.122438904 1.270966 690
## [9] 0.6893939 0.032584547 1.259691 182
## [10] 0.6893939 0.032584547 1.259691 182
## [11] 1.0000000 0.011231795 1.212693 91
## [12] 1.0000000 0.006418168 1.212693 52
## [13] 1.0000000 0.011231795 1.212693 91
## [14] 1.0000000 0.006418168 1.212693 52
## [15] 0.6536585 0.025302394 1.194394 134
Interpretation:
# Five most frequent values of City (the variable is named top_states,
# but the grouping column used here is City)
top_states <- crash_data %>%
  count(City, sort = TRUE) %>%
  head(5) %>%
  pull(City)
state_data <- crash_data %>%
  filter(City %in% top_states) %>%
  select(PurposeOfFlight_Full, WeatherCondition, AirCraftDamage, IncidentSeverity) %>%
  mutate(across(everything(), as.factor)) %>%
  na.omit()
trans_list_state <- split(state_data, seq(nrow(state_data)))
trans_list_state <- lapply(trans_list_state, function(x) {
  as.character(unlist(x))
})
transactions_state <- as(trans_list_state, "transactions")
## Warning in asMethod(object): removing duplicated items in transactions
rules_state <- apriori(
  transactions_state,
  parameter = list(supp = 0.005, conf = 0.6, maxlen = 4)
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.005 1
## maxlen target ext
## 4 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 3
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[34 item(s), 699 transaction(s)] done [0.00s].
## sorting and recoding items ... [24 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [212 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
top_rules_state <- sort(rules_state, by = "lift")[1:15]
inspect(top_rules_state)
## lhs rhs support confidence
## [1] {Catastrophic, IMC, Personal} => {Destroyed} 0.007153076 0.7142857
## [2] {Catastrophic, IMC} => {Destroyed} 0.010014306 0.7000000
## [3] {Destroyed, IMC} => {Catastrophic} 0.010014306 1.0000000
## [4] {Destroyed, IMC, Personal} => {Catastrophic} 0.007153076 1.0000000
## [5] {IMC, Serious} => {None} 0.005722461 0.8000000
## [6] {IMC, Personal, Serious} => {None} 0.005722461 0.8000000
## [7] {Destroyed, Personal} => {Catastrophic} 0.025751073 0.8571429
## [8] {Destroyed, VMC} => {Catastrophic} 0.025751073 0.8571429
## [9] {Destroyed} => {Catastrophic} 0.035765379 0.8333333
## [10] {Destroyed, Personal, VMC} => {Catastrophic} 0.018597997 0.8125000
## [11] {IMC, None, Personal} => {Serious} 0.005722461 0.8000000
## [12] {Personal, Unknown} => {None} 0.007153076 0.6250000
## [13] {IMC, None} => {Serious} 0.005722461 0.6666667
## [14] {None,None} => {No Injury} 0.005722461 1.0000000
## [15] {Substantial,Minor} => {No Injury} 0.005722461 1.0000000
## coverage lift count
## [1] 0.010014306 16.642857 5
## [2] 0.014306152 16.310000 7
## [3] 0.010014306 11.459016 7
## [4] 0.007153076 11.459016 5
## [5] 0.007153076 10.753846 4
## [6] 0.007153076 10.753846 4
## [7] 0.030042918 9.822014 18
## [8] 0.030042918 9.822014 18
## [9] 0.042918455 9.549180 25
## [10] 0.022889843 9.310451 13
## [11] 0.007153076 9.167213 4
## [12] 0.011444921 8.401442 5
## [13] 0.008583691 7.639344 4
## [14] 0.005722461 1.389662 4
## [15] 0.005722461 1.389662 4
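Item labels such as None and Unknown in the table above do not show which column they came from, and the conversion warning indicates that identical labels within a row were merged. One possible alternative, sketched here on a small made-up data frame (`toy`, not the NTSB data), is to coerce a data frame of factors directly to transactions, in which case arules labels each item as column=value.

```r
library(arules)

# Made-up rows (assumption): "Minor" and "Unknown" each occur in two different columns
toy <- data.frame(
  WeatherCondition = factor(c("VMC", "IMC", "Unknown")),
  AirCraftDamage   = factor(c("Minor", "Destroyed", "Unknown")),
  IncidentSeverity = factor(c("Minor", "Catastrophic", "Serious"))
)

# Coercing a data frame of factors keeps the column name in every item label
trans_toy <- as(toy, "transactions")

print(itemLabels(trans_toy))
print(size(trans_toy))  # every transaction keeps one item per column
```

With this encoding, a damage level of Minor and a severity of Minor stay separate items, so rules mined from it can be read without guessing the source column.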
Interpretation:
Engineered FatalityRate and IncidentSeverity to quantify accident severity and outcomes.
Differences in FatalityRate exist across categories such as:
Across all analyses, multiple operational, environmental, and technical factors — including aircraft category, flight purpose, weather conditions, and damage levels — strongly influence aviation accident outcomes.
This data-driven study underscores the need for evidence-based interventions to reduce aviation risks and improve safety standards.
Aircraft Category Matters:
Smaller and personal aircraft exhibit a higher tendency for severe or
fatal incidents.
Flight Purpose Impact:
Personal and non-commercial flights consistently record higher accident
severity compared to business or public operations.
Weather Conditions are Critical:
IMC (poor visibility) conditions significantly raise the probability of
severe or fatal crashes.
Aircraft Damage Severity:
Greater damage correlates strongly with higher fatality rates — a clear
physical indicator of crash severity.
Geographical Variations:
Certain U.S. states show clustering of incidents due to higher flight
density or environmental factors.
Engine Count Influence:
More engines correlate slightly with better survivability — redundancy
enhances flight safety.
Multi-Factor Relationship:
The combination of WeatherCondition,
PurposeOfFlight, and NumberOfEngines
provides stronger predictive power for fatality outcomes.
Temporal Patterns:
Despite long-term safety improvements, occasional fatality spikes
highlight continuing vulnerabilities in small aircraft and adverse
weather operations.
R packages used: tidyverse, ggplot2, class, cluster, arules, dplyr, stats
✈️ “Aviation safety is not achieved by chance; it’s achieved by data, insight, and continuous improvement.”