Analyzing Factors Contributing to Aviation Accident Severity

Brandon Chanderban

2026-05-12

Abstract

This project examines historical aviation accident data to explore factors associated with the severity of crash outcomes. The analysis combines a structured Kaggle aviation accident dataset with historical weather data retrieved through the Open-Meteo Historical Weather API.

Since the accident dataset stores locations as free-text rather than geographic coordinates, the workflow involved narrowing the dataset to clearer location records, applying geocoding to obtain latitude and longitude values, and using those coordinates to retrieve weather variables such as mean temperature, precipitation, and maximum wind speed.

The analysis used feature engineering, exploratory visualization, and linear regression modelling to examine relationships between fatality rate and selected variables, including persons aboard, year, and weather conditions.

The regression model was statistically significant, but its adjusted R-squared value suggested that it explained only a modest portion of variation in fatality rate. Persons aboard and mean temperature were statistically significant within the model, though the persons aboard result requires caution because fatality rate was calculated using the persons aboard variable.

Motivation and Research Question

My interest in this topic came from the increased public attention surrounding aviation crashes, near misses, and other safety-related incidents.

While this attention may reflect increased reporting rather than an actual rise in incidents, it provided a useful motivation for examining historical crash data more closely.

Research focus:
Which selected aviation and weather-related variables appear associated with aviation accident fatality severity?

Data Sources

The project uses two different data source types:

  1. Structured CSV dataset
    Kaggle aviation accident records, including date, location, operator, aircraft type, persons aboard, and fatalities.

  2. API-based weather data
    Open-Meteo Historical Weather API, used to retrieve mean temperature, precipitation sum, and maximum wind speed by accident date and location.

Workflow

The project followed a reproducible data science workflow:

  1. Import the Kaggle aviation accident data from a copy hosted on GitHub
  2. Clean date, location, and numeric fields
  3. Filter records from 1940 onward for weather coverage
  4. Geocode usable free-text locations
  5. Engineer severity-related variables
  6. Retrieve historical weather variables through API calls
  7. Explore patterns visually
  8. Fit and interpret a regression model
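Steps 2 and 3 of the workflow can be sketched on a toy data frame. This is a minimal illustration rather than the project's actual cleaning code: it assumes that dates are stored as month/day/year text and that the columns are named Date, Location, Aboard, and Fatalities. The object names aviation_raw and aviation_clean_demo are hypothetical, and the rows are invented.

```r
# Toy stand-in for the raw accident records (rows invented for illustration)
aviation_raw <- data.frame(
  Date = c("09/17/1908", "03/05/1941", "11/22/1968", "not a date"),
  Location = c("Fort Myer, Virginia", " Lima,  Peru ", "Tokyo, Japan", NA),
  Aboard = c("2", "14", "107", "5"),
  Fatalities = c("1", "14", "0", "5"),
  stringsAsFactors = FALSE
)

# Parse text dates; entries that do not match the assumed format become NA
aviation_raw$Date <- as.Date(aviation_raw$Date, format = "%m/%d/%Y")

# Convert count fields from text to numeric
aviation_raw$Aboard <- as.numeric(aviation_raw$Aboard)
aviation_raw$Fatalities <- as.numeric(aviation_raw$Fatalities)

# Derive the year, then keep only records from 1940 onward
aviation_raw$Year <- as.integer(format(aviation_raw$Date, "%Y"))
keep <- !is.na(aviation_raw$Year) & aviation_raw$Year >= 1940
aviation_clean_demo <- aviation_raw[keep, ]
```

Rows with unparseable dates are dropped along with pre-1940 records, because a missing year cannot be matched to a weather request.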

Data Preparation

Stage                                 Records
Original Kaggle dataset               5268
After 1940 filter                     4739
Saved geocoded dataset                3128
After removing missing coordinates    2603
Weather-enriched analysis dataset     2595

The dataset was narrowed to records that could support the later geocoding and weather integration steps.
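The attrition across stages is easier to judge as shares of the original dataset. The short sketch below only re-expresses the record counts reported above; the column names are hypothetical.

```r
# Record counts copied from the data preparation stages above
stages <- data.frame(
  stage = c(
    "Original Kaggle dataset",
    "After 1940 filter",
    "Saved geocoded dataset",
    "After removing missing coordinates",
    "Weather-enriched analysis dataset"
  ),
  records = c(5268, 4739, 3128, 2603, 2595)
)

# Share of the original records that survive each stage
stages$share_of_original <- round(stages$records / stages$records[1], 3)

# Records lost relative to the previous stage (NA for the first row)
stages$dropped_from_previous <- c(NA, -diff(stages$records))

print(stages)
```

On these counts, the final analysis dataset keeps roughly 49% of the original records, and the largest single reduction (1,611 records) falls between the 1940 filter and the saved geocoded dataset.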

Challenge: Free-Text Locations

One challenge was that the location field was stored as text rather than latitude and longitude.

To improve geocoding reliability, vague records were removed before coordinates were retrieved.

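library(dplyr)
library(stringr)

# Keep records with a usable location and drop vague or non-land descriptions.
# The pattern matches substrings case-insensitively, so "Sea" also matches
# "Seattle" and "Off" also matches "Offenbach".
aviation_locations <- aviation_clean %>%
  filter(!is.na(Location)) %>%
  filter(
    !str_detect(Location, regex(
      "Ocean|Sea|Gulf|River|Unknown|Near|Off",
      ignore_case = TRUE
    ))
  ) %>%
  # Collapse repeated whitespace and trim both ends
  mutate(Location = str_squish(Location))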
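Because the filter pattern matches substrings, it can also remove clear place names that merely contain one of the keywords. The sketch below compares the pattern used above with a hypothetical stricter variant that adds word boundaries. The stricter variant was not part of the project, and using it would change the record counts reported earlier. The example locations are invented.

```r
library(stringr)

locations <- c(
  "Seattle, Washington",
  "North Sea",
  "Offenbach, Germany",
  "Off Cape May, New Jersey",
  "Near Bogota, Colombia"
)

# Substring pattern, as used in the filter above
loose <- regex("Ocean|Sea|Gulf|River|Unknown|Near|Off", ignore_case = TRUE)

# Hypothetical stricter variant: \\b restricts matches to whole words
strict <- regex(
  "\\b(Ocean|Sea|Gulf|River|Unknown|Near|Off)\\b",
  ignore_case = TRUE
)

comparison <- data.frame(
  location = locations,
  loose_match = str_detect(locations, loose),
  strict_match = str_detect(locations, strict)
)
print(comparison)
```

The loose pattern flags all five entries, while the whole-word variant leaves "Seattle, Washington" and "Offenbach, Germany" in the data.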

Challenge: Geocoding and API Runtime

Geocoding and retrieving weather data for thousands of records were time-consuming steps.

To keep the presentation and final report reproducible, the long-running steps were executed once, saved to CSV, uploaded to GitHub, and then re-imported during rendering.

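library(tidygeocoder)

# Geocode each free-text location with OpenStreetMap's Nominatim service;
# locations that cannot be resolved are returned with NA coordinates.
aviation_geocoded <- aviation_locations %>%
  geocode(
    address = Location,
    method = "osm",
    lat = latitude,
    long = longitude
  )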
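The run-once-and-save approach described above can be sketched as a small caching helper. This is an illustration under stated assumptions, not the project's code: load_or_compute(), run_slow_step(), and the temporary file path are hypothetical, and the project stored its saved CSV on GitHub rather than in a local temporary directory. The placeholder coordinates are approximate.

```r
# Hypothetical local cache path used only for this sketch
cache_file <- file.path(tempdir(), "aviation_geocoded_demo.csv")

# Placeholder for a slow step such as geocoding or weather retrieval
run_slow_step <- function() {
  data.frame(Location = "Lima, Peru", latitude = -12.05, longitude = -77.04)
}

# Return the saved result when it exists; otherwise compute and save it
load_or_compute <- function(path, compute) {
  if (file.exists(path)) {
    return(read.csv(path, stringsAsFactors = FALSE))
  }
  result <- compute()
  write.csv(result, path, row.names = FALSE)
  result
}

geocoded_demo <- load_or_compute(cache_file, run_slow_step)
```

On a second call with the same path, the slow function is never invoked, which is what keeps rendering fast and repeatable.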

Feature Engineering

The main response variable was fatality rate, calculated as:

\[ \text{Fatality Rate} = \frac{\text{Fatalities}}{\text{Aboard}} \]

This allowed accident severity to be examined proportionally rather than only through raw fatality totals.
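A minimal sketch of the calculation on invented rows follows. The guard for zero or missing persons aboard is an assumption about how such records might be handled, and the definition of the fatal_accident indicator (at least one fatality) is also assumed rather than taken from the project code.

```r
# Invented rows covering the edge cases
crashes <- data.frame(
  Aboard = c(10, 50, 0, NA, 4),
  Fatalities = c(10, 5, 0, 2, 0)
)

# The rate is undefined when nobody is recorded aboard, so return NA there
crashes$fatality_rate <- ifelse(
  !is.na(crashes$Aboard) & crashes$Aboard > 0,
  crashes$Fatalities / crashes$Aboard,
  NA_real_
)

# Assumed indicator definition: 1 if at least one fatality was recorded
crashes$fatal_accident <- as.integer(crashes$Fatalities > 0)

print(crashes)
```

Without the guard, a record with zero persons aboard and zero fatalities would produce NaN, which can silently propagate into later summaries.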

Weather API Function

The Open-Meteo Historical Weather API was queried using each accident’s latitude, longitude, and date.

A custom get_weather_data() function was created to automate the retrieval of historical weather conditions for each crash record.

Key steps:

  • Passed accident coordinates and date into the API
  • Retrieved historical daily weather observations
  • Retained mean temperature, precipitation sum, and maximum wind speed

The function was then applied across the aviation dataset using pmap(), and the returned weather variables were appended back into the crash records.
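The retrieval steps above might look roughly like the following sketch. It is not the project's get_weather_data() function. The endpoint URL and the daily variable names (temperature_2m_mean, precipitation_sum, wind_speed_10m_max) reflect my reading of the Open-Meteo documentation and should be checked against it. The helper names build_weather_url(), extract_daily(), first_or_na(), and get_weather_sketch() are hypothetical.

```r
library(purrr)

# Build the request URL for one accident (endpoint and names are assumptions)
build_weather_url <- function(latitude, longitude, date) {
  paste0(
    "https://archive-api.open-meteo.com/v1/archive",
    "?latitude=", latitude,
    "&longitude=", longitude,
    "&start_date=", date,
    "&end_date=", date,
    "&daily=temperature_2m_mean,precipitation_sum,wind_speed_10m_max",
    "&timezone=UTC"
  )
}

# Return the first value, or NA when the field is absent
first_or_na <- function(x) if (length(x) == 0) NA_real_ else as.numeric(x[1])

# Keep only the three daily values from a parsed JSON response
extract_daily <- function(parsed) {
  data.frame(
    temperature_mean = first_or_na(parsed$daily$temperature_2m_mean),
    precipitation_sum = first_or_na(parsed$daily$precipitation_sum),
    wind_speed_max = first_or_na(parsed$daily$wind_speed_10m_max)
  )
}

# Fetch and parse one record; failed requests yield a row of NA values
get_weather_sketch <- function(latitude, longitude, date) {
  tryCatch(
    extract_daily(jsonlite::fromJSON(build_weather_url(latitude, longitude, date))),
    error = function(e) extract_daily(list())
  )
}

# pmap matches the list names to the function arguments, row by row.
# Only URLs are built here, so this part runs without network access.
urls <- pmap_chr(
  list(
    latitude = c(-12.05, 35.68),
    longitude = c(-77.04, 139.69),
    date = c("1941-03-05", "1968-11-22")
  ),
  build_weather_url
)
```

Replacing build_weather_url with get_weather_sketch and pmap_chr with pmap_dfr would return one row of weather values per accident, ready to be bound back onto the crash records.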

Engineered Variables Summary

Rather than presenting every summary statistic, the main values used to understand the retained data were:

  • Mean fatality rate: 0.793
  • Median fatality rate: 1.000
  • Mean persons aboard: 26.68
  • Median persons aboard: 11
  • Fatal accident indicator mean: 0.985

With a median fatality rate of 1.000, at least half of the retained accidents had no survivors among those aboard, so the retained dataset is dominated by severe aviation accidents.

Fatality Rate Over Time

Interpreting the Time Trend

The visualization shows that a large share of accidents in the retained dataset had very high fatality rates, with observations clustered near 1.0.

The fitted trend line shows a slight downward pattern over time, which may point to gradual improvements in aviation safety, aircraft technology, and operational practices.

However, the wide spread of observations shows that severity continues to vary substantially across individual cases.
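A scatter plot with a fitted linear trend of this kind could be produced along the following lines. The data here are simulated purely so the sketch runs on its own; the object names demo and trend_plot are hypothetical, and the styling choices are not taken from the project.

```r
library(ggplot2)

# Simulated stand-in data: years and fatality rates clipped to [0, 1]
set.seed(1)
demo <- data.frame(Year = sample(1940:2009, 200, replace = TRUE))
demo$fatality_rate <- pmin(1, pmax(0, rnorm(200, mean = 0.8, sd = 0.3)))

# Semi-transparent points reveal clustering; geom_smooth adds a linear trend
trend_plot <- ggplot(demo, aes(x = Year, y = fatality_rate)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Year", y = "Fatality rate")

print(trend_plot)
```

Swapping the x aesthetic for a wind speed column would give the equivalent plot for weather conditions.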

Wind Speed and Fatality Rate

Interpreting Wind Speed

The visualization shows that fatality rate varies substantially across the range of observed wind speeds.

The fitted trend line suggests a slight positive relationship, where higher wind speeds may be associated with somewhat more severe crash outcomes on average.

However, the broad spread of observations shows that wind conditions alone do not explain accident severity.

Regression Model

A linear regression model was fitted to examine whether selected variables appeared associated with fatality rate.

Variable             Estimate    p-value
(Intercept)           1.31699     0.0455
Aboard               -0.00243   < 0.0001
Year                 -0.00023     0.4900
temperature_mean     -0.00131     0.0176
precipitation_sum     0.00027     0.6530
wind_speed_max        0.00072     0.3300
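A model with this structure can be fitted with lm(). The sketch below uses simulated data so that it runs on its own, so its estimates will not match the table above. The predictor names follow the table, while the response column name fatality_rate and the simulation settings are assumptions.

```r
# Simulated stand-in data with the same predictor names as the table
set.seed(42)
n <- 500
sim <- data.frame(
  Aboard = rpois(n, 25) + 1,
  Year = sample(1940:2009, n, replace = TRUE),
  temperature_mean = rnorm(n, 12, 10),
  precipitation_sum = rexp(n, 0.5),
  wind_speed_max = rnorm(n, 18, 6)
)

# Simulated response clipped to the valid range of a proportion
sim$fatality_rate <- pmin(1, pmax(0, 0.9 - 0.002 * sim$Aboard + rnorm(n, 0, 0.3)))

# Fit the linear model and keep the estimate and p-value columns
fit <- lm(
  fatality_rate ~ Aboard + Year + temperature_mean +
    precipitation_sum + wind_speed_max,
  data = sim
)
coef_table <- summary(fit)$coefficients[, c("Estimate", "Pr(>|t|)")]
print(round(coef_table, 5))
```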

Regression Findings

The overall regression model was statistically significant:

Adjusted R-squared    F-statistic p-value    Residual standard error
0.108                 < 2.2e-16              0.313

The adjusted R-squared of approximately 0.108 indicates that the model explains only a modest portion of the variation in fatality rate.
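These three overall statistics can be pulled from a fitted lm object as sketched below. The model here uses R's built-in mtcars data purely as a stand-in, so the numbers differ from those reported above. The F-statistic p-value is not stored directly in the summary and has to be computed from the stored F value and its degrees of freedom.

```r
# Stand-in model on built-in data; not the project's regression
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

# fstatistic is a named vector: value, numdf, dendf
f <- s$fstatistic

overall <- c(
  adj_r_squared = s$adj.r.squared,
  f_p_value = unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)),
  residual_se = s$sigma
)
print(overall)
```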

Key Statistical Interpretation

Persons aboard and mean temperature were statistically significant within the model.

However, the persons aboard result should be interpreted cautiously because fatality rate was calculated using persons aboard. Therefore, part of this relationship may reflect the mathematical structure of the response variable rather than a fully independent operational relationship.
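The caution about persons aboard can be illustrated with a small simulation. Here fatality counts are drawn without reference to occupancy and only capped so they cannot exceed persons aboard. Dividing by persons aboard then produces a negative slope on its own. The distributions are arbitrary assumptions chosen for illustration and say nothing about the real data.

```r
set.seed(123)
n <- 2000

# Occupancy, with at least one person aboard
aboard <- rpois(n, 30) + 1

# Fatality counts drawn independently of occupancy, then capped at occupancy
fatalities <- pmin(aboard, rpois(n, 15))

# The ratio places the predictor in the denominator of the response
rate <- fatalities / aboard

# Regressing the ratio on its own denominator
slope <- coef(lm(rate ~ aboard))[["aboard"]]
print(slope)
```

The slope is negative in this simulation even though occupancy played no role in generating the fatality counts beyond the cap, which is why the significant Aboard coefficient does not by itself demonstrate an operational effect.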

Year, precipitation, and maximum wind speed were not statistically significant within the fitted model.

Model Diagnostics

Interpreting Diagnostics

The diagnostic plots suggest that the model captures some structure, but several limitations remain.

The residuals versus fitted plot shows clustering and uneven spread, while the Q-Q plot shows departures from normality, especially in the tails.

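This is not unexpected: fatality rate is bounded between 0 and 1, many observations sit at or near 1.0, and aviation accident severity is highly variable, so the constant-variance and normality assumptions of linear regression are unlikely to hold closely.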
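The two diagnostic plots discussed here can be drawn with base R's plot method for lm objects. The sketch below uses simulated data with a response clipped to the interval from 0 to 1, as an assumption meant to mimic a bounded outcome; the object names are hypothetical.

```r
# Simulated bounded response; not the project's data
set.seed(7)
aboard <- rpois(300, 25) + 1
rate <- pmin(1, pmax(0, 0.9 - 0.002 * aboard + rnorm(300, 0, 0.3)))
fit_demo <- lm(rate ~ aboard)

# Draw the two diagnostics side by side, then restore plot settings
op <- par(mfrow = c(1, 2))
plot(fit_demo, which = 1)  # residuals versus fitted values
plot(fit_demo, which = 2)  # normal Q-Q plot of standardized residuals
par(op)

# Share of observations sitting exactly on the upper bound
share_at_one <- mean(rate == 1)
print(share_at_one)
```

Observations piled up on a bound all share the same response value, which is one way the clustering seen in a residuals versus fitted plot can arise.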

Conclusion

This project combined aviation accident records, geocoding, and API-based historical weather data within a reproducible Quarto workflow.

The findings suggest that aviation accident severity varies widely from case to case. Persons aboard and mean temperature were statistically significant, but the model explained only a modest portion of fatality rate variation (adjusted R-squared of approximately 0.108).

As such, accident severity is likely associated with numerous interacting operational, environmental, mechanical, and human-related factors rather than any single contributing condition.

References

Grandi, S. (n.d.). Airplane crashes since 1908 [Data set]. Kaggle. https://www.kaggle.com/datasets/saurograndi/airplane-crashes-since-1908

Open-Meteo. (n.d.). Historical weather API. https://open-meteo.com/en/docs/historical-weather-api

Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science (2nd ed.). O’Reilly Media. https://r4ds.hadley.nz/

LLM Used

OpenAI. (2026). ChatGPT [Large language model]. https://chat.openai.com/. Accessed May 9, 2026.