2026-05-12
This project examines historical aviation accident data to explore factors associated with the severity of crash outcomes. The analysis combines a structured Kaggle aviation accident dataset with historical weather data retrieved through the Open-Meteo Historical Weather API.
Because the accident dataset stores locations as free text rather than geographic coordinates, the workflow had three steps: narrowing the dataset to records with clearer locations, geocoding those locations to obtain latitude and longitude, and using the coordinates to retrieve weather variables such as mean temperature, precipitation, and maximum wind speed.
The analysis used feature engineering, exploratory visualization, and linear regression modelling to examine relationships between fatality rate and selected variables, including persons aboard, year, and weather conditions.
The regression model was statistically significant, but its adjusted R-squared value suggested that it explained only a modest portion of variation in fatality rate. Persons aboard and mean temperature were statistically significant within the model, though the persons aboard result requires caution because fatality rate was calculated using the persons aboard variable.
My interest in this topic came from the increased public attention surrounding aviation crashes, near misses, and other safety-related incidents.
While this attention may reflect increased reporting rather than an actual rise in incidents, it provided a useful motivation for examining historical crash data more closely.
Research focus:
Which selected aviation and weather-related variables appear associated with aviation accident fatality severity?
The project uses two different data source types:
- Structured CSV dataset: Kaggle aviation accident records, including date, location, operator, aircraft type, persons aboard, and fatalities.
- API-based weather data: Open-Meteo Historical Weather API, used to retrieve mean temperature, precipitation sum, and maximum wind speed by accident date and location.
The project followed a reproducible data science workflow:
| Stage | Records |
|---|---|
| Original Kaggle dataset | 5268 |
| After 1940 filter | 4739 |
| Saved geocoded dataset | 3128 |
| After removing missing coordinates | 2603 |
| Weather-enriched analysis dataset | 2595 |
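To make the reduction easier to read, the counts in the table can be turned into retention percentages. The following is a minimal sketch in base R; the data frame and column names (`stages`, `pct_of_original`, `dropped`) are illustrative and not taken from the project code.

```r
# Record counts copied from the table above.
stages <- data.frame(
  stage = c(
    "Original Kaggle dataset",
    "After 1940 filter",
    "Saved geocoded dataset",
    "After removing missing coordinates",
    "Weather-enriched analysis dataset"
  ),
  records = c(5268, 4739, 3128, 2603, 2595),
  stringsAsFactors = FALSE
)

# Share of the original dataset still present at each stage.
stages$pct_of_original <- round(100 * stages$records / stages$records[1], 1)

# Records dropped relative to the previous stage.
stages$dropped <- c(0, -diff(stages$records))

print(stages)
```

On these counts, about 49.3% of the original records reach the final analysis dataset, and the largest single drop (1,611 records) falls between the 1940 filter and the saved geocoded dataset.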
The dataset was narrowed to records that could support the later geocoding and weather integration steps.
One challenge was that the location field was stored as text rather than latitude and longitude.
To improve geocoding reliability, vague records were removed before coordinates were retrieved.
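The exact rule used to decide which records were too vague is not stated here. As a purely hypothetical illustration, a filter of this kind could keep only locations that look like "place, region" and drop open-water or approximate positions. The function name `is_clear_location()`, the pattern, and the sample strings are all assumptions made for this sketch.

```r
# Made-up location strings for illustration; not records from the dataset.
locations <- c(
  "Springfield, Illinois",
  "Atlantic Ocean",
  "Near Paris, France",
  "Over the North Sea",
  NA,
  "",
  "Seattle, Washington"
)

# Hypothetical rule: keep non-empty locations that contain a comma and
# do not mention open water or an approximate position.
is_clear_location <- function(x) {
  has_text <- !is.na(x) & nzchar(trimws(x))
  has_comma <- grepl(",", x, fixed = TRUE)
  is_vague <- grepl(
    "\\b(ocean|sea)\\b|^(near|off|over)\\b",
    x,
    ignore.case = TRUE,
    perl = TRUE
  )
  has_text & has_comma & !is_vague
}

keep <- is_clear_location(locations)
clear_locations <- locations[keep]
```

A stricter or looser rule would change how many records survive to the geocoding step, so the counts in the table above depend on the project's actual criteria rather than on this sketch.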
Geocoding and retrieving weather data for thousands of records were time-consuming steps.
To keep the presentation and final report reproducible, the long-running steps were executed once, saved to CSV, uploaded to GitHub, and then re-imported during rendering.
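The run-once, save, and re-import pattern can be sketched as a small helper. This is a minimal base R illustration under assumed names (`read_or_compute()`, `slow_step()`, and a temporary file path); the project itself re-imported its saved CSV files from GitHub.

```r
# Run compute() only when the cached CSV does not exist yet;
# otherwise re-import the saved result. Names are hypothetical.
read_or_compute <- function(path, compute) {
  if (file.exists(path)) {
    return(read.csv(path, stringsAsFactors = FALSE))
  }
  result <- compute()
  write.csv(result, path, row.names = FALSE)
  result
}

# Demonstration with a stand-in for a long-running step.
cache_file <- tempfile(fileext = ".csv")
calls <- 0
slow_step <- function() {
  calls <<- calls + 1  # count how often the expensive step actually runs
  data.frame(id = 1:3, latitude = c(48.86, 40.71, 35.68))
}

first <- read_or_compute(cache_file, slow_step)   # computes and saves
second <- read_or_compute(cache_file, slow_step)  # reads the saved CSV
```

Because `read.csv()` also accepts a URL, the same idea applies when the saved file is hosted remotely and read back during rendering.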
The main response variable was fatality rate, calculated as:
\[ \text{Fatality Rate} = \frac{\text{Fatalities}}{\text{Aboard}} \]
This allowed accident severity to be examined proportionally rather than only through raw fatality totals.
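A minimal sketch of this calculation in base R follows. The column name `fatality_rate` and the sample values are assumptions for illustration; `Aboard` and `Fatalities` follow the variable names used in the formula. The sketch also guards against records where the number aboard is missing or zero, since the ratio is undefined there.

```r
# Made-up example records; not values from the dataset.
crashes <- data.frame(
  Aboard = c(2, 44, 0, NA, 120),
  Fatalities = c(1, 44, 0, 3, 30)
)

# Fatality rate is undefined when Aboard is missing or zero, so those
# rows are set to NA instead of producing NaN or Inf.
crashes$fatality_rate <- ifelse(
  !is.na(crashes$Aboard) & crashes$Aboard > 0,
  crashes$Fatalities / crashes$Aboard,
  NA_real_
)
```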
The Open-Meteo Historical Weather API was queried using each accident’s latitude, longitude, and date.
A custom get_weather_data() function was created to automate the retrieval of historical weather conditions for each crash record.
Key steps:
The function was then applied across the aviation dataset using pmap(), and the returned weather variables were appended back into the crash records.
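The retrieval step might look roughly like the sketch below. It is not the project's actual `get_weather_data()`: the endpoint, the daily variable names, the helper names (`build_weather_url()`, `first_or_na()`), and the `fetch` argument are assumptions, and the sample coordinates and dates are made up. A stand-in `fake_fetch()` replaces the network call so the example runs offline, and base R `Map()` is used in place of `pmap()` to avoid a package dependency.

```r
# Builds the request URL for one record (assumed Open-Meteo archive endpoint).
build_weather_url <- function(latitude, longitude, date) {
  day <- format(as.Date(date), "%Y-%m-%d")
  sprintf(
    paste0(
      "https://archive-api.open-meteo.com/v1/archive",
      "?latitude=%.4f&longitude=%.4f&start_date=%s&end_date=%s",
      "&daily=temperature_2m_mean,precipitation_sum,wind_speed_10m_max",
      "&timezone=UTC"
    ),
    latitude, longitude, day, day
  )
}

# First element of a field as a number, or NA when the field is absent.
first_or_na <- function(x) if (length(x) >= 1) as.numeric(x[[1]]) else NA_real_

# Returns a one-row data frame; a failed request yields NA values so that
# one bad record does not stop the whole run.
get_weather_data <- function(latitude, longitude, date,
                             fetch = jsonlite::fromJSON) {
  tryCatch({
    daily <- fetch(build_weather_url(latitude, longitude, date))$daily
    data.frame(
      temperature_mean = first_or_na(daily$temperature_2m_mean),
      precipitation_sum = first_or_na(daily$precipitation_sum),
      wind_speed_max = first_or_na(daily$wind_speed_10m_max)
    )
  }, error = function(e) {
    data.frame(
      temperature_mean = NA_real_,
      precipitation_sum = NA_real_,
      wind_speed_max = NA_real_
    )
  })
}

# Made-up records and an offline stand-in for the API response.
crashes <- data.frame(
  latitude = c(48.8566, 40.7128),
  longitude = c(2.3522, -74.0060),
  date = c("1950-06-01", "1960-12-16"),
  stringsAsFactors = FALSE
)
fake_fetch <- function(url) {
  list(daily = list(
    temperature_2m_mean = 12.3,
    precipitation_sum = 0,
    wind_speed_10m_max = 18.5
  ))
}

# Apply the function row by row and append the weather columns.
weather <- do.call(rbind, Map(
  function(lat, lon, day) get_weather_data(lat, lon, day, fetch = fake_fetch),
  crashes$latitude, crashes$longitude, crashes$date
))
enriched <- cbind(crashes, weather)
```

With purrr loaded, the row-wise step could instead be written as `pmap(list(crashes$latitude, crashes$longitude, crashes$date), get_weather_data)`, followed by binding the resulting rows together.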
Rather than presenting every summary statistic, the main values used to understand the retained data were:
This indicates that the retained dataset is heavily represented by severe aviation accidents.
The visualization shows that many accidents in the retained dataset had very high fatality rates, with many observations clustered near 1.0.
The fitted trend line shows a slight downward pattern over time, which may point to gradual improvements in aviation safety, aircraft technology, and operational practices.
However, the wide spread of observations shows that severity continues to vary substantially across individual cases.
The visualization shows a widely scattered relationship between maximum wind speed and fatality rate.
The fitted trend line suggests a slight positive relationship, where higher wind speeds may be associated with somewhat more severe crash outcomes on average.
However, the broad spread of observations shows that wind conditions alone do not explain accident severity.
A linear regression model was fitted to examine whether selected variables appeared associated with fatality rate.
| Variable | Estimate | p-value |
|---|---|---|
| (Intercept) | 1.31699 | 0.0455 |
| Aboard | -0.00243 | < 0.0001 |
| Year | -0.00023 | 0.4900 |
| temperature_mean | -0.00131 | 0.0176 |
| precipitation_sum | 0.00027 | 0.6530 |
| wind_speed_max | 0.00072 | 0.3300 |
The overall regression model was statistically significant:
| Adjusted R-squared | F-statistic p-value | Residual standard error |
|---|---|---|
| 0.108 | < 2.2e-16 | 0.313 |
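For readers who want to see where these quantities come from, the sketch below fits a model with the same formula and extracts the same statistics. The data are simulated stand-ins (random values, not the project's data), so the numbers it produces will not match the tables; the object names (`analysis_data`, `fatality_rate`, `coef_table`) are assumptions, while the predictor names follow the coefficient table.

```r
# Simulated stand-in data so the sketch runs on its own.
# These values are random and are NOT the project's data.
set.seed(1)
n <- 200
analysis_data <- data.frame(
  Aboard = sample(2:200, n, replace = TRUE),
  Year = sample(1940:2000, n, replace = TRUE),
  temperature_mean = rnorm(n, mean = 12, sd = 10),
  precipitation_sum = rexp(n, rate = 0.5),
  wind_speed_max = runif(n, min = 0, max = 60)
)
analysis_data$fatality_rate <- runif(n)  # placeholder response in [0, 1]

model <- lm(
  fatality_rate ~ Aboard + Year + temperature_mean +
    precipitation_sum + wind_speed_max,
  data = analysis_data
)
fit <- summary(model)

# Estimates and p-values, in the layout of the coefficient table.
coef_table <- fit$coefficients[, c("Estimate", "Pr(>|t|)")]

# Overall fit statistics, in the layout of the summary table.
adj_r_squared <- fit$adj.r.squared
residual_se <- fit$sigma
f <- fit$fstatistic
f_p_value <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
```

Calling `plot(model, which = 1:2)` on a fitted model of this kind draws the residuals versus fitted and Q-Q plots used for diagnostics.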
The adjusted R-squared of approximately 0.108 indicates that the model explains only about 11% of the variation in fatality rate, a modest share.
Persons aboard and mean temperature were statistically significant within the model.
However, the persons aboard result should be interpreted cautiously because fatality rate was calculated using persons aboard. Therefore, part of this relationship may reflect the mathematical structure of the response variable rather than a fully independent operational relationship.
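One way to see this coupling, as a simplified sketch: write F for fatalities and A for persons aboard, hold F fixed, and treat A as continuous (both simplifying assumptions). Then

\[ \frac{\partial}{\partial A}\left(\frac{F}{A}\right) = -\frac{F}{A^{2}} \le 0 \]

so the fatality rate falls as the number aboard rises unless fatalities grow in proportion. For example, 10 fatalities give a rate of 0.5 with 20 aboard but 0.05 with 200 aboard. A negative coefficient on persons aboard is therefore partly expected from the definition of the response alone.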
Year, precipitation, and maximum wind speed were not statistically significant within the fitted model.
The diagnostic plots suggest that the model captures some structure, but several limitations remain.
The residuals versus fitted plot shows clustering and uneven spread, while the Q-Q plot shows departures from normality, especially in the tails.
This is not unexpected because fatality rate is bounded between 0 and 1, and aviation accident severity is highly variable.
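The consequence of a bounded response can be illustrated with the reported coefficients. The sketch below plugs two hypothetical flights into the fitted linear equation; the inputs are assumptions chosen for illustration, and because the coefficients are rounded as published, the predictions are approximate.

```r
# Rounded coefficients copied from the regression table above.
coefs <- c(
  intercept = 1.31699,
  Aboard = -0.00243,
  Year = -0.00023,
  temperature_mean = -0.00131,
  precipitation_sum = 0.00027,
  wind_speed_max = 0.00072
)

# Linear prediction for one hypothetical flight.
predict_rate <- function(aboard, year, temp, precip, wind) {
  unname(
    coefs["intercept"] +
      coefs["Aboard"] * aboard +
      coefs["Year"] * year +
      coefs["temperature_mean"] * temp +
      coefs["precipitation_sum"] * precip +
      coefs["wind_speed_max"] * wind
  )
}

# Hypothetical inputs: a small flight in 1950 and a very large one in 2000.
small_flight <- predict_rate(aboard = 10, year = 1950, temp = 10, precip = 0, wind = 0)
large_flight <- predict_rate(aboard = 500, year = 2000, temp = 20, precip = 0, wind = 0)
```

The small flight gets a predicted rate of about 0.83, while the large flight gets about -0.38, which is impossible for a proportion. A straight-line model places no limit on its predictions, which is consistent with the uneven residual pattern described above.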
This project combined aviation accident records, geocoding, and API-based historical weather data within a reproducible Quarto workflow.
The findings suggest that aviation accident severity is highly variable and only partly captured by the selected variables. Some variables were statistically significant, but the model explained only a modest portion of fatality rate variation.
As such, accident severity is likely shaped by numerous interacting operational, environmental, mechanical, and human-related factors rather than by any single contributing condition.
Grandi, S. (n.d.). Airplane crashes since 1908 [Data set]. Kaggle. https://www.kaggle.com/datasets/saurograndi/airplane-crashes-since-1908
Open-Meteo. (n.d.). Historical weather API. https://open-meteo.com/en/docs/historical-weather-api
Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science (2nd ed.). O’Reilly Media. https://r4ds.hadley.nz/
OpenAI. (2026). ChatGPT [Large language model]. https://chat.openai.com/. Accessed May 9, 2026.