Final Project: Analyzing Factors Contributing to Aviation Accident Severity

Author

Brandon Chanderban

Published

May 9, 2026

Abstract

This project examines historical aviation accident data to explore factors associated with the severity of crash outcomes. The analysis combines a structured Kaggle aviation accident dataset with historical weather data retrieved through the Open-Meteo Historical Weather API. Since the accident dataset stores locations as free-text rather than geographic coordinates, the workflow involved narrowing the dataset to clearer location records, applying geocoding to obtain latitude and longitude values, and using those coordinates to retrieve weather variables such as mean temperature, precipitation, and maximum wind speed.

The analysis then used feature engineering, exploratory visualization, and linear regression modeling to examine relationships between fatality rate and selected variables, including persons aboard, year, and weather conditions. The results showed that fatality rates were generally high within the retained dataset, although a slight downward trend appeared over time. The regression model was statistically significant, but its adjusted R-squared value suggested that it explained only a modest portion of variation in fatality rate. Persons aboard and mean temperature were statistically significant within the model, though the persons aboard result should be interpreted cautiously because fatality rate was calculated using the persons aboard variable.

As such, the findings suggest that aviation accident severity is associated with multiple interacting factors beyond the limited variables included in this analysis. The project also demonstrates a reproducible workflow through Quarto and RStudio, including data cleaning, API integration, regression analysis, and presentation generation.

Introduction

This project examines historical aviation accident data to explore factors that may be associated with the severity of crash outcomes. The topic was selected due to the increased public attention surrounding aviation crashes, near misses, and other safety-related incidents. While this attention may reflect increased reporting rather than an actual rise in incidents, it still provides a useful motivation for examining historical crash data more closely.

The analysis uses two different data source types: a structured Kaggle CSV containing aviation accident records and API-based historical weather data from Open-Meteo. Since the Kaggle dataset stores locations as free-text entries rather than coordinates, the workflow will involve narrowing the dataset to usable records, applying geocoding, retrieving weather variables, and then using exploratory analysis and regression modeling to examine patterns in accident severity.

Code Base / Body

The first step involves loading the packages needed for data cleaning, geocoding, API access, visualization, and regression analysis.

Code
library(tidyverse)
library(lubridate)
library(httr2)
library(jsonlite)
library(tidygeocoder)
library(broom)
library(scales)

Importing the Aviation Accident Dataset

The aviation accident dataset was first downloaded from Kaggle and then uploaded to a personal GitHub repository. This allows the dataset to be imported directly through the raw GitHub URL, making the workflow more reproducible and avoiding references to any local file paths.

Code
aviation_url <- "https://raw.githubusercontent.com/bkchanderban/CUNY_SPS/refs/heads/main/DATA607/DATA607/Final%20Project/airplane_crashes_and_fatalities_since_1908.csv"

aviation_raw <- read.csv(aviation_url)

glimpse(aviation_raw)
Rows: 5,268
Columns: 13
$ Date         <chr> "09/17/1908", "07/12/1912", "08/06/1913", "09/09/1913", "…
$ Time         <chr> "17:18", "6:30", "", "18:30", "10:30", "1:00", "15:20", "…
$ Location     <chr> "Fort Myer, Virginia", "AtlantiCity, New Jersey", "Victor…
$ Operator     <chr> "Military - U.S. Army", "Military - U.S. Navy", "Private"…
$ Flight..     <chr> "", "", "-", "", "", "", "", "", "", "", "", "", "", "", …
$ Route        <chr> "Demonstration", "Test flight", "", "", "", "", "", "", "…
$ Type         <chr> "Wright Flyer III", "Dirigible", "Curtiss seaplane", "Zep…
$ Registration <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "…
$ cn.In        <chr> "1", "", "", "", "", "", "", "", "", "", "", "", "", "", …
$ Aboard       <int> 2, 5, 1, 20, 30, 41, 19, 20, 22, 19, 28, 20, 20, 23, 21, …
$ Fatalities   <int> 1, 5, 1, 14, 30, 21, 19, 20, 22, 19, 27, 20, 20, 23, 21, …
$ Ground       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Summary      <chr> "During a demonstration flight, a U.S. Army flyer flown b…

At this stage, the dataset remains in its original imported form. The next step will involve inspecting the variable names and cleaning the date, location, and numeric fields for the analysis.
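The column names `Flight..` and `cn.In` in the output above are a side effect of `read.csv()` converting headers into syntactically valid R names. The sketch below illustrates that behavior; the header spellings `Flight #` and `cn/In` are an assumption about the raw CSV (the file itself is not shown in this report), and the one-row table is made up for illustration.

```r
# Hypothetical one-row CSV; the header spellings are assumed, not taken from the report
csv_text <- "Date,Flight #,cn/In\n09/17/1908,,1"

# Default behavior: check.names = TRUE replaces invalid characters with dots
mangled <- read.csv(text = csv_text)
names(mangled)

# check.names = FALSE keeps the headers exactly as written in the file
verbatim <- read.csv(text = csv_text, check.names = FALSE)
names(verbatim)
```

Keeping the default is convenient here because the dotted names can be used without backticks, at the cost of looking slightly different from the source file.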

Inspecting and Preparing the Dataset

Before the analysis can proceed, the structure of the dataset must be inspected to identify the variables required for the project’s scope. Particular attention is given to the date, location, persons aboard, and fatality fields, since these variables are later used for feature engineering, geocoding, weather integration, and regression analysis.

Code
names(aviation_raw)
 [1] "Date"         "Time"         "Location"     "Operator"     "Flight.."    
 [6] "Route"        "Type"         "Registration" "cn.In"        "Aboard"      
[11] "Fatalities"   "Ground"       "Summary"     
Code
summary(aviation_raw)
     Date               Time             Location           Operator        
 Length:5268        Length:5268        Length:5268        Length:5268       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
   Flight..            Route               Type           Registration      
 Length:5268        Length:5268        Length:5268        Length:5268       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
    cn.In               Aboard         Fatalities         Ground        
 Length:5268        Min.   :  0.00   Min.   :  0.00   Min.   :   0.000  
 Class :character   1st Qu.:  5.00   1st Qu.:  3.00   1st Qu.:   0.000  
 Mode  :character   Median : 13.00   Median :  9.00   Median :   0.000  
                    Mean   : 27.55   Mean   : 20.07   Mean   :   1.609  
                    3rd Qu.: 30.00   3rd Qu.: 23.00   3rd Qu.:   0.000  
                    Max.   :644.00   Max.   :583.00   Max.   :2750.000  
                    NA's   :22       NA's   :12       NA's   :22        
   Summary         
 Length:5268       
 Class :character  
 Mode  :character  
                   
                   
                   
                   

Cleaning and Transforming the Dataset

Before beginning the exploratory analysis, several variables must be cleaned and transformed into formats more suitable for analysis. In particular, the date field is converted into a proper date format and a year variable is extracted from it, while further variables such as fatality rate are derived later through feature engineering.

Code
aviation_clean <- aviation_raw %>%
  mutate(
    # Parse the month/day/year strings into Date values
    Date = mdy(Date),
    # Create Year variable
    Year = year(Date)
  ) %>%
  # Restrict data to years with more reliable weather coverage
  # (filter() also drops rows whose Year is NA because the date failed to parse)
  filter(Year >= 1940)

glimpse(aviation_clean)
Rows: 4,739
Columns: 14
$ Date         <date> 1940-08-09, 1941-06-03, 1940-01-15, 1940-03-01, 1940-04-…
$ Time         <chr> "", "17:00", "", "", "", "", "14:00", "", "", "10:15", ""…
$ Location     <chr> "Hannover, Germany", "AtlantiOcean", "Denpasar, Indonesia…
$ Operator     <chr> "Deutsche Lufthansa", "Great Western and Southern Air Lin…
$ Flight..     <chr> "", "", "", "", "", "", "", "", "", "", "", "19", "", "",…
$ Route        <chr> "", "", "", "Jask to Sharjah", "Perth, Scotland - London,…
$ Type         <chr> "Douglas DC-2-115H", "de Havilland DH-84 Dragon", "Lockhe…
$ Registration <chr> "D-AIAV", "G-ACPY", "PK-AFO", "G-AAGX", "G-AFKD", "", "OH…
$ cn.In        <chr> "1366", "6076", "1415", "HP42/1", "1484", "", "5494", "22…
$ Aboard       <int> 13, 6, 9, 8, 3, 5, 9, 1, NA, 10, 18, 25, 14, 15, 10, 29, …
$ Fatalities   <int> 2, 6, 8, 8, 3, 5, 9, 1, NA, 10, 14, 25, 9, 2, 10, 29, 18,…
$ Ground       <int> 0, 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Summary      <chr> "Pilot error.", "Shot down by a He-111 German military ai…
$ Year         <dbl> 1940, 1941, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 194…

The dataset has now been narrowed to records from 1940 onward (4,739 of the original 5,268 rows), since the historical weather API used downstream in the analysis provides more reliable coverage for this period. In addition, a separate year variable has been extracted from the original date field to support later trend analysis and regression modeling.
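The date handling can be checked on a few values. The sketch below uses base R's `as.Date()` in place of `mdy()` and `year()` so that it runs without extra packages; the third input is a deliberately malformed string added for illustration.

```r
raw_dates <- c("09/17/1908", "08/09/1940", "not a date")

# Parse month/day/year text; strings that do not match the format become NA
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Extract the calendar year as an integer
years <- as.integer(format(parsed, "%Y"))

# Keep only rows with a known year of 1940 or later
keep <- !is.na(years) & years >= 1940
raw_dates[keep]
```

`dplyr::filter()` drops rows where the condition evaluates to `NA`, which is why the pipeline needs no explicit `is.na()` check; base subsetting needs it spelled out.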

Preparing Location Data for Geocoding

Since the weather API requires latitude and longitude coordinates, the location field must first be cleaned before geocoding can be performed. Several records contain vague or incomplete entries such as references to oceans, gulfs, or airborne locations, which are unlikely to return reliable coordinates. As such, part of the preparation process will involve removing overly ambiguous locations and standardizing the remaining entries where possible.

Code
aviation_locations <- aviation_clean %>%
  
  # Remove rows with missing locations (blank fields are imported by read.csv() as "" rather than NA, so they pass this check)
  filter(!is.na(Location)) %>%
  
  # Remove vague or unusable location entries
  filter(
    !str_detect(Location, regex(
      "Ocean|Sea|Gulf|River|Unknown|Near|Off",
      ignore_case = TRUE
    ))
  ) %>%
  
  # Standardize spacing
  mutate(
    Location = str_squish(Location)
  )

# Preview cleaned locations
aviation_locations %>%
  select(Location) %>%
  slice_head(n = 10)
                 Location
1       Hannover, Germany
2     Denpasar, Indonesia
3  El Segundo, California
4           Cluj, Romania
5         Berlin, Germany
6         Brauna, Germany
7  Rio de Janeiro, Brazil
8       Chicago, Illinois
9   Armstrong, ON, Canada
10       Atlanta, Georgia

Now, the dataset has been narrowed to records containing more usable geographic information. While some location inconsistencies may still remain, this cleaning step helps improve the likelihood of obtaining successful coordinate matches during the geocoding process.
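One property of the filter is worth illustrating: the pattern is matched anywhere in the string, so it can also remove place names that merely contain one of the terms. The sketch below compares that substring match with a whole-word variant using `\\b` boundaries. The whole-word version is an alternative shown for comparison, not the filter used in this project, and the example strings are illustrative apart from `AtlantiOcean`, which appears in the dataset preview earlier in the report.

```r
# Illustrative strings; only "AtlantiOcean" is taken from the dataset preview in this report
locations <- c("Seattle, Washington", "North Sea", "Off Cape Cod",
               "Offenbach, Germany", "AtlantiOcean")

terms <- "Ocean|Sea|Gulf|River|Unknown|Near|Off"

# Substring match, as in the filter above: any occurrence of a term counts
substring_hit <- grepl(terms, locations, ignore.case = TRUE)

# Whole-word match: the term must stand alone as a word
word_hit <- grepl(paste0("\\b(", terms, ")\\b"), locations,
                  ignore.case = TRUE, perl = TRUE)

data.frame(locations, substring_hit, word_hit)
```

The substring version flags all five strings, while the whole-word version keeps the two city names but also lets the garbled `AtlantiOcean` through. Neither choice is strictly better: the substring match trades some valid records for robustness against malformed entries.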

Geocoding the Aviation Accident Data Locations

After cleaning the location field, the next step involves converting the remaining location entries into geographic coordinates. This process, known as geocoding, allows latitude and longitude values to be assigned to each accident record so that historical weather data can later be retrieved through the Open-Meteo API.

Code
aviation_geocoded <- aviation_locations %>%
  # Retrieve latitude and longitude coordinates
  geocode(
    address = Location,
    method = "osm",
    lat = latitude,
    long = longitude
  )

Since the geocoding process can take a significant amount of time to complete, the completed geocoded dataset was saved and uploaded to GitHub. The geocoding chunk above is therefore retained to show the method, but is not evaluated during rendering.
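The save-once, reuse-later idea can be sketched as a small caching helper. Everything here is hypothetical scaffolding: `cached_csv()` and `slow_geocode()` are illustrative names, the cache lives in a temporary file rather than on GitHub, and the stand-in function returns the Hannover coordinates shown in this report instead of calling a geocoding service.

```r
# Hypothetical helper: run an expensive step once, then reuse the saved CSV
cached_csv <- function(path, compute) {
  if (file.exists(path)) {
    return(read.csv(path))
  }
  result <- compute()
  write.csv(result, path, row.names = FALSE)
  result
}

# Stand-in for the slow geocoding call; counts how often it is invoked
calls <- 0
slow_geocode <- function() {
  calls <<- calls + 1
  data.frame(Location = "Hannover, Germany",
             latitude = 52.374478, longitude = 9.738553)
}

cache_path <- tempfile(fileext = ".csv")
first  <- cached_csv(cache_path, slow_geocode)  # computes and writes the file
second <- cached_csv(cache_path, slow_geocode)  # reads the file; no second call
calls
```

In a Quarto document the same effect can be achieved by marking the expensive chunk as not evaluated and reading the saved file in the next chunk, which is the approach taken here.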

Importing the Saved Geocoded Dataset

The saved geocoded dataset is imported from GitHub so that the remaining analysis can be reproduced without repeatedly sending requests to the geocoding service.

Code
geocoded_url <- "https://raw.githubusercontent.com/bkchanderban/CUNY_SPS/refs/heads/main/DATA607/DATA607/Final%20Project/aviation_geocoded.csv"

aviation_geocoded <- read.csv(geocoded_url)

aviation_geocoded %>%
  select(Location, latitude, longitude) %>%
  slice_head(n = 10)
                 Location   latitude   longitude
1       Hannover, Germany  52.374478    9.738553
2     Denpasar, Indonesia  -8.665335  115.217619
3  El Segundo, California  33.917028 -118.415634
4           Cluj, Romania  46.769379   23.589954
5         Berlin, Germany  52.517389   13.395131
6         Brauna, Germany  51.281811   14.037996
7  Rio de Janeiro, Brazil -22.911014  -43.209373
8       Chicago, Illinois  41.875562  -87.624421
9   Armstrong, ON, Canada  50.302131  -89.037370
10       Atlanta, Georgia  33.754466  -84.389815
Code
nrow(aviation_geocoded)
[1] 3128

Removing the Records Without Coordinates

Although most of the locations were successfully geocoded, some records did not return usable latitude or longitude values because of ambiguous or incomplete location descriptions. Since geographic coordinates are required for retrieving historical weather data, these records are excluded from the weather integration stage, which reduces the dataset from 3,128 to 2,603 records.

Code
aviation_geocoded <- aviation_geocoded %>%
  filter(
    !is.na(latitude),
    !is.na(longitude)
  )

nrow(aviation_geocoded)  
[1] 2603

Feature Engineering

After narrowing the dataset to records with usable coordinates, additional variables will be created to support the analysis. These include fatality rate, survival count, decade, and a binary fatal accident indicator. These variables help translate the original crash records into measures that can be used for exploratory analysis and regression modeling.

Code
aviation_features <- aviation_geocoded %>%
  mutate(
    fatality_rate = Fatalities / Aboard,
    survival_count = Aboard - Fatalities,
    decade = floor(Year/10) * 10,
    fatal_accident = if_else(Fatalities > 0, 1, 0)
  ) %>%
  filter(
    !is.na(Aboard),
    !is.na(Fatalities),
    Aboard > 0,
    fatality_rate >= 0,
    fatality_rate <= 1
  )

glimpse(aviation_features)
Rows: 2,595
Columns: 20
$ Date           <chr> "1940-08-09", "1940-01-15", "1940-06-02", "1940-08-23",…
$ Time           <chr> "", "", "", "", "", "", "", "17:48", "2:00", "11:50", "…
$ Location       <chr> "Hannover, Germany", "Denpasar, Indonesia", "El Segundo…
$ Operator       <chr> "Deutsche Lufthansa", "KNILM", "Douglas Aircraft Compan…
$ Flight..       <chr> "", "", "", "", "", "", "", "21", "", "21", "", "", "",…
$ Route          <chr> "", "", "Test flight", "", "", "", "Rio de Janeiro - Sa…
$ Type           <chr> "Douglas DC-2-115H", "Lockheed 14 Super Electra", "Doug…
$ Registration   <chr> "D-AIAV", "PK-AFO", "", "YR-PAF", "D-AAIH", "D-AVMF", "…
$ cn.In          <chr> "1366", "1415", "", "1986", "1973", "10", "", "2175", "…
$ Aboard         <int> 13, 9, 5, 18, 15, 29, 18, 16, 12, 16, 10, 15, 22, 22, 1…
$ Fatalities     <int> 2, 8, 5, 14, 2, 29, 18, 10, 12, 9, 10, 15, 22, 22, 10, …
$ Ground         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Summary        <chr> "Pilot error.", "", "Crashed and burned during a govern…
$ Year           <int> 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1941, 1…
$ latitude       <dbl> 52.374478, -8.665335, 33.917028, 46.769379, 52.517389, …
$ longitude      <dbl> 9.738553, 115.217619, -118.415634, 23.589954, 13.395131…
$ fatality_rate  <dbl> 0.1538462, 0.8888889, 1.0000000, 0.7777778, 0.1333333, …
$ survival_count <int> 11, 1, 0, 4, 13, 0, 0, 6, 0, 7, 0, 0, 0, 0, 0, 13, 0, 0…
$ decade         <dbl> 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1…
$ fatal_accident <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…

Following the feature engineering process, the dataset contains several derived variables that will be used throughout the remainder of the analysis. The fatality rate is particularly important because it allows accident severity to be examined relative to the number of persons aboard rather than only through raw fatality counts.
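As a quick check of the arithmetic, the sketch below recomputes the derived variables in base R for the first three retained records shown in the `glimpse()` output (13, 9, and 5 persons aboard with 2, 8, and 5 fatalities). It also shows why the `Aboard > 0` filter matters.

```r
crashes <- data.frame(
  Aboard     = c(13, 9, 5),
  Fatalities = c(2, 8, 5),
  Year       = c(1940, 1940, 1940)
)

# Same formulas as the mutate() call, written with base R
crashes$fatality_rate  <- crashes$Fatalities / crashes$Aboard
crashes$survival_count <- crashes$Aboard - crashes$Fatalities
crashes$decade         <- floor(crashes$Year / 10) * 10

round(crashes$fatality_rate, 4)  # 0.1538 0.8889 1.0000

# The decade formula rounds any year down to the start of its decade
floor(1987 / 10) * 10            # 1980

# With nobody aboard the rate is undefined, hence the Aboard > 0 filter
0 / 0                            # NaN
```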

Checking the Engineered Variables

Before moving into the exploratory analysis, the newly created variables can be summarized to ensure that they were calculated as expected.

Code
aviation_features %>%
  select(Aboard, Fatalities, fatality_rate, survival_count, decade, fatal_accident) %>%
  summary()
     Aboard         Fatalities     fatality_rate    survival_count  
 Min.   :  1.00   Min.   :  0.00   Min.   :0.0000   Min.   :  0.00  
 1st Qu.:  4.00   1st Qu.:  3.00   1st Qu.:0.6667   1st Qu.:  0.00  
 Median : 11.00   Median :  6.00   Median :1.0000   Median :  0.00  
 Mean   : 26.68   Mean   : 16.41   Mean   :0.7926   Mean   : 10.27  
 3rd Qu.: 29.00   3rd Qu.: 19.00   3rd Qu.:1.0000   3rd Qu.:  4.00  
 Max.   :644.00   Max.   :583.00   Max.   :1.0000   Max.   :516.00  
     decade     fatal_accident 
 Min.   :1940   Min.   :0.000  
 1st Qu.:1960   1st Qu.:1.000  
 Median :1970   Median :1.000  
 Mean   :1972   Mean   :0.985  
 3rd Qu.:1990   3rd Qu.:1.000  
 Max.   :2000   Max.   :1.000  

The engineered variables appear to have been generated successfully. The summary statistics indicate that many accidents within the dataset resulted in high fatality rates, with the median fatality rate equaling 1.0, suggesting that at least half of the retained accidents resulted in no survivors. In addition, the fatal_accident indicator itself shows that the overwhelming majority of retained records involved at least one fatality. These results are not entirely unexpected, since more severe accidents are generally more likely to be documented historically than minor incidents.

Weather API Integration

The next stage of the workflow involves retrieving historical weather information for each aviation accident. Since the Open-Meteo Historical Weather API accepts latitude, longitude, and date values, the geocoded accident records can now be used to request weather information for the corresponding accident date and location. The selected weather variables include mean temperature, precipitation sum, and maximum wind speed.

Code
get_weather_data <- function(latitude, longitude, date) {
  
  request_url <- "https://archive-api.open-meteo.com/v1/archive"
  
  tryCatch({
    
    response <- request(request_url) %>%
      req_url_query(
        latitude = latitude,
        longitude = longitude,
        start_date = as.character(date),
        end_date = as.character(date),
        daily = "temperature_2m_mean,precipitation_sum,wind_speed_10m_max",
        timezone = "auto"
      ) %>%
      req_perform()
    
    weather_json <- response %>%
      resp_body_json()
    
    tibble(
      weather_date = weather_json$daily$time[[1]],
      temperature_mean = weather_json$daily$temperature_2m_mean[[1]],
      precipitation_sum = weather_json$daily$precipitation_sum[[1]],
      wind_speed_max = weather_json$daily$wind_speed_10m_max[[1]]
    )
    
  }, error = function(e) {
    
    tibble(
      weather_date = as.character(date),
      temperature_mean = NA_real_,
      precipitation_sum = NA_real_,
      wind_speed_max = NA_real_
    )
  })
}

Before applying the function to all records, it will first be tested on one row to ensure that the API returns the expected structure.

Code
test_weather <- get_weather_data(
  latitude = aviation_features$latitude[1],
  longitude = aviation_features$longitude[1],
  date = aviation_features$Date[1]
)

test_weather
# A tibble: 1 × 4
  weather_date temperature_mean precipitation_sum wind_speed_max
  <chr>                   <dbl>             <dbl>          <dbl>
1 1940-08-09               16.8                 0           23.9

The test API call returned a valid weather record for the selected accident date and location. This confirms that the function can retrieve the daily weather variables needed for the analysis.
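A single successful call does not exercise the case where one of the requested values is missing. The sketch below assumes that a JSON `null` arrives as `NULL` inside the parsed list, and uses a hypothetical helper, `first_or_na()`, to turn it into `NA` so that every row keeps the same four columns. The `daily` object is a hand-written mock of the parsed response, and its missing precipitation value is invented for illustration.

```r
# Hypothetical helper: first element of a parsed JSON array, or NA if absent or null
first_or_na <- function(x) {
  if (length(x) == 0 || is.null(x[[1]])) NA_real_ else as.numeric(x[[1]])
}

# Mock of the parsed `daily` object; the null precipitation value is invented
daily <- list(
  time = list("1940-08-09"),
  temperature_2m_mean = list(16.8),
  precipitation_sum = list(NULL),
  wind_speed_10m_max = list(23.9)
)

weather_row <- data.frame(
  weather_date = daily$time[[1]],
  temperature_mean = first_or_na(daily$temperature_2m_mean),
  precipitation_sum = first_or_na(daily$precipitation_sum),
  wind_speed_max = first_or_na(daily$wind_speed_10m_max)
)

weather_row
```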

Retrieving Weather Data for All Records

The function is then applied to the full geocoded dataset. Since this requires many API calls, this chunk should be treated as a one-time execution step. Once completed, the weather-enriched dataset will be saved and uploaded to GitHub so that the full Quarto document can later be rendered without repeating the API requests.

Code
aviation_weather <- aviation_features %>%
  mutate(row_id = row_number()) %>%
  mutate(
    weather_data = pmap(
      list(latitude, longitude, Date),
      get_weather_data
    )
  ) %>%
  unnest(weather_data)
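With a few thousand requests, occasional transient failures are likely. One possible safeguard, not used in the original run, is to retry a failed call a few times with a growing pause. The base R sketch below uses a hypothetical helper, `with_retry()`, and a stand-in function that fails twice before succeeding; no real API call is made.

```r
# Hypothetical helper: call f() up to max_tries times, pausing longer after each failure
with_retry <- function(f, max_tries = 3, base_wait = 0.1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    if (attempt < max_tries) {
      Sys.sleep(base_wait * 2^(attempt - 1))  # 0.1 s, then 0.2 s, ...
    }
  }
  stop("all ", max_tries, " attempts failed: ", conditionMessage(result))
}

# Stand-in for a flaky API call: fails twice, then succeeds
attempts <- 0
flaky_call <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) stop("temporary failure")
  "ok"
}

outcome <- with_retry(flaky_call)
outcome
```

Because `get_weather_data()` already converts errors into `NA` rows, a retry would need to wrap the request inside that function rather than the function itself. The httr2 package also provides `req_retry()` for this purpose.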

Importing the Weather-Enriched Data

Since retrieving the weather data for all records requires repeated API calls, the completed weather-enriched dataset is imported directly from GitHub for reproducibility purposes.

Code
weather_url <- "https://raw.githubusercontent.com/bkchanderban/CUNY_SPS/refs/heads/main/DATA607/DATA607/Final%20Project/aviation_weather_enriched.csv"

aviation_weather <- read.csv(weather_url)
Code
glimpse(aviation_weather)
Rows: 2,595
Columns: 25
$ Date              <chr> "1940-08-09", "1940-01-15", "1940-06-02", "1940-08-2…
$ Time              <chr> "", "", "", "", "", "", "", "17:48", "2:00", "11:50"…
$ Location          <chr> "Hannover, Germany", "Denpasar, Indonesia", "El Segu…
$ Operator          <chr> "Deutsche Lufthansa", "KNILM", "Douglas Aircraft Com…
$ Flight..          <chr> "", "", "", "", "", "", "", "21", "", "21", "", "", …
$ Route             <chr> "", "", "Test flight", "", "", "", "Rio de Janeiro -…
$ Type              <chr> "Douglas DC-2-115H", "Lockheed 14 Super Electra", "D…
$ Registration      <chr> "D-AIAV", "PK-AFO", "", "YR-PAF", "D-AAIH", "D-AVMF"…
$ cn.In             <chr> "1366", "1415", "", "1986", "1973", "10", "", "2175"…
$ Aboard            <int> 13, 9, 5, 18, 15, 29, 18, 16, 12, 16, 10, 15, 22, 22…
$ Fatalities        <int> 2, 8, 5, 14, 2, 29, 18, 10, 12, 9, 10, 15, 22, 22, 1…
$ Ground            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Summary           <chr> "Pilot error.", "", "Crashed and burned during a gov…
$ Year              <int> 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1941…
$ latitude          <dbl> 52.374478, -8.665335, 33.917028, 46.769379, 52.51738…
$ longitude         <dbl> 9.738553, 115.217619, -118.415634, 23.589954, 13.395…
$ fatality_rate     <dbl> 0.1538462, 0.8888889, 1.0000000, 0.7777778, 0.133333…
$ survival_count    <int> 11, 1, 0, 4, 13, 0, 0, 6, 0, 7, 0, 0, 0, 0, 0, 13, 0…
$ decade            <int> 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940…
$ fatal_accident    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ row_id            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ weather_date      <chr> "1940-08-09", "1940-01-15", "1940-06-02", "1940-08-2…
$ temperature_mean  <dbl> 16.8, 25.3, 19.4, 15.8, -0.8, 3.9, 21.8, -3.0, -11.9…
$ precipitation_sum <dbl> 0.0, 29.3, 0.0, 0.3, 0.0, 4.6, 0.4, 0.7, 0.5, 4.9, 0…
$ wind_speed_max    <dbl> 23.9, 16.7, 17.7, 16.0, 19.1, 22.9, 9.5, 26.4, 26.5,…
Code
head(aviation_weather, 5)
        Date Time               Location                 Operator Flight..
1 1940-08-09           Hannover, Germany       Deutsche Lufthansa         
2 1940-01-15         Denpasar, Indonesia                    KNILM         
3 1940-06-02      El Segundo, California Douglas Aircraft Company         
4 1940-08-23               Cluj, Romania                    LARES         
5 1940-10-29             Berlin, Germany       Deutsche Lufthansa         
        Route                      Type Registration cn.In Aboard Fatalities
1                     Douglas DC-2-115H       D-AIAV  1366     13          2
2             Lockheed 14 Super Electra       PK-AFO  1415      9          8
3 Test flight              Douglas DC-3                         5          5
4                          Douglas DC-3       YR-PAF  1986     18         14
5                          Douglas DC-3       D-AAIH  1973     15          2
  Ground                                              Summary Year  latitude
1      0                                         Pilot error. 1940 52.374478
2      0                                                      1940 -8.665335
3      0   Crashed and burned during a government test flight 1940 33.917028
4      0 Crashed into a mountainous area during a hail storm. 1940 46.769379
5      0                                     Weather related. 1940 52.517389
    longitude fatality_rate survival_count decade fatal_accident row_id
1    9.738553     0.1538462             11   1940              1      1
2  115.217619     0.8888889              1   1940              1      2
3 -118.415634     1.0000000              0   1940              1      3
4   23.589954     0.7777778              4   1940              1      4
5   13.395131     0.1333333             13   1940              1      5
  weather_date temperature_mean precipitation_sum wind_speed_max
1   1940-08-09             16.8               0.0           23.9
2   1940-01-15             25.3              29.3           16.7
3   1940-06-02             19.4               0.0           17.7
4   1940-08-23             15.8               0.3           16.0
5   1940-10-29             -0.8               0.0           19.1

The API retrieval successfully added historical weather variables to the aviation accident records, including mean temperature, precipitation sum, and maximum wind speed. These variables will now be used alongside the engineered aviation severity measures for exploratory analysis and regression modeling.

Exploratory Analysis

Before constructing a regression model, exploratory analysis will first be used to examine general patterns within the aviation accident data. Particular attention will be given to accident severity over time, as well as the potential relationship between weather conditions and fatality outcomes.

Fatality Rate Over Time

The following visualization examines how fatality rates have varied across the years represented within the dataset.

Code
ggplot(data = aviation_weather, aes(x = Year, y = fatality_rate)) +
  geom_jitter(alpha = 0.15, width = 0.5, height = 0.02) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Fatality Rate across Aviation Incidents Over Time",
    x = "Year",
    y = "Fatality Rate"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

The visualization indicates that many aviation accidents within the dataset resulted in very high fatality rates, with a clear concentration of observations at or near a fatality rate of 1.0. This suggests that the retained dataset is dominated by severe accidents. At the same time, the fitted trend line shows a slight downward pattern over time, which may point to gradual improvements in aviation safety, aircraft technology, and operational practices. However, the wide spread of observations also shows that accident severity continues to vary substantially across individual cases.

Wind Speed and Fatality Rate

The following visualization examines the relationship between maximum wind speed and accident fatality rate.

Code
ggplot(data = aviation_weather, aes(x = wind_speed_max, y = fatality_rate)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Wind Speed and Aviation Accident Fatality Rate",
    x = "Maximum Wind Speed (km/h)",
    y = "Fatality Rate"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

The visualization shows substantial scatter in fatality rate at every level of wind speed, indicating that wind conditions alone do not explain accident severity. The fitted trend line suggests a slight positive relationship, in which higher wind speeds may be associated with somewhat more severe crash outcomes on average, but the broad spread of observations indicates that severity is likely associated with several factors rather than wind conditions alone.

Regression Analysis

After completing the exploratory analysis, a linear regression model will be constructed in order to examine whether the selected aviation and weather-related variables appear associated with aviation accident fatality rate. The response variable for the model is fatality_rate, while the explanatory variables include persons aboard, year, mean temperature, precipitation, and maximum wind speed.

Code
fatality_model <- lm(fatality_rate ~ Aboard + Year + temperature_mean + precipitation_sum + wind_speed_max, data = aviation_weather)

summary(fatality_model)

Call:
lm(formula = fatality_rate ~ Aboard + Year + temperature_mean + 
    precipitation_sum + wind_speed_max, data = aviation_weather)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8569 -0.1309  0.1477  0.1780  1.6100 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.3169853  0.6581659   2.001   0.0455 *  
Aboard            -0.0024308  0.0001411 -17.226   <2e-16 ***
Year              -0.0002299  0.0003327  -0.691   0.4896    
temperature_mean  -0.0013051  0.0005492  -2.376   0.0176 *  
precipitation_sum  0.0002704  0.0006015   0.450   0.6531    
wind_speed_max     0.0007219  0.0007407   0.975   0.3298    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3126 on 2589 degrees of freedom
Multiple R-squared:  0.1095,    Adjusted R-squared:  0.1078 
F-statistic: 63.65 on 5 and 2589 DF,  p-value: < 2.2e-16

Interpreting the Regression Results

The regression model produced a statistically significant overall result (F-statistic p-value < 2.2e-16), suggesting that the selected explanatory variables collectively exhibit some relationship with aviation accident fatality rate. However, the adjusted R-squared value of approximately 0.108 indicates that the model explains only a modest portion of the variability in fatality outcomes. This suggests that aviation accident severity is likely associated with additional operational, mechanical, environmental, and human-related factors not captured within the present analysis.
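The adjusted R-squared can be reproduced from the other figures in the model summary, which makes the size of the adjustment explicit. The sketch below plugs in the reported multiple R-squared (0.1095), the 5 predictors, and the sample size implied by the residual degrees of freedom (2589 + 6 estimated coefficients = 2595).

```r
r_squared <- 0.1095  # Multiple R-squared from the summary
n <- 2595            # observations used in the fit
p <- 5               # number of predictors, excluding the intercept

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
round(adj_r_squared, 4)  # 0.1078
```

With nearly 2,600 observations and only five predictors the penalty is tiny, so the adjusted and unadjusted values tell the same story.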

Among the explanatory variables, the number of persons aboard displayed a statistically significant negative relationship with fatality rate (p < 2e-16): each additional person aboard was associated with an estimated decrease of about 0.0024 in fatality rate, holding the other variables constant. However, this relationship should be interpreted cautiously, since the number of persons aboard is also the denominator of the fatality rate. As such, part of the observed relationship may reflect the mathematical structure of the response variable rather than an independent operational relationship. Nevertheless, the result may still suggest that accidents involving aircraft carrying more people can exhibit somewhat lower proportional fatality outcomes on average.
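The caution about the persons aboard coefficient can be illustrated with simulated data. In the sketch below, the number of deaths is drawn without any reference to aircraft size and then capped at the number aboard, yet regressing the resulting rate on persons aboard still yields a negative slope. The distributions, sample size, and seed are arbitrary assumptions chosen for illustration; this is not the accident data.

```r
set.seed(607)  # arbitrary seed so the simulation is repeatable

n <- 2000
aboard <- sample(1:300, n, replace = TRUE)

# Deaths are drawn independently of aircraft size, then capped at persons aboard
fatalities <- pmin(aboard, rpois(n, lambda = 10))
rate <- fatalities / aboard

# The ratio is regressed on its own denominator
sim_fit <- lm(rate ~ aboard)
coef(sim_fit)[["aboard"]]
```

The negative slope arises only because `aboard` sits in the denominator of `rate`, which is the kind of built-in dependence that makes the real coefficient hard to interpret on its own.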

Mean temperature also exhibited a statistically significant negative relationship with fatality rate (p = 0.0176), although the magnitude was small: an estimated decrease of about 0.0013 in fatality rate per additional degree Celsius (the Open-Meteo default unit). This may suggest that colder environmental conditions are modestly associated with more severe crash outcomes, though the relationship should still be interpreted cautiously.

In contrast, year (p = 0.4896), precipitation (p = 0.6531), and maximum wind speed (p = 0.3298) did not reach statistical significance within the fitted model. While the exploratory visualization suggested a slight positive relationship between wind speed and accident severity, the regression analysis indicates that this relationship cannot be distinguished from zero once the other variables are considered simultaneously. This highlights the complexity of aviation accidents, where severity is likely associated with numerous interacting factors rather than any single variable alone.

Model Diagnostics

After fitting the regression model, diagnostic plots were examined in order to assess whether the model residuals exhibited any major violations of linear regression assumptions. Particular attention was given to the distribution of residuals and the relationship between fitted values and residual spread.

Code
par(mfrow = c(1, 2))

plot(
  fatality_model,
  which = 1
)

plot(
  fatality_model,
  which = 2
)

Code
par(mfrow = c(1, 1))

The diagnostic plots suggest that the regression model captures some structure in the data, but several limitations remain. The residuals versus fitted plot shows clustering and uneven spread, while the Q-Q plot shows departures from normality, especially in the tails. The maximum residual of 1.61 reported in the model summary also implies that at least one fitted value lies below zero, which cannot occur for a rate bounded between 0 and 1. Together, these patterns suggest that a straight-line model is an imperfect fit for the relationship between the selected variables and fatality rate, which is expected given the complexity of aviation accident severity.
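One simple check that complements the plots is to look at the range of the fitted values, since a fatality rate cannot fall outside 0 to 1. The sketch below uses a small deterministic toy dataset (not the accident data) in which the rate falls off sharply as the number aboard grows, and shows a straight-line fit predicting a negative rate for the largest aircraft.

```r
# Toy data: rate is 1 for up to 10 aboard, then declines as 10 / aboard
toy <- data.frame(aboard = 1:600)
toy$rate <- pmin(1, 10 / toy$aboard)

toy_fit <- lm(rate ~ aboard, data = toy)

# Fitted values from a linear model are not constrained to the 0-1 interval
range(fitted(toy_fit))
sum(fitted(toy_fit) < 0)  # number of impossible (negative) predictions
```

Applied to the model in this report, the same idea would be `range(fitted(fatality_model))`; the residual summary in the regression output (maximum 1.61) already indicates that the lower end of that range is negative.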

Conclusion

This project examined historical aviation accident data in order to explore whether variables such as persons aboard, year, and weather conditions appeared associated with accident fatality severity. Through the integration of a structured aviation crash dataset, geocoding techniques, and API-based historical weather information, the analysis demonstrated how multiple data sources can be combined within a reproducible workflow.

The exploratory analysis and regression modeling showed that aviation accident severity varies substantially and is only partly explained by the selected variables. Only persons aboard and mean temperature displayed statistically significant relationships, with the persons aboard result requiring additional caution because of how fatality rate was constructed. As such, the findings reinforce the idea that aviation accident severity is associated with numerous interacting operational, environmental, mechanical, and human-related factors rather than any single contributing condition.

References

Grandi, S. (n.d.). Airplane crashes since 1908 [Data set]. Kaggle. https://www.kaggle.com/datasets/saurograndi/airplane-crashes-since-1908

Open-Meteo. (n.d.). Historical weather API. https://open-meteo.com/en/docs/historical-weather-api

Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science (2nd ed.). O’Reilly Media. https://r4ds.hadley.nz/

LLM Used