This project examines historical aviation accident data to explore factors associated with the severity of crash outcomes. The analysis combines a structured Kaggle aviation accident dataset with historical weather data retrieved through the Open-Meteo Historical Weather API. Since the accident dataset stores locations as free-text rather than geographic coordinates, the workflow involved narrowing the dataset to clearer location records, applying geocoding to obtain latitude and longitude values, and using those coordinates to retrieve weather variables such as mean temperature, precipitation, and maximum wind speed.
The analysis then used feature engineering, exploratory visualization, and linear regression modeling to examine relationships between fatality rate and selected variables, including persons aboard, year, and weather conditions. The results showed that fatality rates were generally high within the retained dataset, although a slight downward trend appeared over time. The regression model was statistically significant, but its adjusted R-squared value suggested that it explained only a modest portion of variation in fatality rate. Persons aboard and mean temperature were statistically significant within the model, though the persons aboard result should be interpreted cautiously because fatality rate was calculated using the persons aboard variable.
As such, the findings suggest that aviation accident severity is associated with multiple interacting factors beyond the limited variables included in this analysis. The project also demonstrates a reproducible workflow through Quarto and RStudio, including data cleaning, API integration, regression analysis, and presentation generation.
This project examines historical aviation accident data to explore factors that may be associated with the severity of crash outcomes. The topic was selected due to the increased public attention surrounding aviation crashes, near misses, and other safety-related incidents. While this attention may reflect increased reporting rather than an actual rise in incidents, it still provides a useful motivation for examining historical crash data more closely.
The analysis uses two different data source types: a structured Kaggle CSV containing aviation accident records and API-based historical weather data from Open-Meteo. Since the Kaggle dataset stores locations as free-text entries rather than coordinates, the workflow will involve narrowing the dataset to usable records, applying geocoding, retrieving weather variables, and then using exploratory analysis and regression modeling to examine patterns in accident severity.
The first step involves loading the packages needed for data cleaning, geocoding, API access, visualization, and regression analysis.
library(tidyverse)
library(lubridate)
library(httr2)
library(jsonlite)
library(tidygeocoder)
library(broom)
library(scales)
The aviation accident dataset was first downloaded from Kaggle and then uploaded to a personal GitHub repository. This allows the dataset to be imported directly through the raw GitHub URL, making the workflow more reproducible and avoiding references to any local file paths.
aviation_url <- "https://raw.githubusercontent.com/bkchanderban/CUNY_SPS/refs/heads/main/DATA607/DATA607/Final%20Project/airplane_crashes_and_fatalities_since_1908.csv"
aviation_raw <- read.csv(aviation_url)
glimpse(aviation_raw)
Rows: 5,268
Columns: 13
$ Date <chr> "09/17/1908", "07/12/1912", "08/06/1913", "09/09/1913", "…
$ Time <chr> "17:18", "6:30", "", "18:30", "10:30", "1:00", "15:20", "…
$ Location <chr> "Fort Myer, Virginia", "AtlantiCity, New Jersey", "Victor…
$ Operator <chr> "Military - U.S. Army", "Military - U.S. Navy", "Private"…
$ Flight.. <chr> "", "", "-", "", "", "", "", "", "", "", "", "", "", "", …
$ Route <chr> "Demonstration", "Test flight", "", "", "", "", "", "", "…
$ Type <chr> "Wright Flyer III", "Dirigible", "Curtiss seaplane", "Zep…
$ Registration <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "…
$ cn.In <chr> "1", "", "", "", "", "", "", "", "", "", "", "", "", "", …
$ Aboard <int> 2, 5, 1, 20, 30, 41, 19, 20, 22, 19, 28, 20, 20, 23, 21, …
$ Fatalities <int> 1, 5, 1, 14, 30, 21, 19, 20, 22, 19, 27, 20, 20, 23, 21, …
$ Ground <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Summary <chr> "During a demonstration flight, a U.S. Army flyer flown b…
At this stage, the dataset remains in its original imported form. The next step will involve inspecting the variable names and cleaning the date, location, and numeric fields for the analysis.
Before our analysis can be executed, the structure of the dataset must first be inspected in order to identify the variables required for the project’s scope. Particular attention will be given to the date, location, persons aboard, and fatality-related fields, since these variables will later be used for feature engineering, geocoding, weather integration, and regression analysis.
names(aviation_raw)
 [1] "Date"         "Time"         "Location"     "Operator"     "Flight.."
 [6] "Route"        "Type"         "Registration" "cn.In"        "Aboard"
[11] "Fatalities"   "Ground"       "Summary"
summary(aviation_raw)
     Date               Time             Location           Operator
 Length:5268        Length:5268        Length:5268        Length:5268
 Class :character   Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character   Mode  :character
   Flight..            Route               Type           Registration
 Length:5268        Length:5268        Length:5268        Length:5268
 Class :character   Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character   Mode  :character
    cn.In               Aboard         Fatalities         Ground
 Length:5268        Min.   :  0.00   Min.   :  0.00   Min.   :   0.000
 Class :character   1st Qu.:  5.00   1st Qu.:  3.00   1st Qu.:   0.000
 Mode  :character   Median : 13.00   Median :  9.00   Median :   0.000
                    Mean   : 27.55   Mean   : 20.07   Mean   :   1.609
                    3rd Qu.: 30.00   3rd Qu.: 23.00   3rd Qu.:   0.000
                    Max.   :644.00   Max.   :583.00   Max.   :2750.000
                    NA's   :22       NA's   :12       NA's   :22
   Summary
 Length:5268
 Class :character
 Mode  :character
Before beginning the exploratory analysis, several variables must first be cleaned and transformed into formats more suitable for analysis. In particular, the date field must be converted into a usable date format, while additional variables such as year and fatality rate will later be derived through feature engineering.
aviation_clean <- aviation_raw %>%
  # Convert date column into date format
  mutate(
    Date = mdy(Date),
    # Create Year variable
    Year = year(Date)
  ) %>%
  # Restrict data to years with more reliable weather coverage
  filter(Year >= 1940)
glimpse(aviation_clean)
Rows: 4,739
Columns: 14
$ Date <date> 1940-08-09, 1941-06-03, 1940-01-15, 1940-03-01, 1940-04-…
$ Time <chr> "", "17:00", "", "", "", "", "14:00", "", "", "10:15", ""…
$ Location <chr> "Hannover, Germany", "AtlantiOcean", "Denpasar, Indonesia…
$ Operator <chr> "Deutsche Lufthansa", "Great Western and Southern Air Lin…
$ Flight.. <chr> "", "", "", "", "", "", "", "", "", "", "", "19", "", "",…
$ Route <chr> "", "", "", "Jask to Sharjah", "Perth, Scotland - London,…
$ Type <chr> "Douglas DC-2-115H", "de Havilland DH-84 Dragon", "Lockhe…
$ Registration <chr> "D-AIAV", "G-ACPY", "PK-AFO", "G-AAGX", "G-AFKD", "", "OH…
$ cn.In <chr> "1366", "6076", "1415", "HP42/1", "1484", "", "5494", "22…
$ Aboard <int> 13, 6, 9, 8, 3, 5, 9, 1, NA, 10, 18, 25, 14, 15, 10, 29, …
$ Fatalities <int> 2, 6, 8, 8, 3, 5, 9, 1, NA, 10, 14, 25, 9, 2, 10, 29, 18,…
$ Ground <int> 0, 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Summary <chr> "Pilot error.", "Shot down by a He-111 German military ai…
$ Year <dbl> 1940, 1941, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 194…
At this point, the dataset has been narrowed to records occurring from 1940 onward, since the historical weather API used downstream in the analysis provides more reliable coverage for this period. In addition, a separate year variable has been extracted from the original date field to support later trend analysis and regression modeling.
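As a rough illustration of the conversion and filter described above, the following base R sketch parses a few date strings in the dataset's month/day/year format and applies the same 1940 cutoff. It uses `as.Date()` rather than lubridate's `mdy()` only so that it runs without extra packages; the three input strings are illustrative values, not a sample drawn by the original analysis.

```r
# Illustrative inputs: two well-formed dates and one empty string
raw_dates <- c("09/17/1908", "08/09/1940", "")

# Parse month/day/year text; entries that cannot be parsed become NA
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Extract the year as an integer (NA stays NA)
years <- as.integer(format(parsed, "%Y"))

# Same cutoff as the analysis; rows with an NA year are dropped as well
keep <- !is.na(years) & years >= 1940

data.frame(raw_dates, parsed, years, keep)
```

The explicit `!is.na()` check mirrors what `dplyr::filter()` does implicitly, since `filter()` discards rows where the condition evaluates to `NA`.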
Since the weather API requires latitude and longitude coordinates, the location field must first be cleaned before geocoding can be performed. Several records contain vague or incomplete entries such as references to oceans, gulfs, or airborne locations, which are unlikely to return reliable coordinates. As such, part of the preparation process will involve removing overly ambiguous locations and standardizing the remaining entries where possible.
aviation_locations <- aviation_clean %>%
  # Remove rows with missing locations
  filter(!is.na(Location)) %>%
  # Remove vague or unusable location entries
  filter(
    !str_detect(Location, regex(
      "Ocean|Sea|Gulf|River|Unknown|Near|Off",
      ignore_case = TRUE
    ))
  ) %>%
  # Standardize spacing
  mutate(
    Location = str_squish(Location)
  )
# Preview cleaned locations
aviation_locations %>%
  select(Location) %>%
  slice_head(n = 10)
                 Location
1       Hannover, Germany
2     Denpasar, Indonesia
3  El Segundo, California
4           Cluj, Romania
5         Berlin, Germany
6         Brauna, Germany
7  Rio de Janeiro, Brazil
8       Chicago, Illinois
9   Armstrong, ON, Canada
10       Atlanta, Georgia
Now, the dataset has been narrowed to records containing more usable geographic information. While some location inconsistencies may still remain, this cleaning step helps improve the likelihood of obtaining successful coordinate matches during the geocoding process.
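One side effect of the keyword filter is worth illustrating: because the pattern matches substrings, a term such as "Sea" also matches "Seattle", and "Off" also matches "Offenbach". The sketch below compares the original pattern with a word-bounded variant on a handful of made-up location strings. The word-bounded pattern is an assumption introduced here for comparison, not what the analysis used, and adopting it would change the row counts reported in the rest of this document. Base R `grepl()` stands in for `str_detect()` so that the example runs without extra packages.

```r
# Made-up location strings chosen to show the difference
locations <- c(
  "Seattle, Washington",
  "North Sea",
  "Offenbach, Germany",
  "Off Cape Cod, Massachusetts",
  "Near Riverside, California"
)

# Pattern used in the analysis: matches the keywords anywhere in the string
broad <- "Ocean|Sea|Gulf|River|Unknown|Near|Off"

# Hypothetical variant: keywords must appear as whole words
bounded <- "\\b(Ocean|Sea|Gulf|River|Unknown|Near|Off)\\b"

dropped_broad   <- grepl(broad, locations, ignore.case = TRUE)
dropped_bounded <- grepl(bounded, locations, ignore.case = TRUE, perl = TRUE)

data.frame(locations, dropped_broad, dropped_bounded)
```

In this toy set, the substring pattern drops every row, while the word-bounded pattern keeps the Seattle and Offenbach entries and still drops the three entries that contain a keyword as a whole word.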
After cleaning the location field, the next step involves converting the remaining location entries into geographic coordinates. This process, known as geocoding, allows latitude and longitude values to be assigned to each accident record so that historical weather data can later be retrieved through the Open-Meteo API.
aviation_geocoded <- aviation_locations %>%
  # Retrieve latitude and longitude coordinates
  geocode(
    address = Location,
    method = "osm",
    lat = latitude,
    long = longitude
  )
Since the geocoding process can take a significant amount of time to complete, the completed geocoded dataset was saved and uploaded to GitHub. The geocoding chunk above is therefore retained to show the method, but is not evaluated during rendering.
The saved geocoded dataset is imported from GitHub so that the remaining analysis can be reproduced without repeatedly sending requests to the geocoding service.
geocoded_url <- "https://raw.githubusercontent.com/bkchanderban/CUNY_SPS/refs/heads/main/DATA607/DATA607/Final%20Project/aviation_geocoded.csv"
aviation_geocoded <- read.csv(geocoded_url)
aviation_geocoded %>%
  select(Location, latitude, longitude) %>%
  slice_head(n = 10)
                 Location   latitude   longitude
1       Hannover, Germany  52.374478    9.738553
2     Denpasar, Indonesia  -8.665335  115.217619
3  El Segundo, California  33.917028 -118.415634
4           Cluj, Romania  46.769379   23.589954
5         Berlin, Germany  52.517389   13.395131
6         Brauna, Germany  51.281811   14.037996
7  Rio de Janeiro, Brazil -22.911014  -43.209373
8       Chicago, Illinois  41.875562  -87.624421
9   Armstrong, ON, Canada  50.302131  -89.037370
10       Atlanta, Georgia  33.754466  -84.389815
nrow(aviation_geocoded)
[1] 3128
Most locations were geocoded successfully, but 525 of the 3,128 records (about 17%) did not return usable latitude or longitude values due to ambiguous or incomplete location descriptions. Since geographic coordinates are required for retrieving historical weather data, these records will be excluded from the weather integration stage.
aviation_geocoded <- aviation_geocoded %>%
  filter(
    !is.na(latitude),
    !is.na(longitude)
  )
nrow(aviation_geocoded)
[1] 2603
After narrowing the dataset to records with usable coordinates, additional variables will be created to support the analysis. These include fatality rate, survival count, decade, and a binary fatal accident indicator. These variables help translate the original crash records into measures that can be used for exploratory analysis and regression modeling.
aviation_features <- aviation_geocoded %>%
  mutate(
    fatality_rate = Fatalities / Aboard,
    survival_count = Aboard - Fatalities,
    decade = floor(Year / 10) * 10,
    fatal_accident = if_else(Fatalities > 0, 1, 0)
  ) %>%
  filter(
    !is.na(Aboard),
    !is.na(Fatalities),
    Aboard > 0,
    fatality_rate >= 0,
    fatality_rate <= 1
  )
glimpse(aviation_features)
Rows: 2,595
Columns: 20
$ Date <chr> "1940-08-09", "1940-01-15", "1940-06-02", "1940-08-23",…
$ Time <chr> "", "", "", "", "", "", "", "17:48", "2:00", "11:50", "…
$ Location <chr> "Hannover, Germany", "Denpasar, Indonesia", "El Segundo…
$ Operator <chr> "Deutsche Lufthansa", "KNILM", "Douglas Aircraft Compan…
$ Flight.. <chr> "", "", "", "", "", "", "", "21", "", "21", "", "", "",…
$ Route <chr> "", "", "Test flight", "", "", "", "Rio de Janeiro - Sa…
$ Type <chr> "Douglas DC-2-115H", "Lockheed 14 Super Electra", "Doug…
$ Registration <chr> "D-AIAV", "PK-AFO", "", "YR-PAF", "D-AAIH", "D-AVMF", "…
$ cn.In <chr> "1366", "1415", "", "1986", "1973", "10", "", "2175", "…
$ Aboard <int> 13, 9, 5, 18, 15, 29, 18, 16, 12, 16, 10, 15, 22, 22, 1…
$ Fatalities <int> 2, 8, 5, 14, 2, 29, 18, 10, 12, 9, 10, 15, 22, 22, 10, …
$ Ground <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Summary <chr> "Pilot error.", "", "Crashed and burned during a govern…
$ Year <int> 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1941, 1…
$ latitude <dbl> 52.374478, -8.665335, 33.917028, 46.769379, 52.517389, …
$ longitude <dbl> 9.738553, 115.217619, -118.415634, 23.589954, 13.395131…
$ fatality_rate <dbl> 0.1538462, 0.8888889, 1.0000000, 0.7777778, 0.1333333, …
$ survival_count <int> 11, 1, 0, 4, 13, 0, 0, 6, 0, 7, 0, 0, 0, 0, 0, 13, 0, 0…
$ decade <dbl> 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1…
$ fatal_accident <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
Following the feature engineering process, the dataset contains several derived variables that will be used throughout the remainder of the analysis. The fatality rate is particularly important because it allows accident severity to be examined relative to the number of persons aboard rather than only through raw fatality counts.
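To show why the validity filters accompany the fatality rate calculation, here is a minimal base R sketch on four made-up records. The toy values are assumptions chosen to include a zero and a missing `Aboard` count; the calculations follow the same formulas as the feature engineering step.

```r
# Made-up records: two ordinary rows, one with Aboard = 0, one with Aboard missing
toy <- data.frame(
  Aboard     = c(13, 9, 0, NA),
  Fatalities = c(2, 8, 0, 3),
  Year       = c(1940, 1957, 1983, 2001)
)

# Same formulas as the feature engineering step
toy$fatality_rate <- toy$Fatalities / toy$Aboard   # 0 / 0 gives NaN, x / NA gives NA
toy$decade <- floor(toy$Year / 10) * 10

# Same validity conditions as the filter step
keep <- !is.na(toy$Aboard) &
  !is.na(toy$Fatalities) &
  toy$Aboard > 0 &
  toy$fatality_rate >= 0 &
  toy$fatality_rate <= 1

toy[keep, ]
```

Only the first two rows survive: the zero-aboard row yields an undefined rate, and the row with a missing count cannot be evaluated at all.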
Before moving into the exploratory analysis, the newly created variables can be summarized to ensure that they were calculated as expected.
aviation_features %>%
  select(Aboard, Fatalities, fatality_rate, survival_count, decade, fatal_accident) %>%
  summary()
     Aboard         Fatalities     fatality_rate    survival_count
 Min.   :  1.00   Min.   :  0.00   Min.   :0.0000   Min.   :  0.00
 1st Qu.:  4.00   1st Qu.:  3.00   1st Qu.:0.6667   1st Qu.:  0.00
 Median : 11.00   Median :  6.00   Median :1.0000   Median :  0.00
 Mean   : 26.68   Mean   : 16.41   Mean   :0.7926   Mean   : 10.27
 3rd Qu.: 29.00   3rd Qu.: 19.00   3rd Qu.:1.0000   3rd Qu.:  4.00
 Max.   :644.00   Max.   :583.00   Max.   :1.0000   Max.   :516.00
     decade     fatal_accident
 Min.   :1940   Min.   :0.000
 1st Qu.:1960   1st Qu.:1.000
 Median :1970   Median :1.000
 Mean   :1972   Mean   :0.985
 3rd Qu.:1990   3rd Qu.:1.000
 Max.   :2000   Max.   :1.000
The engineered variables appear to have been generated successfully. The summary statistics indicate that many accidents within the dataset resulted in high fatality rates: the median fatality rate is 1.0, which means that at least half of the retained accidents left no survivors. In addition, the fatal_accident indicator has a mean of 0.985, so about 98.5% of retained records involved at least one fatality. These results are not entirely unexpected, since more severe accidents are generally more likely to be documented historically than minor incidents.
The next stage of the workflow involves retrieving historical weather information for each aviation accident. Since the Open-Meteo Historical Weather API accepts latitude, longitude, and date values, the geocoded accident records can now be used to request weather information for the corresponding accident date and location. The selected weather variables include mean temperature, precipitation sum, and maximum wind speed.
get_weather_data <- function(latitude, longitude, date) {
  request_url <- "https://archive-api.open-meteo.com/v1/archive"
  tryCatch({
    # Request one day of daily weather values for the given coordinates
    response <- request(request_url) %>%
      req_url_query(
        latitude = latitude,
        longitude = longitude,
        start_date = as.character(date),
        end_date = as.character(date),
        daily = "temperature_2m_mean,precipitation_sum,wind_speed_10m_max",
        timezone = "auto"
      ) %>%
      req_perform()
    weather_json <- response %>%
      resp_body_json()
    tibble(
      weather_date = weather_json$daily$time[[1]],
      temperature_mean = weather_json$daily$temperature_2m_mean[[1]],
      precipitation_sum = weather_json$daily$precipitation_sum[[1]],
      wind_speed_max = weather_json$daily$wind_speed_10m_max[[1]]
    )
  }, error = function(e) {
    # Return a row of missing values if the request or parsing fails
    tibble(
      weather_date = as.character(date),
      temperature_mean = NA_real_,
      precipitation_sum = NA_real_,
      wind_speed_max = NA_real_
    )
  })
}
Before applying the function to all records, it will first be tested on one row to ensure that the API returns the expected structure.
test_weather <- get_weather_data(
  latitude = aviation_features$latitude[1],
  longitude = aviation_features$longitude[1],
  date = aviation_features$Date[1]
)
test_weather
# A tibble: 1 × 4
  weather_date temperature_mean precipitation_sum wind_speed_max
  <chr>                   <dbl>             <dbl>          <dbl>
1 1940-08-09               16.8                 0           23.9
The test API call returned a valid weather record for the selected accident date and location. This confirms that the function can retrieve the daily weather variables needed for the analysis.
The function is then applied to the full geocoded dataset. Since this requires many API calls, this chunk should be treated as a one-time execution step. Once completed, the weather-enriched dataset will be saved and uploaded to GitHub so that the full Quarto document can later be rendered without repeating the API requests.
aviation_weather <- aviation_features %>%
  mutate(row_id = row_number()) %>%
  mutate(
    weather_data = pmap(
      list(latitude, longitude, Date),
      get_weather_data
    )
  ) %>%
  unnest(weather_data)
Since retrieving the weather data for all records requires repeated API calls, the completed weather-enriched dataset is imported directly from GitHub for reproducibility purposes.
weather_url <- "https://raw.githubusercontent.com/bkchanderban/CUNY_SPS/refs/heads/main/DATA607/DATA607/Final%20Project/aviation_weather_enriched.csv"
aviation_weather <- read.csv(weather_url)
glimpse(aviation_weather)
Rows: 2,595
Columns: 25
$ Date <chr> "1940-08-09", "1940-01-15", "1940-06-02", "1940-08-2…
$ Time <chr> "", "", "", "", "", "", "", "17:48", "2:00", "11:50"…
$ Location <chr> "Hannover, Germany", "Denpasar, Indonesia", "El Segu…
$ Operator <chr> "Deutsche Lufthansa", "KNILM", "Douglas Aircraft Com…
$ Flight.. <chr> "", "", "", "", "", "", "", "21", "", "21", "", "", …
$ Route <chr> "", "", "Test flight", "", "", "", "Rio de Janeiro -…
$ Type <chr> "Douglas DC-2-115H", "Lockheed 14 Super Electra", "D…
$ Registration <chr> "D-AIAV", "PK-AFO", "", "YR-PAF", "D-AAIH", "D-AVMF"…
$ cn.In <chr> "1366", "1415", "", "1986", "1973", "10", "", "2175"…
$ Aboard <int> 13, 9, 5, 18, 15, 29, 18, 16, 12, 16, 10, 15, 22, 22…
$ Fatalities <int> 2, 8, 5, 14, 2, 29, 18, 10, 12, 9, 10, 15, 22, 22, 1…
$ Ground <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Summary <chr> "Pilot error.", "", "Crashed and burned during a gov…
$ Year <int> 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1941…
$ latitude <dbl> 52.374478, -8.665335, 33.917028, 46.769379, 52.51738…
$ longitude <dbl> 9.738553, 115.217619, -118.415634, 23.589954, 13.395…
$ fatality_rate <dbl> 0.1538462, 0.8888889, 1.0000000, 0.7777778, 0.133333…
$ survival_count <int> 11, 1, 0, 4, 13, 0, 0, 6, 0, 7, 0, 0, 0, 0, 0, 13, 0…
$ decade <int> 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940…
$ fatal_accident <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ row_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ weather_date <chr> "1940-08-09", "1940-01-15", "1940-06-02", "1940-08-2…
$ temperature_mean <dbl> 16.8, 25.3, 19.4, 15.8, -0.8, 3.9, 21.8, -3.0, -11.9…
$ precipitation_sum <dbl> 0.0, 29.3, 0.0, 0.3, 0.0, 4.6, 0.4, 0.7, 0.5, 4.9, 0…
$ wind_speed_max <dbl> 23.9, 16.7, 17.7, 16.0, 19.1, 22.9, 9.5, 26.4, 26.5,…
head(aviation_weather, 5)
 Date Time Location Operator Flight..
1 1940-08-09 Hannover, Germany Deutsche Lufthansa
2 1940-01-15 Denpasar, Indonesia KNILM
3 1940-06-02 El Segundo, California Douglas Aircraft Company
4 1940-08-23 Cluj, Romania LARES
5 1940-10-29 Berlin, Germany Deutsche Lufthansa
Route Type Registration cn.In Aboard Fatalities
1 Douglas DC-2-115H D-AIAV 1366 13 2
2 Lockheed 14 Super Electra PK-AFO 1415 9 8
3 Test flight Douglas DC-3 5 5
4 Douglas DC-3 YR-PAF 1986 18 14
5 Douglas DC-3 D-AAIH 1973 15 2
Ground Summary Year latitude
1 0 Pilot error. 1940 52.374478
2 0 1940 -8.665335
3 0 Crashed and burned during a government test flight 1940 33.917028
4 0 Crashed into a mountainous area during a hail storm. 1940 46.769379
5 0 Weather related. 1940 52.517389
longitude fatality_rate survival_count decade fatal_accident row_id
1 9.738553 0.1538462 11 1940 1 1
2 115.217619 0.8888889 1 1940 1 2
3 -118.415634 1.0000000 0 1940 1 3
4 23.589954 0.7777778 4 1940 1 4
5 13.395131 0.1333333 13 1940 1 5
weather_date temperature_mean precipitation_sum wind_speed_max
1 1940-08-09 16.8 0.0 23.9
2 1940-01-15 25.3 29.3 16.7
3 1940-06-02 19.4 0.0 17.7
4 1940-08-23 15.8 0.3 16.0
5 1940-10-29 -0.8 0.0 19.1
The API retrieval successfully added historical weather variables to the aviation accident records, including mean temperature, precipitation sum, and maximum wind speed. These variables will now be used alongside the engineered aviation severity measures for exploratory analysis and regression modeling.
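The "run once, save, then re-import" pattern used for both the geocoding and the weather retrieval can be sketched in a few lines of base R. The helper name `load_or_build()` and the stand-in `slow_step()` are hypothetical and exist only for this illustration; the original workflow saved its files manually and hosted them on GitHub rather than using a helper like this.

```r
# Hypothetical helper: run an expensive step once, then reuse the saved CSV
load_or_build <- function(path, build) {
  if (file.exists(path)) {
    return(read.csv(path))
  }
  result <- build()
  write.csv(result, path, row.names = FALSE)
  result
}

# Stand-in for the slow API loop; counts how many times it actually runs
calls <- 0
slow_step <- function() {
  calls <<- calls + 1
  data.frame(Date = as.Date("1940-08-09"), temperature_mean = 16.8)
}

cache_path <- tempfile(fileext = ".csv")
first  <- load_or_build(cache_path, slow_step)  # builds the result and writes the file
second <- load_or_build(cache_path, slow_step)  # reads the saved file instead

calls               # the slow step ran only once
class(first$Date)   # the freshly built result keeps its Date type
class(second$Date)  # no longer a Date: read.csv() does not restore date columns
```

The last two lines reflect something visible in the imported data above, where `Date` appears as `<chr>`: a CSV round trip loses column types, so dates need to be converted again (for example with `as.Date()`) before any date arithmetic.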
Before constructing a regression model, exploratory analysis will first be used to examine general patterns within the aviation accident data. Particular attention will be given to accident severity over time, as well as the potential relationship between weather conditions and fatality outcomes.
The following visualization examines how fatality rates have varied across the years represented within the dataset.
ggplot(data = aviation_weather, aes(x = Year, y = fatality_rate)) +
  geom_jitter(alpha = 0.15, width = 0.5, height = 0.02) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Fatality Rate across Aviation Incidents Over Time",
    x = "Year",
    y = "Fatality Rate"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
The visualization indicates that many aviation accidents within the dataset resulted in very high fatality rates, with a clear concentration of observations near a fatality rate of 1.0. This suggests that the retained dataset is heavily represented by severe accidents. At the same time, the fitted trend line shows a slight downward pattern across time, which may point to gradual improvements in aviation safety, aircraft technology, and operational practices. However, the wide spread of observations also shows that accident severity continues to vary substantially across individual cases.
The following visualization examines the relationship between maximum wind speed and accident fatality rate.
ggplot(data = aviation_weather, aes(x = wind_speed_max, y = fatality_rate)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Wind Speed and Aviation Accident Fatality Rate",
    x = "Maximum Wind Speed",
    y = "Fatality Rate"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
The visualization shows substantial scatter in fatality rate across the full range of wind speeds, indicating that wind conditions alone do not explain accident severity. The fitted trend line suggests a slight positive relationship, in which higher wind speeds may be associated with somewhat more severe crash outcomes on average. Even so, the broad spread of observations indicates that aviation accident severity is likely associated with several factors rather than wind conditions alone.
After completing the exploratory analysis, a linear regression model will be constructed in order to examine whether the selected aviation and weather-related variables appear associated with aviation accident fatality rate. The response variable for the model is fatality_rate, while the explanatory variables include persons aboard, year, mean temperature, precipitation, and maximum wind speed.
fatality_model <- lm(fatality_rate ~ Aboard + Year + temperature_mean + precipitation_sum + wind_speed_max, data = aviation_weather)
summary(fatality_model)
Call:
lm(formula = fatality_rate ~ Aboard + Year + temperature_mean +
precipitation_sum + wind_speed_max, data = aviation_weather)
Residuals:
Min 1Q Median 3Q Max
-0.8569 -0.1309 0.1477 0.1780 1.6100
Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)        1.3169853  0.6581659   2.001   0.0455 *
Aboard            -0.0024308  0.0001411 -17.226   <2e-16 ***
Year              -0.0002299  0.0003327  -0.691   0.4896
temperature_mean  -0.0013051  0.0005492  -2.376   0.0176 *
precipitation_sum  0.0002704  0.0006015   0.450   0.6531
wind_speed_max     0.0007219  0.0007407   0.975   0.3298
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3126 on 2589 degrees of freedom
Multiple R-squared: 0.1095, Adjusted R-squared: 0.1078
F-statistic: 63.65 on 5 and 2589 DF, p-value: < 2.2e-16
The regression model produced a statistically significant overall result (F-statistic p-value < 2.2e-16), suggesting that the selected explanatory variables collectively exhibit some relationship with aviation accident fatality rate. However, the adjusted R-squared value of approximately 0.108 indicates that the model explains only a modest portion of the variability in fatality outcomes. This suggests that aviation accident severity is likely associated with additional operational, mechanical, environmental, and human-related factors not captured within the present analysis.
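As a quick consistency check on the figures quoted above, the adjusted R-squared can be recomputed from the multiple R-squared, the number of observations, and the number of predictors using the standard adjustment formula. The inputs below are the rounded values printed in the model summary, so the result can only be expected to agree with the reported value up to rounding.

```r
r_squared <- 0.1095   # Multiple R-squared from the model summary
n_obs     <- 2589 + 6 # residual degrees of freedom plus 6 estimated coefficients
n_pred    <- 5        # predictors, excluding the intercept

# Standard adjustment: penalize R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_pred - 1)
round(adj_r_squared, 4)
```

With 2,595 observations and only five predictors the penalty is small, which is why the adjusted value sits so close to the unadjusted one.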
Among the explanatory variables, the number of persons aboard displayed a statistically significant negative relationship with fatality rate (p < 2e-16). However, this relationship should be interpreted cautiously, because Aboard is the denominator of fatality_rate. Part of the observed relationship may therefore reflect the mathematical structure of the response variable rather than an independent operational effect. With that caveat, the result is consistent with accidents involving larger aircraft showing somewhat lower proportional fatality outcomes on average.
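The caution about persons aboard can be made concrete with a small simulation on entirely synthetic data. In the sketch below, the fatality count is drawn without any reference to aircraft size and is only capped at the number aboard, yet regressing the resulting rate on the number aboard still produces a negative slope. The distributions, sample size, and seed are arbitrary assumptions chosen for illustration; the simulation says nothing about how much of the real coefficient is attributable to this effect.

```r
set.seed(607)  # arbitrary seed, used only for repeatability

n <- 1000
aboard <- sample(1:300, n, replace = TRUE)

# Fatalities drawn independently of aircraft size, then capped at persons aboard
fatalities <- pmin(aboard, rpois(n, lambda = 20))

# The rate divides by aboard, so large aircraft mechanically get small rates
rate <- fatalities / aboard

fit <- lm(rate ~ aboard)
coef(fit)[["aboard"]]  # negative, even though fatalities ignore aircraft size
```

This is the sense in which a ratio regressed on its own denominator can show a strong association that is partly built into the variable definition.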
Mean temperature also exhibited a statistically significant negative relationship with fatality rate (p = 0.0176), although the magnitude of the relationship remained relatively small. This may suggest that colder environmental conditions are modestly associated with more severe crash outcomes, though the relationship should still be interpreted cautiously.
In contrast, year (p = 0.4896), precipitation (p = 0.6531), and maximum wind speed (p = 0.3298) did not appear statistically significant within the fitted model. While the exploratory visualizations suggested that stronger wind conditions may exhibit a slight positive relationship with accident severity, the regression analysis indicates that this relationship weakens once multiple variables are considered simultaneously. This highlights the complexity of aviation accidents, where severity is likely associated with numerous interacting factors rather than any single variable alone.
After fitting the regression model, diagnostic plots were examined in order to assess whether the model residuals exhibited any major violations of linear regression assumptions. Particular attention was given to the distribution of residuals and the relationship between fitted values and residual spread.
par(mfrow = c(1, 2))
plot(
  fatality_model,
  which = 1
)
plot(
  fatality_model,
  which = 2
)
par(mfrow = c(1, 1))
The diagnostic plots suggest that the regression model captures some structure in the data, but several limitations remain. The residuals versus fitted plot shows clustering and uneven spread, while the Q-Q plot shows departures from normality, especially in the tails. The residual summary points the same way: because fatality rate cannot exceed 1, a maximum residual of 1.61 means that at least one fitted value falls below zero. Together, these patterns indicate that a linear model with constant-variance, normally distributed errors is only a rough approximation for a response bounded between 0 and 1, which is expected given the complexity of aviation accident severity.
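For a response that is a proportion of counts, one commonly used alternative is a binomial generalized linear model, which keeps fitted values between 0 and 1 by construction. The sketch below fits such a model to synthetic data only. It was not part of the original analysis, and the variable names, coefficients, and distributions are assumptions made purely to show the model form.

```r
set.seed(42)  # arbitrary seed, used only for repeatability

# Synthetic accident records (not the project data)
n <- 500
aboard <- sample(1:200, n, replace = TRUE)
temperature <- rnorm(n, mean = 12, sd = 10)

# Assumed true relationship on the logit scale, for simulation only
p_true <- plogis(1.5 - 0.01 * aboard - 0.02 * temperature)
fatalities <- rbinom(n, size = aboard, prob = p_true)
toy <- data.frame(aboard, temperature, fatalities)

# Model deaths and survivors as binomial counts instead of a raw rate
fit_glm <- glm(
  cbind(fatalities, aboard - fatalities) ~ aboard + temperature,
  family = binomial,
  data = toy
)

# Fitted probabilities always lie within [0, 1]
fitted_range <- range(fitted(fit_glm))
fitted_range
```

Unlike the linear model, this formulation also weights each accident by the number of persons aboard, so whether it suits the research question would need to be judged separately.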
This project examined historical aviation accident data in order to explore whether variables such as persons aboard, year, and weather conditions appeared associated with accident fatality severity. Through the integration of a structured aviation crash dataset, geocoding techniques, and API-based historical weather information, the analysis demonstrated how multiple data sources can be combined within a reproducible workflow.
The exploratory analysis and regression modeling suggested that aviation accident severity varies substantially from case to case and is only modestly explained by the selected variables. Only persons aboard and mean temperature displayed statistically significant relationships, with the persons aboard result requiring additional caution because of how fatality rate was constructed. As such, the findings reinforce the idea that aviation accident severity is associated with numerous interacting operational, environmental, mechanical, and human-related factors rather than any single contributing condition.
Grandi, S. (n.d.). Airplane crashes since 1908 [Data set]. Kaggle. https://www.kaggle.com/datasets/saurograndi/airplane-crashes-since-1908
Open-Meteo. (n.d.). Historical weather API. https://open-meteo.com/en/docs/historical-weather-api
Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science (2nd ed.). O’Reilly Media. https://r4ds.hadley.nz/