Rate of plane fatalities can be used to inform travel decisions. Flying is a safer form of travel.
Number of plane fatalities in last 100 years. Number of plane fatalities in last 100 years by region. Main outcome = fatalities (frequency, continuous) Factor 1 = Time (continuous) Factor 2 = Country (nominal)
Data source = http://www.planecrashinfo.com/database.htm
Observations = records of civil, commercial and military aviation flights worldwide from 1908 to present. Plane crashes with 10 or more fatalities.
Variables: Date = day-month-year Location / operator = region / flight operator Aircraft type / registration = type of aircraft / registration information Fatalities = onboard / fatilities (ground crew)
The data were scraped from a website, cleaned, and wrangled in R. The final variables in the dataset include:
Aviation remains one of the safest modes of transportation. But the heuristic effect can dissuade individuals from flying, despite excellent passenger safety records. The two questions I wanted to answer are:
#load libraries
library(rvest)
library(tidyverse)
library(here)
library(stringi)
library(stringr)
library(lubridate)
library(janitor)
library(maps)
library(ggthemes)
library(viridis)
library(USAboundaries)
## Generate empty dataframe to store data
#read html page
html <- read_html("http://www.planecrashinfo.com/1920/1920.htm")
#extract table
table_html <- html %>%
html_element(xpath ="/html/body/div/center/table" ) %>%
html_table(header = TRUE)
#extract column names and convert to list
col_names <- c(colnames(table_html) )
#generate a 1 x 4 matrix of zeros, then convert to dataframe
data <- matrix(0, 1, 4) %>%
data.frame() %>%
.[-1,] #remove first empty row
#add column names from target table
colnames(data) <- col_names
## Scrape data
#root html page:
root_html <- "http://www.planecrashinfo.com/"
#loop through each page and extract tables
#this step can take a few minutes
for (i in 1920:2021){
#html target
html_target <- paste0(root_html, i, "/", i, ".htm")
#import information from website
html_extract <- read_html(html_target)
#extract table
html_table <- html_extract %>%
html_element(xpath ="/html/body/div/center/table") %>%
html_table(header = TRUE, trim = TRUE)
#add new data to existing data
data <- rbind(data, html_table)
}
#save extracted data
data_raw <- data
data <- data_raw
## data cleaning
#clean variable names
data <- clean_names(data)
#parse date to YYYY-MM-DD format
data <- data %>%
mutate(date = dmy(date)) %>%
mutate(year = year(date)) %>%
mutate(month = month(date))
#split location and operator data:
data <- data %>%
mutate(location_operator_s = str_split(location_operator, pattern = "(?<=[a-z]{2})(?=[A-Z])"))
#rename and recombine new split column:
data <- data %>%
rename(location = "...1",
op1 = "...2",
op2 = "...3") %>%
mutate(operator = paste(op1, op2)) %>%
mutate(operator = str_remove_all(operator, "NA")) %>%
select(-op1, -op2, -location_operator,
date, operator, location, aircraft_type_registration, fatalities)
#split fatality
data <- data %>%
separate(col = fatalities, into = c("nfatality", "nonboard"), sep = "/") %>%
mutate(ngcrew = str_extract_all(nonboard, "(?<=\\().+?(?=\\))")) %>%
mutate(nonboard = str_remove_all(nonboard, "\\(.*"))
#create variable for military or non-military
data <- data %>%
mutate(military = if_else(str_detect(location, "Military - "), "military","notmillitary")) %>%
mutate(operator = str_remove_all(operator, "Military - "))
# create variable for location qualifier:
data <- data %>%
mutate(location_qualifer = case_when(str_detect(location, "Off ") ~ "off",
str_detect(location, "Near ") ~ "near",
str_detect(location, "Over the ") ~ "over")) %>%
mutate(location_qualifer = replace_na(location_qualifer, "none"))
#clean variables
data <- data %>%
mutate(location = str_remove_all(location, "Off ")) %>%
mutate(location = str_remove_all(location, "Near ")) %>%
mutate(location = str_remove_all(location, "Over the ")) %>%
mutate(nfatality = str_remove_all(nfatality, "[[:punct:]]")) %>%
mutate(nonboard = str_remove_all(nonboard, "[[:punct:]]")) %>%
mutate(ngcrew = str_remove_all(ngcrew, "[[:punct:]]"))
#extract region
data <- data %>%
mutate(location = str_extract(location, pattern = '\\b[^,]+$'))
#reorder variables
data <- data %>%
select(date, year, month, aircraft_type_registration, operator, military, location, location_qualifer,
nfatality, nonboard, ngcrew) %>%
mutate(nfatality = as.numeric(nfatality)) %>%
mutate(nonboard = as.numeric(nonboard)) %>%
mutate(ngcrew = as.numeric(ngcrew))
As I am interested in the number of fatalities over time, time will be represented along the x-axis whereas frequency of fatalities will be represented along the y-axis. Looking at the plot, there is clearly an increase in fatalities in the 1950s, but this has decreased substantially by the 2000s.
## Explore data
#generate totals for each year
year_sum <- data %>%
group_by(year) %>%
summarise(nfatality_total = sum(nfatality))
#plot deaths per year
ggplot(year_sum, aes(x = year, y = nfatality_total))+
geom_point()+
geom_smooth(method = "loess", se=TRUE)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 8 rows containing non-finite values (stat_smooth).
## Warning: Removed 8 rows containing missing values (geom_point).
#clean data for regional plot
#clean USA data
usa_states <- us_states()
usa_states <- usa_states$state_name
#i know i can use case_when.
#but i'm getting an error when i try to str_detect over vector usa_states
#so i have to use this solution instead
usa_states <- usa_states %>%
data.frame() %>%
rename(location = ".")
#extract only USA sates
usa_regions <- inner_join(data,
usa_states,
by="location")
#replace all USA states with USA
usa_regions <- usa_regions %>%
mutate(location = "USA")
#extract non-USA regions
non_usa_regions <- anti_join(data,
usa_states,
by="location")
#recombine data
data <- rbind(usa_regions, non_usa_regions)
#summarise fatality data by region
fatality_summary <- data %>%
group_by(location) %>%
summarise(nfatality = sum(nfatality))%>%
rename("region" = "location")
As I am interested in the number of fatalities across region, I need to first specify the longitudinal coordinates along the x-acis and the latitude along the y-aixs. I will group regions by group, and use the number of fatalities as the fill variable. This will colour different regions by number of fatalities.
Using the two plots, it is clear that some regions account for a higher propotion of deaths including the USA, Russia, and some parts of northern South America. This is probably associated wit the population size/density of those regions. Other regions such as Oceania, Asia, Africa, Canada and Sourthern South America have fewer fatalities. Note that there is no data for some regions such as China.
#load world map
world_map <- map_data("world")
#map fatality
fatality_map <- inner_join(fatality_summary,
world_map,
by = "region")
#plot world maps
ggplot(fatality_map)+
geom_polygon(aes(x = long, y = lat, group=group, fill=nfatality))+
theme_map()
#plot world maps with different colours
ggplot(fatality_map)+
geom_polygon(aes(x = long, y = lat, group=group, fill=nfatality))+
theme_map()+
scale_fill_viridis(na.value="orange")+
annotate("text",
x = -160, y = -20,
label = "Orange = No data", color="orange", fontface=2, size=4)