Step 1: Start with a problem

Rate of plane fatalities can be used to inform travel decisions. Flying is a safer form of travel.

Step 2: Elicit questions

Number of plane fatalities in last 100 years. Number of plane fatalities in last 100 years by region. Main outcome = fatalities (frequency, continuous) Factor 1 = Time (continuous) Factor 2 = Country (nominal)

Step 3: Data sources and data description

Data source = http://www.planecrashinfo.com/database.htm

Observations = records of civil, commercial and military aviation flights worldwide from 1908 to present. Plane crashes with 10 or more fatalities.

Variables: Date = day-month-year Location / operator = region / flight operator Aircraft type / registration = type of aircraft / registration information Fatalities = onboard / fatilities (ground crew)

The data

The data were scraped from a website, cleaned, and wrangled in R. The final variables in the dataset include:

date = year-month-day
year = year
month = month
aircraft_type_registration = type of aircraft, type of registration
operator = flight operator
military = military or non-military flight
location = region
nfatality = number of fatalities
nonboard = number on board
ngcrew = number of ground crew
= whether the flight was over, near or none

Approaches to exploring the data

Aviation remains one of the safest modes of transportation. But the heuristic effect can dissuade individuals from flying, despite excellent passenger safety records. The two questions I wanted to answer are:

Has the frequency of aviation fatalities changed over time?
Does the frequency of aviation fatalities vary by region?

#load libraries
library(rvest)
library(tidyverse)
library(here)
library(stringi)
library(stringr)
library(lubridate)
library(janitor)
library(maps)
library(ggthemes)
library(viridis)
library(USAboundaries)

## Generate empty dataframe to store data

#read html page
html <- read_html("http://www.planecrashinfo.com/1920/1920.htm")

#extract table
table_html  <- html %>% 
  html_element(xpath ="/html/body/div/center/table" ) %>% 
  html_table(header  = TRUE) 

#extract column names and convert to list
col_names <- c(colnames(table_html) )
  
#generate a 1 x 4 matrix of zeros, then convert to dataframe
data <- matrix(0, 1, 4) %>% 
  data.frame() %>% 
  .[-1,] #remove first empty row
  
#add column names from target table
colnames(data) <- col_names

## Scrape data

#root html page:
root_html <- "http://www.planecrashinfo.com/"

#loop through each page and extract tables
#this step can take a few minutes
for (i in 1920:2021){
  
  #html target
  html_target <- paste0(root_html, i, "/", i, ".htm")
  
  #import information from website
  html_extract <- read_html(html_target)
  
  #extract table
  html_table  <- html_extract %>%
    html_element(xpath ="/html/body/div/center/table") %>% 
    html_table(header  = TRUE, trim = TRUE) 
  
  #add new data to existing data
  data <- rbind(data, html_table)
  
}

#save extracted data
data_raw <- data
data <- data_raw

## data cleaning

#clean variable names
data <- clean_names(data)

#parse date to YYYY-MM-DD format
data <- data %>% 
  mutate(date = dmy(date))   %>% 
  mutate(year = year(date))  %>% 
  mutate(month = month(date))

#split location and operator data:
data <- data %>% 
  mutate(location_operator_s = str_split(location_operator, pattern = "(?<=[a-z]{2})(?=[A-Z])"))

#rename and recombine new split column:
data <- data %>% 
  rename(location = "...1",
         op1 = "...2", 
         op2 = "...3") %>% 
  mutate(operator = paste(op1, op2)) %>%
  mutate(operator = str_remove_all(operator, "NA")) %>% 
  select(-op1, -op2, -location_operator,
         date, operator, location, aircraft_type_registration, fatalities)

#split fatality
data <- data %>% 
  separate(col    = fatalities, into = c("nfatality", "nonboard"), sep = "/")  %>% 
  mutate(ngcrew   = str_extract_all(nonboard, "(?<=\\().+?(?=\\))")) %>%
  mutate(nonboard = str_remove_all(nonboard, "\\(.*")) 

#create variable for military or non-military
data <- data %>% 
  mutate(military = if_else(str_detect(location, "Military - "), "military","notmillitary")) %>%
  mutate(operator = str_remove_all(operator, "Military - "))


# create variable for location qualifier:
data <- data %>% 
  mutate(location_qualifer = case_when(str_detect(location, "Off ")  ~ "off", 
                                       str_detect(location, "Near ") ~ "near", 
                                       str_detect(location, "Over the ") ~ "over")) %>% 
  mutate(location_qualifer = replace_na(location_qualifer, "none")) 

#clean variables
data <- data %>%  
  mutate(location   = str_remove_all(location, "Off "))  %>% 
  mutate(location   = str_remove_all(location, "Near ")) %>% 
  mutate(location   = str_remove_all(location, "Over the "))     %>% 
  mutate(nfatality  = str_remove_all(nfatality,  "[[:punct:]]")) %>% 
  mutate(nonboard   = str_remove_all(nonboard, "[[:punct:]]"))   %>% 
  mutate(ngcrew     = str_remove_all(ngcrew,  "[[:punct:]]"))

#extract region
data <- data %>% 
  mutate(location = str_extract(location, pattern = '\\b[^,]+$'))

#reorder variables
data <- data %>% 
  select(date, year, month, aircraft_type_registration, operator, military, location, location_qualifer,
         nfatality, nonboard, ngcrew)       %>% 
  mutate(nfatality = as.numeric(nfatality)) %>% 
  mutate(nonboard  = as.numeric(nonboard))  %>% 
  mutate(ngcrew  = as.numeric(ngcrew))

Answering the two questions with plots

Has the frequency of aviation fatalities changed over time?

Aesthetic Mappings

As I am interested in the number of fatalities over time, time will be represented along the x-axis whereas frequency of fatalities will be represented along the y-axis. Looking at the plot, there is clearly an increase in fatalities in the 1950s, but this has decreased substantially by the 2000s.

## Explore data

#generate totals for each year
year_sum <- data %>% 
  group_by(year) %>% 
  summarise(nfatality_total = sum(nfatality))


#plot deaths per year
ggplot(year_sum, aes(x = year, y = nfatality_total))+
  geom_point()+
  geom_smooth(method = "loess", se=TRUE)

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 8 rows containing non-finite values (stat_smooth).

## Warning: Removed 8 rows containing missing values (geom_point).

#clean data for regional plot

#clean USA data
usa_states <- us_states() 
usa_states <- usa_states$state_name 

#i know i can use case_when. 
#but i'm getting an error when i try to str_detect over vector usa_states
#so i have to use this solution instead 
usa_states <- usa_states %>% 
  data.frame() %>% 
  rename(location = ".")

#extract only USA sates
usa_regions <- inner_join(data, 
                          usa_states,
                          by="location")

#replace all USA states with USA
usa_regions <- usa_regions %>% 
  mutate(location = "USA")

#extract non-USA regions
non_usa_regions <- anti_join(data, 
                             usa_states,
                             by="location")

#recombine data
data <- rbind(usa_regions, non_usa_regions)

#summarise fatality data by region
fatality_summary <- data %>% 
  group_by(location) %>% 
  summarise(nfatality = sum(nfatality))%>% 
  rename("region" = "location")

Answering the two questions with plots

Does the frequency of aviation fatalities vary by region?

Aesthetic Mappings

As I am interested in the number of fatalities across region, I need to first specify the longitudinal coordinates along the x-acis and the latitude along the y-aixs. I will group regions by group, and use the number of fatalities as the fill variable. This will colour different regions by number of fatalities.

Using the two plots, it is clear that some regions account for a higher propotion of deaths including the USA, Russia, and some parts of northern South America. This is probably associated wit the population size/density of those regions. Other regions such as Oceania, Asia, Africa, Canada and Sourthern South America have fewer fatalities. Note that there is no data for some regions such as China.

#load world map
world_map <- map_data("world")

#map fatality
fatality_map <- inner_join(fatality_summary, 
                           world_map, 
                           by = "region") 

#plot world maps
ggplot(fatality_map)+
  geom_polygon(aes(x = long, y = lat, group=group, fill=nfatality))+
  theme_map()

#plot world maps with different colours
ggplot(fatality_map)+
  geom_polygon(aes(x = long, y = lat, group=group, fill=nfatality))+
  theme_map()+
  scale_fill_viridis(na.value="orange")+
  annotate("text", 
           x = -160, y = -20, 
           label = "Orange = No data", color="orange", fontface=2, size=4)

Plane Fatalities

Michael Do

07-08-2021

Step 1: Start with a problem

Step 2: Elicit questions

Step 3: Data sources and data description

The data

Approaches to exploring the data

Answering the two questions with plots

Has the frequency of aviation fatalities changed over time?

Aesthetic Mappings

Answering the two questions with plots

Does the frequency of aviation fatalities vary by region?

Aesthetic Mappings