Synopsis:

The purpose of this project is to explore the online available data on airplane crashes since 1908 and answer questions like -

  • Which countries saw most of the airplane crashes in the history?
  • How have these crashes and fatalities due to them varied with time?
  • What was the major reason for these crashes and has this changed with time?

While answering these questions the report also illustrates getting data from open online sources, challenges faced in cleaning the data and visualization of results.

Dataset and source

Source

opendata.socrata.com - Colloection of publically available datasets available online for exploring and analysis

Dataset used

Airplane Crashes and Fatalities Since 1908 - Which contains full history of airplane crashes throughout the world since 1908 to 2009. The dataset is hosted by Open Data by socrata

Data Description

Variable Description Sample Values
Date Date of airplane crash Spet 1908 to June 2009
Time Time of accident time in 24 hr format
Location Location of accident City, country e.g. Cleveland, Ohio OR nearest landmark e.g. Off West Hartlepool, England
Operator Ownner of the aircraft e.g. Military - German Navy, Private, US Aerial Mail Service
Flight # Unique flight number given to each flight e.g. 2L272, 739/14
Route The source and destination of the flight E.G. Lympne, England - Rotterdam, The Netherlands
Aboard Number of people aboard integer values
Fatalities Number of deaths integer values (Check - should be less than or equal to Aboard)
Summary summary of the incident / information known afterwards e.g. Shot down by British aircraft.

The Codebook and details about how each value was recorded is not available at the data source. The above explation of each field was gathered by insepecting the values in each field.

Libraries Used

library(RSocrata) # To get airplane crash data from opendata.socrata.com
library(readr) # To get online csv files for countries, states and cities mapping
library(tibble) # Convert tables tibble
library(dplyr) # Data Manipulations such as joins, mutate, group_by, summarize, etc.
library(tidyverse) # Tidying data
library(lubridate) # Use functions to convert strings to date and time formats
library(stringr) # To do string operations using regualr expressions
library(stringi) # Converting accented characters to english characters
library(wordcloud2) # Making wordcloud
library(wordcloud) # Making wordcloud
library(maps) # plotting maps
library(tm) # Text mining on the summary column to prepare data for word cloud
library(DT) # View interactive HTML table in the output

Dataset Download

Data_url <- "https://opendata.socrata.com/Government/Airplane-Crashes-and-Fatalities-
             Since-1908/q2te-8cvq"

Crash_raw_data <- read.socrata(Data_url) %>% 
  as_tibble()

Data Cleaning

The Following steps were taken for data cleaning: –

  1. It was observed that same values in the operator and type fields were recorded in both upper and lower case

    • All the values in these fields were converted to upper case
    • Number of distinct values in ‘Operator’ field reduced from 2477 to 2407
    • Number of distinct values in ‘Type’ filed reduced from 2447 to 2409
  2. The Date and Time values were Strings

    • These values were converted to date and time format
    • Missing Time values were assumed to be 0 Hrs and replaced
  3. Extracting Country, State and city from the location column

    • Challenges - There is no particular pattern for the values of this field
    • For some countries (e.g. United States), country name is not present and only state name is present
    • The approach taken here is to match the location text with list of all possible country names and extract the country name which matches
    • To get the list of countries, states, cities and their latitude and Longitude (used for plotting maps later), online map data is used
# Making these fields to upper case to remove inconsistencies
Crash_raw_data <- Crash_raw_data %>% 
  mutate(Operator = str_to_upper(Operator,locale="en"),
         Type = str_to_upper(Type,locale="en"))

# Convert Date and time fields from string to date and time type
crash_data_datefrmt <- Crash_raw_data %>% 
  mutate(Date=mdy(Date),
         Time=hm(Time)) 

# Replace missing 'Time' values to 0 Hours (midnight)
crash_data_datefrmt[is.na(crash_data_datefrmt$Time),"Time"] <- hms("00:00:00")


#---------Extract Crash Country, State and City from the 'Location' column-------#

# To get the list of countries, states, cities and their latitude and Longitude online map data is used

# Data download
cities_url <- "http://simplemaps.com/static/demos/resources/world-cities/world_cities.csv"
cities_data <- read_csv(cities_url) %>% 
  select(city,lat,lng,country,province) %>%  # Keeping only required Columns
  mutate(country=ifelse(country=="United States of America","United States",country),# Inconsistencies in country names
         country=ifelse(country=="United States Virgin Islands","Virgin Islands",country),
         country=ifelse(country=="United Kingdom","England",country),
         country=ifelse(country=="Bulgaria","Bulg[ea]ria",country),# Remove Spelling inconsistencies in country names
         country=ifelse(country=="India","India$",country)) # Scenario - Indiana Does not extract India as country
head(cities_data)
## # A tibble: 6 × 5
##            city     lat     lng     country province
##           <chr>   <dbl>   <dbl>       <chr>    <chr>
## 1 Qal eh-ye Now 34.9830 63.1333 Afghanistan  Badghis
## 2   Chaghcharan 34.5167 65.2500 Afghanistan     Ghor
## 3   Lashkar Gah 31.5830 64.3600 Afghanistan  Hilmand
## 4        Zaranj 31.1120 61.8870 Afghanistan   Nimroz
## 5    Tarin Kowt 32.6333 65.8667 Afghanistan  Uruzgan
## 6  Zareh Sharan 32.8500 68.4167 Afghanistan  Paktika
# Generate string of countries to extract
world_countries <- cities_data %>% 
  select(country) %>% 
  distinct(country) %>%
  str_c() %>% 
  str_replace_all('\"',"") %>% 
  str_replace_all(", ","|") %>% 
  str_replace_all("\n","") %>% 
  str_sub(start=3,end=-2)
world_countries
## [1] "Afghanistan|Aland|Albania|Algeria|American Samoa|Andorra|Angola|Antarctica|Antigua and Barbuda|Argentina|Armenia|Aruba|Australia|Austria|Azerbaijan|Bahrain|Bangladesh|Barbados|Belarus|Belgium|Belize|Benin|Bermuda|Bhutan|Bolivia|Bosnia and Herzegovina|Botswana|Brazil|Brunei|Bulg[ea]ria|Burkina Faso|Burundi|Cambodia|Cameroon|Canada|Cape Verde|Cayman Islands|Central African Republic|Chad|Chile|China|Colombia|Comoros|Congo (Brazzaville)|Congo (Kinshasa)|Cook Islands|Costa Rica|Croatia|Cuba|Curacao|Cyprus|Czech Republic|Denmark|Djibouti|Dominica|Dominican Republic|East Timor|Ecuador|Egypt|El Salvador|Equatorial Guinea|Eritrea|Estonia|Ethiopia|Falkland Islands|Faroe Islands|Federated States of Micronesia|Fiji|Finland|France|French Polynesia|Gabon|Georgia|Germany|Ghana|Gibraltar|Greece|Greenland|Grenada|Guam|Guatemala|Guinea|Guinea Bissau|Guyana|Haiti|Honduras|Hong Kong S.A.R.|Hungary|Iceland|India$|Indonesia|Iran|Iraq|Ireland|Isle of Man|Israel|Italy|Ivory Coast|Jamaica|Japan|Jordan|Kazakhstan|Kenya|Kiribati|Kosovo|Kuwait|Kyrgyzstan|Laos|Latvia|Lebanon|Lesotho|Liberia|Libya|Liechtenstein|Lithuania|Luxembourg|Macau S.A.R|Macedonia|Madagascar|Malawi|Malaysia|Maldives|Mali|Malta|Marshall Islands|Mauritania|Mauritius|Mexico|Moldova|Monaco|Mongolia|Montenegro|Morocco|Mozambique|Myanmar|Namibia|Nepal|Netherlands|New Caledonia|New Zealand|Nicaragua|Niger|Nigeria|North Korea|Northern Cyprus|Northern Mariana Islands|Norway|Oman|Pakistan|Palau|Palestine|Panama|Papua New Guinea|Paraguay|Peru|Philippines|Poland|Portugal|Puerto Rico|Qatar|Romania|Russia|Rwanda|Saint Kitts and Nevis|Saint Lucia|Saint Vincent and the Grenadines|Samoa|San Marino|Sao Tome and Principe|Saudi Arabia|Senegal|Serbia|Seychelles|Sierra Leone|Singapore|Slovakia|Slovenia|Solomon Islands|Somalia|Somaliland|South Africa|South Georgia and the Islands|South Korea|South Sudan|Spain|Sri Lanka|Sudan|Suriname|Svalbard and Jan Mayen Islands|Swaziland|Sweden|Switzerland|Syria|Taiwan|Tajikistan|Tanzania|Thailand|The Bahamas|The Gambia|Togo|Tonga|Trinidad and Tobago|Tunisia|Turkey|Turkmenistan|Turks and Caicos Islands|Tuvalu|Uganda|Ukraine|United Arab Emirates|England|United States|Virgin Islands|Uruguay|Uzbekistan|Vanuatu|Vatican (Holy Sea)|Venezuela|Vietnam|Western Sahara|Yemen|Zambia|Zimbabwe"
# Similarly create string for State and Cities (Only US considered for this level of analysis)
world_provinces <- cities_data %>% 
  filter(country %in% c("United States")) %>%
  select(province) %>% 
  distinct(province) %>% 
  arrange(province) %>% 
  str_c() %>% 
  str_replace_all('\"',"") %>% 
  str_replace_all(", ","|") %>% 
  str_replace_all("\n","") %>% 
  str_sub(start=3,end=-2)

world_cities <- cities_data %>%
  filter(country=="United States") %>% 
  select(city) %>% 
  distinct(city) %>% 
  filter(str_length(city)>3) %>% # Remove cities with less than 3 letters name (avoid wrong match)
  stri_trans_general("Latin-ASCII") %>%
  str_c() %>% 
  str_replace_all('\"',"") %>% 
  str_replace_all(", ","|") %>% 
  str_replace_all("\n","") %>% 
  str_sub(start=3,end=-2)

# Match and Extract Country, State and city from the 'Location' column
Add_crash_country <- crash_data_datefrmt %>% 
  select(-Flight..,-Route,-Registration,-cn.In,-Ground) %>% 
  mutate(
    Crash_country = str_to_upper(str_extract(Location,regex(world_countries,ignore_case = TRUE))),
    crash_state_US = str_to_upper(str_extract(Location,regex(world_provinces, ignore_case = TRUE))),
    crash_city_US = str_to_upper(str_extract(Location,regex(world_cities, ignore_case = TRUE))),
    Crash_country = ifelse(!is.na(crash_state_US) & is.na(Crash_country),"UNITED STATES",Crash_country)) # Country name for US not present so adding based on State name

# Check the number of observations and location fields for which the coutry mapping is not present
not_extracted <- Add_crash_country %>% 
  filter(is.na(Crash_country))
nrow(not_extracted)
## [1] 436
# 436 out of 5268 (8%) observations where the country location was not extracted. Can be improved by adding specific rules for unextracted countries

-436 out of 5268 (8.2763857%) observations where the country location was not extracted. Can be improved by adding specific rules for unextracted countries

-Online map data for world cities - Data for mapping of Countries, States, Cities and their latitude/longitude.

# DT::datatable(Add_crash_country,options= list(pageLength=5))
# knitr::kable(head(Add_crash_country,n=5))
head(Add_crash_country)
## # A tibble: 6 × 11
##         Date         Time                           Location
##       <date> <S4: Period>                             <fctr>
## 1 1908-09-17   17H 18M 0S                Fort Myer, Virginia
## 2 1912-07-12    6H 30M 0S            AtlantiCity, New Jersey
## 3 1913-08-06           0S Victoria, British Columbia, Canada
## 4 1913-09-09   18H 30M 0S                 Over the North Sea
## 5 1913-10-17   10H 30M 0S         Near Johannisthal, Germany
## 6 1915-03-05     1H 0M 0S                    Tienen, Belgium
## # ... with 8 more variables: Operator <chr>, Type <chr>, Aboard <int>,
## #   Fatalities <int>, Summary <fctr>, Crash_country <chr>,
## #   crash_state_US <chr>, crash_city_US <chr>
summary(Add_crash_country)
##       Date                 Time                         
##  Min.   :1908-09-17   Min.   :0S                        
##  1st Qu.:1954-04-11   1st Qu.:0S                        
##  Median :1973-03-04   Median :6H 15M 0S                 
##  Mean   :1971-10-23   Mean   :7H 41M 54.8633257403199S  
##  3rd Qu.:1990-06-26   3rd Qu.:14H 50M 0S                
##  Max.   :2009-06-08   Max.   :4d 18H 20M 0S             
##                                                         
##                    Location      Operator             Type          
##                        :  20   Length:5268        Length:5268       
##  Sao Paulo, Brazil     :  15   Class :character   Class :character  
##  Moscow, Russia        :  15   Mode  :character   Mode  :character  
##  Rio de Janeiro, Brazil:  14                                        
##  Bogota, Colombia      :  13                                        
##  Manila, Philippines   :  13                                        
##  (Other)               :5178                                        
##      Aboard         Fatalities                         Summary    
##  Min.   :  0.00   Min.   :  0.00                           : 390  
##  1st Qu.:  5.00   1st Qu.:  3.00   Crashed during takeoff. :  15  
##  Median : 13.00   Median :  9.00   Crashed into a mountain.:  12  
##  Mean   : 27.55   Mean   : 20.07   Crashed into the sea.   :  12  
##  3rd Qu.: 30.00   3rd Qu.: 23.00   Crashed while en route. :  12  
##  Max.   :644.00   Max.   :583.00   Crashed on takeoff.     :  11  
##  NA's   :22       NA's   :12       (Other)                 :4816  
##  Crash_country      crash_state_US     crash_city_US     
##  Length:5268        Length:5268        Length:5268       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
## 

Analysis

By Geography

# Number of fatalities and incidents by country
Fatalities_by_countrty <- Add_crash_country %>% 
  group_by(Crash_country) %>% 
  summarize(Fatalities=sum(Fatalities,na.rm=TRUE),
            Incidents=n())

# Data for the map using the maps package
all_states <- map_data("world") %>% 
  mutate(region=ifelse(region=="USA","United States",region),
         region=ifelse(region=="UK","England",region),
         region=str_to_upper(region))

# Plotting number of air crash Incidents in each country
all_states %>% 
  left_join (Fatalities_by_countrty, by = c("region"="Crash_country")) %>% 
  filter(lat > -50 & lat <90 & long>-130 & long < 170) %>% 
  ggplot() +
  geom_polygon(aes(x=long,y=lat,group=group,fill=Incidents))+
  scale_fill_gradient(low="#FF9933",high = "#993300",space="Lab", na.value ="grey50")+
  theme(legend.position = c(0.95,0.5))+
  ggtitle("Number of air crash incidents in every country")+
  theme(axis.line=element_blank(),
        axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks=element_blank(),
        axis.title.x=element_blank(),
        axis.title.y=element_blank())

# Plotting number of Fatalities for each country
all_states %>% 
  left_join (Fatalities_by_countrty, by = c("region"="Crash_country")) %>% 
  filter(lat > -50 & lat <90 & long>-130 & long < 170) %>% 
  ggplot() +
  geom_polygon(aes(x=long,y=lat,group=group,fill=Fatalities))+
  scale_fill_gradient(low="#56B157",high = "black",space="Lab",na.value ="grey50")+
  theme(legend.position = c(0.95,0.5))+
  ggtitle("Number of air crash fatalities in every country")+
  theme(axis.line=element_blank(),
        axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks=element_blank(),
        axis.title.x=element_blank(),
        axis.title.y=element_blank())

Exploring the United states Incidents in detail

# US in detail

# latitude and Longitude for each city of United States
cities_for_join <- cities_data %>%
  filter(country == "United States") %>% 
  mutate(city = str_to_upper(city)) %>%
  select(city,lng,lat) %>% 
  filter(lat > 25 & lat < 50 & lng > -124 & lng < -68)

# Citywise Fatalities and number of crashes 
# joined with latitude and longidue for plottong map
City_wise <- Add_crash_country %>% 
  filter(!is.na(crash_city_US) & !is.na(crash_state_US)) %>% 
  group_by(crash_city_US) %>% 
  summarize(Fatalities=sum(Fatalities,na.rm=TRUE),Number_crashes = n()) %>% 
  left_join(cities_for_join,by = c("crash_city_US"="city"))

# US map data
us_states <- map_data("state")

# Number of crashes For different cities of united states
us_states %>% 
  ggplot()+
  geom_polygon(aes(x=long,y=lat,group=group),colour="white",size=0.3)+
  geom_point(data=City_wise, aes(x=lng, y=lat, size=Number_crashes), colour="red")+
  geom_text(data=City_wise,hjust=0.7, vjust=-0.7, aes(x=lng, y=lat, label=crash_city_US),colour="gold2",size=1.9)+
  theme(legend.position = c(0.9,0.2)) +
  guides(size=guide_legend(title="Crashes"))+
  ggtitle("Number of crashes for cities in United States")+
  theme(axis.line=element_blank(),
        axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks=element_blank(),
        axis.title.x=element_blank(),
        axis.title.y=element_blank())

# Number of Fatalities For different cities of united states
us_states %>% 
  ggplot()+
  geom_polygon(aes(x=long,y=lat,group=group),colour="white",size=0.3)+
  geom_point(data=City_wise, aes(x=lng, y=lat, size=Fatalities,colour= Fatalities))+
  scale_colour_gradient(low="#FFCCCC",high="#FF0000")+
  geom_text(data=City_wise,hjust=0.7, vjust=-0.7, aes(x=lng, y=lat, label=crash_city_US),colour="gold2",size=1.7)+
  theme(legend.position = c(0.95,0.2)) +
  guides(size=guide_legend(title="Fatalities"))+
  ggtitle("Number of fatalities for cities in United States")+
  theme(axis.line=element_blank(),
        axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks=element_blank(),
        axis.title.x=element_blank(),
        axis.title.y=element_blank())

Over Time

# World-wide Number of Crashes over time
Add_crash_country %>% 
  mutate(Crash.Year=year(Date)) %>% 
  group_by(Crash.Year) %>% 
  summarize(No.Of.Crashes=n(),
            No.Fatalities=sum(Fatalities)) %>% 
  ggplot() +
  geom_line(aes(x=Crash.Year,y=No.Of.Crashes),size=1)+
  geom_smooth(aes(x=Crash.Year,y=No.Of.Crashes),colour="red")+
  ggtitle("World-wide Number of Crashes over time")+
  labs(x="Year",y="Number of Crashes")

# World-wide number of fatalities over  time
Add_crash_country %>% 
  mutate(Crash.Year=year(Date)) %>% 
  group_by(Crash.Year) %>% 
  summarize(No.Of.Crashes=n(),
            No.Fatalities=sum(Fatalities)) %>% 
  ggplot() +
  geom_line(aes(x=Crash.Year,y=No.Fatalities),colour="red",size=1)+
  geom_smooth(aes(x=Crash.Year,y=No.Fatalities),colour="black")+
  ggtitle("World-wide Number of fatalities over time")+
  labs(x="Year",y="Fatalities")

Text Analysis

Analysis of the summary field which gives the description of every incident to understand the major reasons of failures during different periods

# Word cloud showing Summary of crash for different time periods
make_word_cloud <- function(start_year,end_year)
{
  crash_decade_data <- Add_crash_country %>% 
    filter(year(Date)>= start_year & year(Date)<= end_year)
  
  docs <- Corpus(VectorSource(crash_decade_data$Summary))
  summary(docs)
  
  MyStopWords <- c("aircraft","plane","crashed","flight","pilot")
  
  dtm <- docs %>% 
    tm_map(removePunctuation) %>%
    tm_map(removeNumbers) %>% 
    tm_map(tolower) %>% 
    tm_map(removeWords,c(stopwords("english"),MyStopWords)) %>% 
    tm_map(stripWhitespace) %>% 
    tm_map(PlainTextDocument) %>% 
    DocumentTermMatrix()
  
  m <- as.matrix(dtm)
  v <- sort(colSums(m),decreasing=TRUE)
  head(v,14)
  words <- names(v)
  d <- data.frame(word=words, freq=v)
  
  d <- d %>%
    head(200)
  
wordcloud(d$word,d$freq,min.freq=50,colors=brewer.pal(8, "Dark2"),scale=c(4,0.4),
            max.words=150,random.order=FALSE,rot.per=0.05 ,use.r.layout=FALSE)
}

Crash Summary for time period: 1905 - 1915 (First ten years after the invention by wright brothers)

make_word_cloud(1905,1915)

Crash Summary for time period: 1916 - 1938 (World War I and Period after that)

make_word_cloud(1916,1938)

Crash Summary for time period: 1939 - 1945 (World War II)

make_word_cloud(1939,1945)

Summary description cloud shows war related words like - ‘Shot’, ‘Military’, etc. as major reason for crashes

Crash Summary for time period: 1946 - 1970

make_word_cloud(1946,1970)

Crash Summary for time period: 1971 - 1990

make_word_cloud(1971,1990)

Crash Summary for time period: 1991 - 2009

make_word_cloud(1991,2009)

Findings and Insights

  • United States has seen significantly more number of crashes and fatalities resulting from them than any other country. This can be attributed to the fact that invention of airplanes was done in United States therefore it saw lots of early failures.

  • Although there have been many incidients in United States, most of them have resulted in no fatalities

  • There was a peak observed in number of incidents and fatalities during the period 1965-1975 and since there has been decline in both these metrics

  • The text analysis for summary of incidents in the first decade of invention show major incidents due to equipment failure ( words like - Hydrogen, Gas, Ignited, Exploded are unique to this period)

  • The text analysis of summary of incidents during the World War II period show major accidents due to war and military ( words like - Shot, Military, Bomber are unique to this period)

  • The text analysis for the incidents of past 20 years show that major reasons are landing and takeoff on runway.