1 Executive Summary

The aim of this report is to investigate the survival rate, crash rate and causes of airplane crashes. The data set includes every recorded plane crash between January 1908 and June 2009. A report such as this can be used to identify major areas for improvement, allowing resources to be allocated correctly in airplane development and management. It can also be used to further understand the influence of technological advancements on society over the past century, and it is useful for customers deciding which airline is best to fly with. For these reasons, stakeholders may include airlines and aeronautical engineers.

During initial data analysis, some critical changes were made to create a data set appropriate for the research questions. Firstly, crashes with a survival rate of zero were removed, as many of their contributing factors were beyond the scope of the research questions. Secondly, as the first research question focused solely on commercial flights (taken as flights with more than 18 people aboard), those with 18 or fewer aboard were removed. Further, ground fatalities were not taken into consideration as they were, again, beyond the scope of the research questions.

After thorough analysis of the data, it was found that the rate of crashes has fallen and the survival rate has risen in recent years. This is most likely due to advancements in technology. Another key finding was that problems with the wing of the plane were a major contributing factor in airplane crashes. This information can be used as the starting point for further investigations to reduce the frequency of plane crashes.



2 Full Report

2.1 Initial Data Analysis (IDA)


The data come from a public dataset named “Airplane Crashes and Fatalities Since 1908” (full history of airplane crashes throughout the world, from 1908 to present), hosted on Open Data by Socrata. The dataset is very large, and it would be very difficult to make a judgment on every operator, so four operators were chosen as the focus. Three are airlines, each from a separate continent: Aeroflot (Europe), Pan American World Airways (America) and Philippine Air Lines (Asia). The fourth, the U.S. Air Force, was chosen as a military comparison.

This data is reasonably valid, as it is a historical record of fatalities caused by aircraft accidents rather than an opinion-based dataset. Its validity is improved by how extensive and detailed the data is, which allows the viewer to analyse it from many angles and draw a range of conclusions.

There are a few possible issues with this data set. The main ones that we are wary of are: falsely attributing the cause of a crash, unconfirmed crashes and inaccurate data from older crashes.

A high number of crashes resulted in every single person on board the aircraft perishing, which can make it difficult to properly attribute the actual cause of the crash. Additionally, each entry was filtered by searching for key words in the summary, so a crash may be recorded under a cause such as loss in engine speed when it was actually caused by pilot error.

A number of the entries describe planes that took off but never reached their destination, where the wreckage was never found or was found much later. This makes the data slightly unreliable, as the cause can only be hypothesised and it is difficult to confirm exactly how many people died as a result of the accident or whether some escaped.

Some of the older flights may have had inaccurate or non-existent passenger manifests, particularly private flights. This is difficult to ascertain, as the data does not explicitly say whether the number of people aboard was known with certainty.
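One rough way to gauge how incomplete the records are is to count missing or empty values per column. The sketch below does this on a small invented data frame; `toy` and its values are made up for illustration and are not rows from the dataset.

```r
# Toy data: invented values, not rows from the real dataset
toy <- data.frame(
  Time   = c("17:18", "", "06:30"),
  Aboard = c(2, NA, 5),
  stringsAsFactors = FALSE
)

# Count entries that are NA or an empty string in each column
missing_counts <- sapply(toy, function(col) sum(is.na(col) | col == ""))

print(missing_counts)
```

Applied to the full data frame, the same call would give a per-column count, which could help decide which columns are too sparse to rely on.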

Each row of this dataset represents a separate aircraft accident that resulted in a fatality. Columns are as follows:

• Date – The date at which the accident occurred, written as month, day and year

• Time – The time at which the accident occurred

• Location – The approximate area in which the plane had the accident

• Operator – The company or group the aircraft was registered to

• Flight – The designated name of the flight the aircraft was operating, e.g. MH370

• Route – The flight path the aircraft was traveling on.

• Type – The model of aircraft involved in the accident

• Registration – The alpha-numeric that the aircraft is registered as

• Cn/ln – An abbreviation for construction number and line number. This denotes the number assigned to the aircraft's production line and the position at which the specific aircraft rolled off that line.

• Aboard – The number of people that were inside the aircraft at the time of the accident.

• Fatalities – The number of people aboard that were killed as a result of the accident

• Ground – Number of fatalities that were on the ground and died as a result of the accident.

• Summary – A short written summary of what caused the accident to occur.

A number of columns were created from analysis of the data.

• Survival – The fraction of people that survived a particular accident in comparison to the number of people on board

• Power – Accidents that were attributed to a loss in engine power

• Struck – Accidents that were caused by something hitting the aircraft

• Wing – Accidents where the primary cause was mechanical issues with the wings

• Fuel – Accidents that were caused by a loss of fuel

• Trees – Accidents where the aircraft impacted trees

• Failed – Accidents where the aircraft failed to take off

• Shot down – Accidents where the aircraft was fired upon by another party (land or air), causing it to crash

• Cargo – Accidents where the cargo caused the aircraft to crash

• Error – Accidents attributed to pilot error.

# LOAD DATA v1 - uncomment the line below to load the data directly from a URL
#crashes = read.csv("dataset URL")

# LOAD DATA v2 - load the data from a local file

crashes = read.csv("Airplane_Crashes_and_Fatalities_Since_1908.csv")

# Quick look at the first 6 rows of data
head(crashes)
##         Date  Time                           Location
## 1 09/17/1908 17:18                Fort Myer, Virginia
## 2 07/12/1912 06:30            AtlantiCity, New Jersey
## 3 08/06/1913       Victoria, British Columbia, Canada
## 4 09/09/1913 18:30                 Over the North Sea
## 5 10/17/1913 10:30         Near Johannisthal, Germany
## 6 03/05/1915 01:00                    Tienen, Belgium
##                 Operator Flight..         Route                   Type
## 1   Military - U.S. Army          Demonstration       Wright Flyer III
## 2   Military - U.S. Navy            Test flight              Dirigible
## 3                Private        -                     Curtiss seaplane
## 4 Military - German Navy                        Zeppelin L-1 (airship)
## 5 Military - German Navy                        Zeppelin L-2 (airship)
## 6 Military - German Navy                        Zeppelin L-8 (airship)
##   Registration cn.In Aboard Fatalities Ground
## 1                  1      2          1      0
## 2                         5          5      0
## 3                         1          1      0
## 4                        20         14      0
## 5                        30         30      0
## 6                        41         21      0
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       Summary
## 1 During a demonstration flight, a U.S. Army flyer flown by Orville Wright nose-dived into the ground from a height of approximately 75 feet, killing Lt. Thomas E. Selfridge who was a passenger. This was the first recorded airplane fatality in history.  One of two propellers separated in flight, tearing loose the wires bracing the rudder and causing the loss of control of the aircraft.  Orville Wright suffered broken ribs, pelvis and a leg.  Selfridge suffered a crushed skull and died a short time later.
## 2                                                                                                                                                                                                                                                                                                                                                                                                                         First U.S. dirigible Akron exploded just offshore at an altitude of 1,000 ft. during a test flight.
## 3                                                                                                                                                                                                                                                                                                                                                                                              The first fatal airplane accident in Canada occurred when American barnstormer, John M. Bryant, California aviator was killed.
## 4                                                                                                                                                                                                                                                                                                        The airship flew into a thunderstorm and encountered a severe downdraft crashing 20 miles north of Helgoland Island into the sea. The ship broke in two and the control car immediately sank drowning its occupants.
## 5                                                                                                                                                                                                                                                                                                                                                                                    Hydrogen gas which was being vented was sucked into the forward engine and ignited causing the airship to explode and burn at 3,000 ft..
## 6                                                                                                                                                                                                                                                                                                                                                                                                                           Crashed into trees while attempting to land after being shot down by British and French aircraft.
## Size of data
dim(crashes)
## [1] 5268   13
## R's classification of data
class(crashes)
## [1] "data.frame"
## R's classification of variables
str(crashes)
## 'data.frame':    5268 obs. of  13 variables:
##  $ Date        : Factor w/ 4753 levels "01/01/1966","01/01/1970",..: 3297 2372 2699 3184 3682 852 3099 2585 3387 3469 ...
##  $ Time        : Factor w/ 1006 levels "","00:00","00:01",..: 702 200 1 757 385 36 609 1 36 986 ...
##  $ Location    : Factor w/ 4304 levels "","1,200 miles off Dakar, AtlantiOcean",..: 848 123 4179 3388 2294 4059 3117 2288 283 3550 ...
##  $ Operator    : Factor w/ 2477 levels "","A B Aerotransport",..: 1567 1578 1825 1466 1466 1466 1466 1465 1466 1466 ...
##  $ Flight..    : Factor w/ 725 levels "","-","002","004",..: 1 1 2 1 1 1 1 1 1 1 ...
##  $ Route       : Factor w/ 3245 levels ""," - Tegucigalpa - Toncontin",..: 831 2982 1 1 1 1 1 1 1 1 ...
##  $ Type        : Factor w/ 2447 levels "","AAC-1 Toucan",..: 2419 1145 1024 2433 2435 2446 2434 2148 2439 2438 ...
##  $ Registration: Factor w/ 4906 levels ""," / ","01 / 02 / 03",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ cn.In       : Factor w/ 3708 levels ""," / 185-5547",..: 171 1 1 1 1 1 1 1 1 1 ...
##  $ Aboard      : int  2 5 1 20 30 41 19 20 22 19 ...
##  $ Fatalities  : int  1 5 1 14 30 21 19 20 22 19 ...
##  $ Ground      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Summary     : Factor w/ 4674 levels "","31-7952246The pilot reported that he had a rough running engine and was making an emergency landing at Charlo A"| __truncated__,..: 1544 1676 3568 3213 1820 1189 1637 1203 2226 2266 ...
#sapply(crashes, class)


################ Added by me:

# Deleting incomplete columns
crashes$Registration <- NULL
crashes$cn.In <- NULL 

## Parse Date so R can read it, then keep only the year
crashes$Date <- as.Date(crashes$Date, "%m/%d/%Y")
## Warning in strptime(x, format, tz = "GMT"): unknown timezone 'zone/tz/
## 2018e.1.0/zoneinfo/Australia/Sydney'
## Store the year as a number so it is plotted and modelled as a continuous variable
crashes$Date <- as.numeric(format(crashes$Date, '%Y'))

## delete empty entries
exc.Numbers = (is.na(crashes$Fatalities) | is.na(crashes$Aboard))
crashes = crashes[!exc.Numbers, ]

## created new subset with data we'll use

crashes$survival = 1 - (crashes$Fatalities/crashes$Aboard)

# conditional statements taking only >18 passenger planes and eliminating total failure crashes
crashescleaned =  subset(crashes, crashes$Aboard > 18 & crashes$survival > 0)


2.2 How has the survival rate of passengers aboard commercial flight crashes changed over time?

The survival rate is calculated as 1 - (crashes$Fatalities/crashes$Aboard). Ground fatalities are disregarded because they are collateral to the crash, whereas the scope of this research is the crash itself. The data also excludes any plane with 18 or fewer people on board, as that is below the threshold used here for commercial flights (18 is taken as the maximum capacity of a private jet) and including such flights would skew the results.

Entries with a survival rate of 0 have also been removed, because completely fatal crashes skew any possible correlations or trends and are caused by factors beyond the scope of this question. However, survival rates of <0.01 (for example, 1 person surviving) have been kept. They are treated as effectively 0 and are taken to represent the new benchmark for ‘complete failure.’ The result is referred to as the ‘cleaned dataset’ in this project.
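As a minimal sketch of the cleaning rules described above, the following applies the same survival calculation and filters to a small made-up data frame. The data frame `toy` and the `complete_failure` flag are invented for illustration. They are not part of the dataset or of the analysis code in this report.

```r
# Toy data: invented values, not rows from the real dataset
toy <- data.frame(
  Aboard     = c(10, 19, 50, 120, 200),
  Fatalities = c( 2, 19, 25,   0, 199)
)

# Survival rate: fraction of the people aboard who survived
toy$survival <- 1 - toy$Fatalities / toy$Aboard

# Keep flights with more than 18 aboard and at least one survivor
toy_cleaned <- subset(toy, Aboard > 18 & survival > 0)

# Flag near-total losses (survival below 0.01) as 'complete failure'
toy_cleaned$complete_failure <- toy_cleaned$survival < 0.01

print(toy_cleaned)
```

In this toy example the small flight and the fully fatal crash are dropped, while the crash with a single survivor out of 200 is kept and flagged.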

################ Added by me:

#Survival rate calc
survival = 1 - (crashescleaned$Fatalities/crashescleaned$Aboard)



hist(survival, col = "lightblue", xlab = "Survival Rate", main ="Histogram of Survival Rates For Sampled Data")


At first glance, the cleaned sample data for the whole crash ‘population’ shows a wide spread, covering all the possible values from 0 to 1. Taking the IDA into account, the histogram suggests that there are more cases of complete survival than of complete annihilation (0-1 survivors). The data is also concentrated at the extremes (near 0 or 1), with comparatively few crashes near a 0.5 (50/50) survival rate. This suggests there is little middle ground in a crash, and that proper prevention could be far more beneficial than damage mitigation.
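The reading of the histogram above could be checked numerically by computing the share of crashes in low, middle and high survival bands. The sketch below uses invented survival rates, and the band limits of 0.2 and 0.8 are an arbitrary assumption chosen for illustration.

```r
# Invented survival rates for illustration only
survival_toy <- c(0.005, 0.02, 0.10, 0.45, 0.55, 0.90, 0.97, 1.00, 1.00, 1.00)

# Group into three bands; include.lowest keeps a value of exactly 0 in the first band
bands <- cut(survival_toy,
             breaks = c(0, 0.2, 0.8, 1),
             labels = c("low", "middle", "high"),
             include.lowest = TRUE)

# Proportion of crashes falling in each band
band_share <- prop.table(table(bands))

print(band_share)
```

A small share in the middle band relative to the two outer bands would support the claim that outcomes cluster at the extremes.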

However, it is difficult to find any meaningful correlation from this histogram alone, so to answer the question we must look at individual operators and assess their crash survival rates over time to determine whether there are any improvements.

crashesoperator = subset(crashescleaned, crashescleaned$Operator == "Aeroflot")
crashesoperator2 = subset(crashescleaned, crashescleaned$Operator == "Pan American World Airways")
crashesoperator3 = subset(crashescleaned, crashescleaned$Operator == "Military - U.S. Air Force")
crashesoperator4 = subset(crashescleaned, crashescleaned$Operator == "Philippine Air Lines")


xop = crashesoperator$Date
yop = crashesoperator$survival

xop2 = crashesoperator2$Date 
yop2 = crashesoperator2$survival

xop3 = crashesoperator3$Date 
yop3 = crashesoperator3$survival

xop4 = crashesoperator4$Date 
yop4 = crashesoperator4$survival

L = lm(yop ~ xop)


par(mfrow = c(1,2))

plot(xop, yop, pch = 16,col = "blue", xlab = "Date", ylab = "Survival Rate" , main = "Survival Rates Aeroflot")

plot(xop2, yop2, pch = 16,col = "red", xlab = "Date", ylab = "Survival Rate" , main = "Survival Rates Pan American World Airways")

plot(xop3, yop3, pch = 16,col = "green", xlab = "Date", ylab = "Survival Rate" , main = "Military - U.S. Air Force")

plot(xop4, yop4, pch = 16,col = "turquoise", xlab = "Date", ylab = "Survival Rate" , main = "Philippine Air Lines")

The plots suggest that Aeroflot is an anomaly among the four operators. Although the four subsets differ in their number of entries, the other three operators show a general trend upwards towards complete or near-complete survival over time, whereas Aeroflot does not. Aeroflot also has a considerably higher number of recorded entries. Assuming the original data set includes every recorded major crash, this indicates that Aeroflot could be less reliable and potentially riskier to fly with than the other operators, whether due to internal issues such as maintenance or to external confounding factors that need to be explored in further studies.

However, disregarding this outlier, the general trend shows an improvement in plane crash survival rates over time. This study also takes into account how often crash records occur: an operator with few recorded crashes, and survival rates consistently at 1, can be considered successful. The remaining three operators show fewer crash records over time, again indicating a consistent improvement for both military and commercial operators.
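The upward trend described above is judged by eye from the scatter plots. One way to quantify it would be to fit a straight line to survival rate against year and read off the slope. The sketch below does this for invented values. The data frame `toy_op` stands in for a single hypothetical operator and is not drawn from the dataset.

```r
# Invented yearly survival rates for one hypothetical operator
toy_op <- data.frame(
  year     = c(1950, 1960, 1970, 1980, 1990, 2000),
  survival = c(0.10, 0.20, 0.30, 0.40, 0.50, 0.60)
)

# Year must be numeric; a character or factor year would be fitted as categories
fit <- lm(survival ~ year, data = toy_op)

# Slope: estimated change in survival rate per year
slope <- unname(coef(fit)["year"])

print(slope)
```

A positive slope would indicate improving survival rates for that operator. Calling abline(fit) after a scatter plot of the same two variables would overlay the fitted line on the plot.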

2.3 How has the rate of crashes per year changed?

Using the cleaned dataset, we can view the frequency of the crashes as a function of time in years.

# Calculating crash frequency over time
fdate = as.data.frame(table(crashescleaned$Date))
plot(fdate, pch = 14, col = "green", xlab = "Date", ylab = "Frequency", main = "Frequency of Crashes by Year")         

We can observe that the number of crashes per year follows a roughly bell-shaped curve, rising to a peak and then declining as we approach the present day. This reduction in the number of crashes runs contrary to the large increase in airplane use over the same period. Comparing the crash counts with air traffic data could show that, although there are more planes in the sky now than ever, the number of crashes has decreased, possibly due to safer internal policies or other factors.
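The comparison suggested above would need traffic figures from a second source, which this dataset does not contain. As a sketch under that assumption, the following divides crash counts by a hypothetical number of departures to give a crash rate. Every number in `toy_rates` is invented for illustration.

```r
# Invented figures: neither the crash counts nor the departures are real data
toy_rates <- data.frame(
  decade              = c(1970, 1980, 1990),
  crashes             = c(40, 30, 20),
  departures_millions = c(10, 20, 40)  # hypothetical traffic figures
)

# Crashes per million departures: a rate that accounts for traffic growth
toy_rates$crashes_per_million <- toy_rates$crashes / toy_rates$departures_millions

print(toy_rates)
```

A rate like this would separate a genuine improvement in safety from a simple change in the number of flights.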

Pairing this with the trends observed in the first segment, we can observe that crashes are becoming less frequent and that survival rates are going up in the rare event that one does happen.


2.4 What were the main factors involved in airplane crashes?

By examining the factors involved in these crashes, we can provide preliminary leads for where further research should be carried out to improve survival rates and decrease crash occurrences overall. This can be achieved by graphing the crash frequency for 10 common keywords extracted from the Summary column.

crashescleaned$Summary <- sapply(crashescleaned$Summary, as.character)

crashescleaned$power <- grepl(pattern = "power", crashescleaned$Summary, ignore.case = TRUE)

crashescleaned$struck <- grepl(pattern = "struck", crashescleaned$Summary, ignore.case = TRUE)

crashescleaned$wing <- grepl(pattern = "wing", crashescleaned$Summary, ignore.case = TRUE)

crashescleaned$fuel <- grepl(pattern = "fuel", crashescleaned$Summary, ignore.case = TRUE)

crashescleaned$trees <- grepl(pattern = "trees", crashescleaned$Summary, ignore.case = TRUE)

crashescleaned$failed <- grepl(pattern = "failed", crashescleaned$Summary, ignore.case = TRUE)

crashescleaned$shot_down <- grepl(pattern = "shot down", crashescleaned$Summary, ignore.case = TRUE)


crashescleaned$cargo <- grepl(pattern = "cargo", crashescleaned$Summary, ignore.case = TRUE)


# Lowercase names match the sums below and avoid overwriting the original Ground column
crashescleaned$ground <- grepl(pattern = "ground", crashescleaned$Summary, ignore.case = TRUE)

crashescleaned$error <- grepl(pattern = "error", crashescleaned$Summary, ignore.case = TRUE)





Engine = sum(crashescleaned$power, na.rm=TRUE)
Struck = sum(crashescleaned$struck, na.rm=TRUE)
Wing = sum(crashescleaned$wing, na.rm=TRUE)
Fuel = sum(crashescleaned$fuel, na.rm=TRUE)
Trees = sum(crashescleaned$trees, na.rm=TRUE)
Failed = sum(crashescleaned$failed, na.rm=TRUE)
Shot_Down = sum(crashescleaned$shot_down, na.rm=TRUE)
Cargo = sum(crashescleaned$cargo, na.rm=TRUE)
Ground = sum(crashescleaned$ground, na.rm=TRUE)
Error = sum(crashescleaned$error, na.rm=TRUE)

causesy = c(Engine,Struck,Wing,Fuel, Trees, Failed, Shot_Down, Cargo, Ground, Error)


causesx = c('Engine', 'Struck','Wing','Fuel','Trees','Failed','Shot Down','Cargo','Ground','Error')

library(lattice)
dotplot(causesy~causesx, main = "Frequency of Keywords in Crash Summary", xlab = "Cause", ylab = "Frequency")

As can be seen from the chart, the keyword that appeared most often was ‘wing’, suggesting that many of the crashes involved the wing of the plane in some way or another. The wing accounts for more than 70% of an airplane’s functionality, so this data provides preliminary insights into where subsequent research can be conducted to make improvements. In contrast, ground-related accidents appear few and far between over the dataset’s lifespan. Further research could compare this with the survival rates of ground-related accidents to see whether fixing these issues would yield much of an improvement in the survival rate, i.e. if the survival rate is already 1, it would be a waste of time.
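One caveat with keyword counts is that a plain substring search for ‘wing’ also matches unrelated words such as ‘following’, which may inflate the count. The sketch below contrasts a substring match with a whole-word match on invented summaries. The example sentences are made up and are not taken from the dataset, and the whole-word pattern is a suggested alternative rather than the method used in this report.

```r
# Invented example summaries, not taken from the dataset
summaries <- c(
  "The left wing separated in flight.",
  "Crashed following an engine failure.",
  "Lost control after the wings iced over.",
  "Struck trees while landing in fog."
)

# Plain substring match: also hits 'following'
substring_hits <- grepl("wing", summaries, ignore.case = TRUE)

# Word-boundary match: only 'wing' or 'wings' as whole words
word_hits <- grepl("\\bwings?\\b", summaries, ignore.case = TRUE, perl = TRUE)

print(sum(substring_hits))
print(sum(word_hits))
```

Comparing the two counts on the real Summary column would show how much of the ‘wing’ frequency comes from unrelated words.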


3 References

Style: APA

Aiden (2013, June 5). Airplane Crashes and Fatalities Since 1908. Retrieved from https://opendata.socrata.com/Government/Airplane-Crashes-and-Fatalities-Since-1908/q2te-8cvq

Cai, E. (2015, February 3). How to get the frequency table of a categorical variable as a data frame in R [Web log post]. Retrieved from https://www.r-bloggers.com/how-to-get-the-frequency-table-of-a-categorical-variable-as-a-data-frame-in-r/

National Aeronautics and Space Administration (2015). Retrieved September 13, 2018, from https://www.grc.nasa.gov/www/k-12/airplane/airplane.html

User, S. (2018, April 3). An overview of keyword extraction techniques [Web log post]. Retrieved from https://www.r-bloggers.com/an-overview-of-keyword-extraction-techniques/


4 Personal reflection on group work

  • The way I contributed was

Writing the IDA

  • What I learnt about group work was …

The importance of forming a good plan of attack and of agreeing the order in which each person needs to complete each task. Also, that constant communication is the only way for a group to function efficiently.