Overview

Aviation is a complex field that has changed tremendously over the past decades. Avionics have improved, aircraft have become more sophisticated, and automation has reduced pilot workload. But have the accident and fatality rates improved? Answering that question requires looking at the statistics. After each accident, the aviation industry makes further improvements to aircraft, regulations, training, and more. This project aims to find out whether aviation accident and fatality rates improved from 1975 to 2014.

Introduction

The aviation_accidents data used for this project was obtained from the NTSB.gov website. For each year, the data provides the total number of accidents, the number of accidents with fatalities, the total number of fatalities (on board and on the ground), the number of fatalities aboard, the flight hours (estimated by the Federal Aviation Administration; see the notes), the number of accidents per 100000 flight hours, and the number of fatalities per 100000 flight hours. The purpose of this project is to find out whether the fatality rate and the accident rate improved from 1975 to 2014. The data has missing values for Flight Hours and for both ratios per 100000 flight hours for 2011. This project is a general overview of the accident and fatality rates. It does not provide any data or information on the causes of the accidents and fatalities, and it does not include military aviation accidents and fatalities. For further information, refer to the notes at the end.

Exploring the Data

#Load data "Aviation Accidents" and store it into environment.
aviation_accidents <- read.csv("table10_2014.csv")

#Review dataset "Aviation Accidents"
str(aviation_accidents)
## 'data.frame':    41 obs. of  8 variables:
##  $ Year                                  : int  1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 ...
##  $ All.Accidents                         : Factor w/ 38 levels "1,221","1,224",..: 35 36 37 38 34 33 32 31 30 29 ...
##  $ Fatal.Accidents                       : int  633 658 661 719 631 618 654 591 555 545 ...
##  $ Total.Fatalities                      : Factor w/ 39 levels "1,042","1,068",..: 7 4 8 10 5 6 9 3 2 1 ...
##  $ Fatalities.Aboard                     : Factor w/ 38 levels "1,021","1,061",..: 6 4 8 9 4 5 7 3 2 1 ...
##  $ Flight.Hours                          : Factor w/ 40 levels "","18,103,000",..: 31 35 36 37 40 38 39 34 30 32 ...
##  $ All.accidents.per.100000.flight.hours : num  13.87 13.17 12.91 12.08 9.88 ...
##  $ All.fatalities.per.100000.flight.hours: num  2.19 2.16 2.09 2.06 1.63 1.69 1.78 1.96 1.92 1.84 ...

The dataset contains 41 observations of 8 columns covering aviation accidents from 1975 through 2014. The first column, Year, identifies the observation; the other 7 columns are the variables: All Accidents, Fatal Accidents, Total Fatalities, Fatalities Aboard, Flight Hours (estimated by the Federal Aviation Administration; see the notes), All Accidents per 100000 Flight Hours (ratio), and All Fatalities per 100000 Flight Hours (ratio). The Flight Hours, All Accidents per 100000 Flight Hours, and All Fatalities per 100000 Flight Hours values are missing for 2011 (see the notes for more information).

Cleaning and Transforming Data

After reviewing the “Aviation Accidents” dataset with the str() function, we noticed that the All.Accidents, Total.Fatalities, Fatalities.Aboard, and Flight.Hours variables are stored as factors. These factor variables have to be transformed into numeric variables (this section comes before “Exploring the Data (continued)” so that the summary() function is applied to numeric values).

#Transform factor variable All.accidents into a numeric variable.
aviation_accidents$All.Accidents <- as.character(aviation_accidents$All.Accidents)

aviation_accidents$All.Accidents <- as.numeric(gsub(",", "", aviation_accidents$All.Accidents))

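#Transform factor variable Total.Fatalities into a numeric variable.
#Convert to character and strip the commas first: as.numeric() applied directly to a factor returns its level codes, not the counts.
aviation_accidents$Total.Fatalities <- as.numeric(gsub(",", "", as.character(aviation_accidents$Total.Fatalities)))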

#Transform factor variable Fatalities.Aboard into a numeric variable.
aviation_accidents$Fatalities.Aboard <- as.character(aviation_accidents$Fatalities.Aboard)

aviation_accidents$Fatalities.Aboard <- as.numeric(gsub(",", "", aviation_accidents$Fatalities.Aboard))

#Transform factor variable Flight.Hours into a numeric variable.
aviation_accidents$Flight.Hours <- as.character(aviation_accidents$Flight.Hours)

aviation_accidents$Flight.Hours <- as.numeric(gsub(",", "", aviation_accidents$Flight.Hours))
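
The conversion steps above can be sketched as one reusable helper. This is only an illustration: `to_number` is a hypothetical name, and the input values are toy data, not rows from the NTSB table. It also shows why a factor should be converted to character before calling as.numeric(). (In R 4.0 and later, read.csv() leaves text columns as character by default; the helper handles both cases.)

```r
# Hypothetical helper (not part of the original analysis): convert
# comma-formatted counts such as "1,252" into numbers.
to_number <- function(x) {
  # as.character() matters when x is a factor; gsub() then drops the commas.
  x <- gsub(",", "", as.character(x))
  x[x %in% ""] <- NA_character_  # treat blank cells as missing
  as.numeric(x)
}

# Toy values for illustration only; the blank mimics a missing entry.
raw <- factor(c("1,252", "419", ""))

codes  <- as.numeric(raw)  # pitfall: returns the factor's level codes
counts <- to_number(raw)   # intended: the numbers themselves, NA for the blank

print(codes)
print(counts)
```

Applying the same helper to every comma-formatted column keeps the conversions consistent.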

Exploring the Data (continued)

# Summary of data
summary(aviation_accidents)
##       Year      All.Accidents  Fatal.Accidents Total.Fatalities
##  Min.   :1975   Min.   :1221   Min.   :222.0   Min.   : 1.00   
##  1st Qu.:1985   1st Qu.:1654   1st Qu.:314.0   1st Qu.:11.00   
##  Median :1995   Median :2021   Median :401.0   Median :19.00   
##  Mean   :1995   Mean   :2294   Mean   :420.7   Mean   :19.76   
##  3rd Qu.:2005   3rd Qu.:2739   3rd Qu.:498.0   3rd Qu.:29.00   
##  Max.   :2014   Max.   :4216   Max.   :719.0   Max.   :39.00   
##                                                                
##  Fatalities.Aboard  Flight.Hours     
##  Min.   : 386.0    Min.   :18103000  
##  1st Qu.: 558.0    1st Qu.:23656250  
##  Median : 723.0    Median :25794500  
##  Mean   : 764.8    Mean   :26605850  
##  3rd Qu.: 945.0    3rd Qu.:28704500  
##  Max.   :1398.0    Max.   :38641000  
##                    NA's   :1         
##  All.accidents.per.100000.flight.hours
##  Min.   : 6.260                       
##  1st Qu.: 6.848                       
##  Median : 7.880                       
##  Mean   : 8.467                       
##  3rd Qu.: 9.540                       
##  Max.   :13.870                       
##  NA's   :1                            
##  All.fatalities.per.100000.flight.hours
##  Min.   :1.120                         
##  1st Qu.:1.300                         
##  Median :1.535                         
##  Mean   :1.556                         
##  3rd Qu.:1.750                         
##  Max.   :2.190                         
##  NA's   :1

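The ratios of all accidents per 100000 flight hours range from 6.26 to 13.87, and the ratios of all fatalities per 100000 flight hours range from 1.12 to 2.19. The Total.Fatalities column of the printout (1 to 39) appears to show factor level codes rather than fatality counts, so it should not be read as counts; the analysis below uses 1252 fatalities for 1975 and 419 for 2014.

To see the trend in the accident and fatality rates per 100000 flight hours from 1975 to 2014, we will look at the scatterplots.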
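
The rate columns can be related back to the counts: a rate per 100000 flight hours is the count divided by the flight hours, multiplied by 100000. A minimal sketch, using the 1975 and 2014 accident counts and flight hours that appear later in the analysis (`rate_per_100k` is a hypothetical helper, not part of the original code):

```r
# Hypothetical helper: events per 100000 flight hours.
rate_per_100k <- function(events, flight_hours) {
  events / flight_hours * 100000
}

# Accident counts and flight hours for 1975 and 2014, as used in the
# hypothesis tests later in this report.
accidents    <- c("1975" = 3995, "2014" = 1221)
flight_hours <- c("1975" = 28799000, "2014" = 18103000)

rates <- round(rate_per_100k(accidents, flight_hours), 2)
print(rates)
```

The 1975 value can be compared with the first entry of All.accidents.per.100000.flight.hours in the str() output (13.87).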

#Create a scatterplot and review relationship between the years and accident rate per 100000 flight hours.
plot(aviation_accidents$Year, aviation_accidents$All.accidents.per.100000.flight.hours, type = "o", main = "Accident Ratio per 100000 Flight Hours", xlab = "YEAR", ylab = "ACCIDENTS (PER 100000 FH)")

The scatterplot shows that the accident rate per 100000 flight hours has decreased overall over the years. From about 1998 through 2014 the rate did not change much. The gap in the line is due to the missing values for 2011.

#Create a scatterplot and review relationship between the years and fatality rate per 100000 flight hours.
plot(aviation_accidents$Year, aviation_accidents$All.fatalities.per.100000.flight.hours, type = "o", main = "Fatality Ratio per 100000 Flight Hours", xlab = "YEAR", ylab = "FATALITIES (PER 100000 FH)")

The fatality rate scatterplot shows a trend similar to the one in the accident rate scatterplot. Because the fatality rate spans a smaller range (1.12 to 2.19, compared to 6.26 to 13.87 for the accident rate), smaller changes are easier to see: there is a noticeable drop in the fatality rate before 1980 and an increase between 1991 and 1995. After 1998 the fatality rate dropped and remained stable. The gap in the line is due to the missing values for 2011.

Analysis

Inference for Two Independent Proportions (Accident Analysis).

Since our primary goal is to find out whether the aviation accident and fatality rates have improved over the past decades, we will use inference for two independent proportions.

We first test whether the accident rate per total flight hours improved from 1975 to 2014.

\(H_0: p_1 = p_2\)

\(H_A: p_1 > p_2\)

Our null hypothesis is that the rate of accidents per total flight hours was the same in 1975 and in 2014. The alternative hypothesis is that the rate was lower in 2014 than in 1975. Here p1 is the rate in 1975 and p2 is the rate in 2014. The test is one-tailed, and we will compare the p-value to a significance level of 0.005.
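
For reference, the pooled two-proportion z-test that the code below computes can be written out as follows, where \(x_i\) is the number of accidents and \(n_i\) the flight hours in each year. Treating each flight hour as one trial is the modeling assumption the code makes:

\(\hat{p}_1 = \frac{x_1}{n_1}, \quad \hat{p}_2 = \frac{x_2}{n_2}, \quad \hat{p}_{pool} = \frac{x_1 + x_2}{n_1 + n_2}\)

\(SE = \sqrt{\frac{\hat{p}_{pool}(1 - \hat{p}_{pool})}{n_1} + \frac{\hat{p}_{pool}(1 - \hat{p}_{pool})}{n_2}}\)

\(z = \frac{\hat{p}_1 - \hat{p}_2}{SE}, \quad \text{p-value} = P(Z > z)\)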

#Store the flight hours n1 and n2 and the accident counts x1 and x2 for 1975 and 2014, then the pooled proportion.

n1 <- 28799000
n2 <- 18103000
x1 <- 3995
x2 <- 1221
p_pool <- (x1+x2)/(n1+n2)

#Find and store the Standard Error.
SE <- sqrt(p_pool*(1 - p_pool)/n1 + p_pool*(1 - p_pool)/n2) 

#Find and store point_estimate and Test statistic.
point_estimate <- x1/n1 - x2/n2
ts <- point_estimate/SE

#Find the p-value.
pnorm(ts, lower.tail = FALSE)
## [1] 9.588377e-113

Since the p-value is smaller than the significance level of 0.005, we reject the null hypothesis.

Conclusion: the evidence shows that the proportion of accidents to flight hours has decreased from 1975 to 2014.
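
As a cross-check, base R's prop.test() runs the same pooled comparison when the continuity correction is switched off. A minimal sketch, reusing the 1975 and 2014 accident counts and flight hours from above; the variable names `x`, `n`, `check`, and `z` are introduced here for illustration only:

```r
# Accident counts and flight hours for 1975 and 2014, as used above.
x <- c(3995, 1221)
n <- c(28799000, 18103000)

# correct = FALSE turns off the continuity correction so that the result
# is comparable to the hand-computed pooled z-test.
check <- prop.test(x = x, n = n, alternative = "greater", correct = FALSE)

# prop.test() reports X-squared; its square root is the magnitude of z
# (z is positive here because the 1975 proportion is the larger one).
z <- sqrt(unname(check$statistic))

print(z)
print(check$p.value)
```

With these inputs the p-value should agree with the hand calculation, since the X-squared statistic is the square of the z statistic.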

Inference for Two Independent Proportions (Fatality Analysis).

Our second goal is to find out whether the fatality rate has improved since 1975. Why is this needed if we have already shown that the accident rate has improved? Because even with fewer accidents per 100000 flight hours, the fatality rate could still be higher if, for example, more people died per accident after 1975 (note: as the economy changes over time, more or fewer people are able to fly, so the number of people on board each aircraft may change significantly as well). Question: has the fatality rate improved from 1975 to 2014?

\(H_0: p_1 = p_2\)

\(H_A: p_1 > p_2\)

Our null hypothesis is that the rate of fatalities per total flight hours was the same in 1975 and in 2014. The alternative hypothesis is that the rate was lower in 2014 than in 1975. Here p1 is the rate in 1975 and p2 is the rate in 2014. The test is one-tailed, and we will compare the p-value to a significance level of 0.005.

#Store the flight hours n1 and n2 and the fatality counts x1 and x2 for 1975 and 2014, then the pooled proportion.

n1 <- 28799000
n2 <- 18103000
x1 <- 1252
x2 <- 419
p_pool <- (x1+x2)/(n1+n2)

#Find and store the Standard Error.
SE <- sqrt(p_pool*(1 - p_pool)/n1 + p_pool*(1 - p_pool)/n2) 

#Find and store point_estimate and Test statistic.
point_estimate <- x1/n1 - x2/n2
ts <- point_estimate/SE

#Find the p-value.
pnorm(ts, lower.tail = FALSE)
## [1] 3.50254e-30

Since the p-value is smaller than the significance level of 0.005, we reject the Null hypothesis.

Conclusion: the evidence shows that the rate of fatalities per flight hour improved from 1975 to 2014.
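
Both analyses follow the same steps, so they can be wrapped in one function. The sketch below is illustrative rather than the original code: `pooled_z_test` and `alpha` are hypothetical names, and the inputs are the counts and flight hours used in the two tests above.

```r
# Hypothetical helper: one-sided pooled z-test for H_A: p1 > p2.
pooled_z_test <- function(x1, n1, x2, n2) {
  p_pool <- (x1 + x2) / (n1 + n2)
  se <- sqrt(p_pool * (1 - p_pool) / n1 + p_pool * (1 - p_pool) / n2)
  z <- (x1 / n1 - x2 / n2) / se
  list(z = z, p_value = pnorm(z, lower.tail = FALSE))
}

n1 <- 28799000  # flight hours, 1975
n2 <- 18103000  # flight hours, 2014

accident_test <- pooled_z_test(x1 = 3995, n1 = n1, x2 = 1221, n2 = n2)
fatality_test <- pooled_z_test(x1 = 1252, n1 = n1, x2 = 419, n2 = n2)

alpha <- 0.005  # significance level used in this report
p_values <- c(accidents = accident_test$p_value,
              fatalities = fatality_test$p_value)

print(p_values)
print(p_values < alpha)  # TRUE means the null hypothesis is rejected
```

Keeping the formula in one place avoids reusing the names x1, x2, and p_pool for two different analyses.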

Conclusions

Based on the data used in this analysis, it appears that the total accident and total fatality rates in aviation improved from 1975 to 2014.

Limitations

This data has a few limitations. The most important one is that the causes of the accidents are unknown, so it is unclear whether the accident and fatality rates improved because of automation, better crew resource management, or other reasons. To understand the primary causes of the improvement, more data should be collected, and a separate study should be conducted to find out why the accident and fatality rates in civil aviation have improved over the past decades. It should also be noted that increased automation can increase pilot complacency, which introduces a new cause of accidents; therefore, while certain areas of aviation improve, other areas still need attention.

Flight hours and both rates are also missing for 2011. Because the two tests compare only 1975 and 2014, the missing values leave a gap in the plots but do not affect the decision to reject the null hypotheses.


NOTES:

2014 data are preliminary.

Flight hours are estimated by the Federal Aviation Administration. Miles flown and departure information for general aviation operations is not available. Also, note that the 2011 estimates are not currently available. The FAA is engaged in re-calibration efforts.

Suicide, sabotage and stolen/unauthorized aircraft cases, included in “Accidents” and “Fatalities”, but excluded from accident rates in this table are: 1995 (10 acc., 6 fatal acc.); 1996 (4, 0); 1997 (5, 2); 1998 (6, 4); 1999 (3, 1); 2000 (7, 7); 2001 (3, 1); 2002 (7, 6); 2003 (4, 3); 2004 (3, 0); 2005 (2, 1); 2006 (2, 1); 2007 (2, 2); 2008 (2, 0); 2009 (3, 0); 2010 (3, 2); 2011 (1, 0); 2012 (1, 1); 2013 (3, 3); 2014 (0, 0).

The 706 total fatalities in 2006 includes the 154 persons killed aboard a foreign registered Boeing 737 aircraft operated by Gol Airlines when it collided with an Embraer Legacy 600 business jet over the Brazilian Amazon jungle.

49 CFR Part 830.1 pertains to accidents that involve civil aircraft and certain public aircraft of the United States “wherever they occur.” For the year 2014, the total number of accidents includes 23 U.S. registered (N-numbered) aircraft accidents that occurred outside the United States, its territories, or its possessions.

This document was produced as a final project for MAT 143H - Introduction to Statistics (Honors) at North Shore Community College.
The course was led by Professor Billy Jackson.
Student Name: Anna Rodrigues
Semester: Fall 2018