Summary

Kaggle data on the history of airplane crashes reveals a decrease in the total number of crashes over the last 30 years. However, the percentage of passangers surviving a crash has remained stable since 1990, amounting to approximately 20%. Unfortunately, we are not more likley to survive a crash than we were 30 years ago.


#loading data
air_crashes=read.csv("input/Airplane_Crashes_and_Fatalities_Since_1908.csv")
str(air_crashes)
## 'data.frame':    5268 obs. of  13 variables:
##  $ Date        : Factor w/ 4753 levels "01/01/1966","01/01/1970",..: 3297 2372 2699 3184 3682 852 3099 2585 3387 3469 ...
##  $ Time        : Factor w/ 1006 levels "","00:00","00:01",..: 702 200 1 758 385 36 609 1 36 986 ...
##  $ Location    : Factor w/ 4304 levels "","1,200 miles off Dakar, AtlantiOcean",..: 848 123 4179 3388 2294 4059 3117 2288 284 3550 ...
##  $ Operator    : Factor w/ 2477 levels "","A B Aerotransport",..: 1567 1578 1825 1466 1466 1466 1466 1465 1466 1466 ...
##  $ Flight..    : Factor w/ 725 levels "","-","002","004",..: 1 1 2 1 1 1 1 1 1 1 ...
##  $ Route       : Factor w/ 3245 levels ""," - Tegucigalpa - Toncontin",..: 831 2982 1 1 1 1 1 1 1 1 ...
##  $ Type        : Factor w/ 2447 levels "","Aérospatiale SE-210 Caravelle",..: 2419 1145 1032 2433 2435 2446 2434 2148 2439 2438 ...
##  $ Registration: Factor w/ 4906 levels ""," / ","01 / 02 / 03",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ cn.In       : Factor w/ 3708 levels ""," / 185-5547",..: 171 1 1 1 1 1 1 1 1 1 ...
##  $ Aboard      : int  2 5 1 20 30 41 19 20 22 19 ...
##  $ Fatalities  : int  1 5 1 14 30 21 19 20 22 19 ...
##  $ Ground      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Summary     : Factor w/ 4674 levels "","31-7952246The pilot reported that he had a rough running engine and was making an emergency landing at Charlo Airport. While ma"| __truncated__,..: 1544 1676 3568 3213 1820 1189 1637 1203 2226 2266 ...

For the goal of this analysis, the columns of interest are (for each crash): year, number of people aboard, number of fatalities.

library(dplyr)
library(anytime)

#graphical options
source("setPowerPointStyle.R")
setPowerPointStyle()

#get variables of interest
my_air_crashes=select(air_crashes,Date,Aboard,Fatalities)

#pull out year
my_air_crashes$Date=as.numeric(format(anytime(my_air_crashes$Date),"%Y"))

my_air_crashes=na.omit(my_air_crashes)

head(my_air_crashes)
##   Date Aboard Fatalities
## 1 1908      2          1
## 2 1912      5          5
## 3 1913      1          1
## 4 1913     20         14
## 5 1913     30         30
## 6 1915     41         21

The total number of accidents has decreased since 1990

source("setPowerPointStyle.R")
setPowerPointStyle()

my_air_crashes$survivors=100*(my_air_crashes$Aboard-my_air_crashes$Fatalities)/my_air_crashes$Aboard

#extracting data since 1990 (using the full data leads to the same conclusion...)

my_air_crashes_1990=filter(my_air_crashes,Date>=1990)

plot(names(table(my_air_crashes_1990$Date)),
     as.integer(table(my_air_crashes_1990$Date)),
     xlab='year',ylab='number of crashes',type='l')


The boxplot suggests that no substantial increase in the percentage of survivors over the last 30 years. Let’s have a closer look at the average percentages over the last 30 years.

The percentage of survivors has remained stable since 1990

source("setPowerPointStyle.R")
setPowerPointStyle()

survivors_by_year_low=tapply(my_air_crashes$survivors,my_air_crashes$Date,
                             function(x) mean(x)-sd(x)/sqrt(length(x)))

survivors_by_year=tapply(my_air_crashes$survivors,my_air_crashes$Date,mean)

survivors_by_year_high=tapply(my_air_crashes$survivors,my_air_crashes$Date,
                             function(x) mean(x)+sd(x)/sqrt(length(x)))


plot(as.numeric(names(survivors_by_year_high)),survivors_by_year_high,
     xlim=c(1990,2008),ylim=c(0,35),xlab='year',ylab='% of survivors',cex=2,type='l')

lines(as.numeric(names(survivors_by_year_low)),survivors_by_year_low,
      xlim=c(1990,2008),ylim=c(0,35),xlab='year',ylab='mean % of survivors',cex=2)

polygon(c(as.numeric(names(survivors_by_year_high)),
          rev(as.numeric(names(survivors_by_year_high)))), 
        c(survivors_by_year_high, rev(survivors_by_year_low)),
        col = "skyblue", border = NA)

lines(as.numeric(names(survivors_by_year)),survivors_by_year,type='l',
      xlim=c(1990,2008),ylim=c(0,35),xlab='year',ylab='% of survivors',cex=2,col='red')

The above figure shows the average percentage of survivors (red line), plus or minus the standard error of the mean (blue area). It appears that there was not substantial improvement.