Kaggle data on the history of airplane crashes reveals a decrease in the total number of crashes over the last 30 years. However, the percentage of passangers surviving a crash has remained stable since 1990, amounting to approximately 20%. Unfortunately, we are not more likley to survive a crash than we were 30 years ago.
#loading data
air_crashes=read.csv("input/Airplane_Crashes_and_Fatalities_Since_1908.csv")
str(air_crashes)
## 'data.frame': 5268 obs. of 13 variables:
## $ Date : Factor w/ 4753 levels "01/01/1966","01/01/1970",..: 3297 2372 2699 3184 3682 852 3099 2585 3387 3469 ...
## $ Time : Factor w/ 1006 levels "","00:00","00:01",..: 702 200 1 758 385 36 609 1 36 986 ...
## $ Location : Factor w/ 4304 levels "","1,200 miles off Dakar, AtlantiOcean",..: 848 123 4179 3388 2294 4059 3117 2288 284 3550 ...
## $ Operator : Factor w/ 2477 levels "","A B Aerotransport",..: 1567 1578 1825 1466 1466 1466 1466 1465 1466 1466 ...
## $ Flight.. : Factor w/ 725 levels "","-","002","004",..: 1 1 2 1 1 1 1 1 1 1 ...
## $ Route : Factor w/ 3245 levels ""," - Tegucigalpa - Toncontin",..: 831 2982 1 1 1 1 1 1 1 1 ...
## $ Type : Factor w/ 2447 levels "","Aérospatiale SE-210 Caravelle",..: 2419 1145 1032 2433 2435 2446 2434 2148 2439 2438 ...
## $ Registration: Factor w/ 4906 levels ""," / ","01 / 02 / 03",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ cn.In : Factor w/ 3708 levels ""," / 185-5547",..: 171 1 1 1 1 1 1 1 1 1 ...
## $ Aboard : int 2 5 1 20 30 41 19 20 22 19 ...
## $ Fatalities : int 1 5 1 14 30 21 19 20 22 19 ...
## $ Ground : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Summary : Factor w/ 4674 levels "","31-7952246The pilot reported that he had a rough running engine and was making an emergency landing at Charlo Airport. While ma"| __truncated__,..: 1544 1676 3568 3213 1820 1189 1637 1203 2226 2266 ...
For the goal of this analysis, the columns of interest are (for each crash): year, number of people aboard, number of fatalities.
library(dplyr)
library(anytime)
#graphical options
source("setPowerPointStyle.R")
setPowerPointStyle()
#get variables of interest
my_air_crashes=select(air_crashes,Date,Aboard,Fatalities)
#pull out year
my_air_crashes$Date=as.numeric(format(anytime(my_air_crashes$Date),"%Y"))
my_air_crashes=na.omit(my_air_crashes)
head(my_air_crashes)
## Date Aboard Fatalities
## 1 1908 2 1
## 2 1912 5 5
## 3 1913 1 1
## 4 1913 20 14
## 5 1913 30 30
## 6 1915 41 21
source("setPowerPointStyle.R")
setPowerPointStyle()
my_air_crashes$survivors=100*(my_air_crashes$Aboard-my_air_crashes$Fatalities)/my_air_crashes$Aboard
#extracting data since 1990 (using the full data leads to the same conclusion...)
my_air_crashes_1990=filter(my_air_crashes,Date>=1990)
plot(names(table(my_air_crashes_1990$Date)),
as.integer(table(my_air_crashes_1990$Date)),
xlab='year',ylab='number of crashes',type='l')
The boxplot suggests that no substantial increase in the percentage of survivors over the last 30 years. Let’s have a closer look at the average percentages over the last 30 years.
source("setPowerPointStyle.R")
setPowerPointStyle()
survivors_by_year_low=tapply(my_air_crashes$survivors,my_air_crashes$Date,
function(x) mean(x)-sd(x)/sqrt(length(x)))
survivors_by_year=tapply(my_air_crashes$survivors,my_air_crashes$Date,mean)
survivors_by_year_high=tapply(my_air_crashes$survivors,my_air_crashes$Date,
function(x) mean(x)+sd(x)/sqrt(length(x)))
plot(as.numeric(names(survivors_by_year_high)),survivors_by_year_high,
xlim=c(1990,2008),ylim=c(0,35),xlab='year',ylab='% of survivors',cex=2,type='l')
lines(as.numeric(names(survivors_by_year_low)),survivors_by_year_low,
xlim=c(1990,2008),ylim=c(0,35),xlab='year',ylab='mean % of survivors',cex=2)
polygon(c(as.numeric(names(survivors_by_year_high)),
rev(as.numeric(names(survivors_by_year_high)))),
c(survivors_by_year_high, rev(survivors_by_year_low)),
col = "skyblue", border = NA)
lines(as.numeric(names(survivors_by_year)),survivors_by_year,type='l',
xlim=c(1990,2008),ylim=c(0,35),xlab='year',ylab='% of survivors',cex=2,col='red')
The above figure shows the average percentage of survivors (red line), plus or minus the standard error of the mean (blue area). It appears that there was not substantial improvement.