Introduction

In this project we work with a dataset of historical plane crashes, which records aerial accidents that occurred between 1908 and 2018. For most accidents the dataset gives the date, time, location, operator and type of aircraft, as well as the flight number and registration, the route, the number of passengers aboard and the number of fatalities. It also includes a brief summary of the cause of each accident.

To start with, our first objective is to convert each variable to an appropriate data type, because on import the passengers and fatalities are not read as integers, and the other fields are not read with the type they ought to have.

After that, our aim is to answer some questions that we have asked ourselves for the project. They are divided into three parts:

Part 1. Exploratory Data Analysis

Question 1. What is the number of crashes every month? What is the evolution of crashes throughout the years?

Question 2. Which are the operators of our dataset with more crashes?

Question 3. Can we find the number of fatalities of every operator?

Question 4. Analysis of the relation between the fatalities and the number of passengers

Question 5. Can we compute the total number of fatalities that happened every month? What is its evolution throughout the years?

Part 2. Probability

Question 1. Can we compute a table where all the accidents, survivors and fatalities appear classified by month and year? This will enable us to work with almost any probability that we come across.

Question 2. Given that it is December, what is the probability of more than 50 fatalities in an accident?

Question 3. Given that the operator of the flight is “Aeroflot”, what is the probability of having used a specific type of aircraft? Can we compute the same for the “Military U.S. Air Force”?

Part 3. Random Variable. Simulations.

Question 1. Can we simulate the number of accidents of every month with a Poisson distribution? Can we determine the accuracy of this approximation?

Question 2. Can we simulate the number of accidents of every month in every year with this same distribution, and determine the accuracy of the approximation too?

Question 3. Can we simulate the total number of monthly fatalities? Can its accuracy be determined?

Bearing these questions in mind, we can now start our project. Because some of the tables are very large, we will only show a few rows of each of them (its head or its tail).

We first load the libraries that we will need for the project:

library(ggplot2)
library(tidyr)
library(dplyr)
library(lubridate)
library(leaflet)
library(kableExtra)

Preparation of our data

Read the information extracted from the internet

Df1=read.csv("data/planeCrash.csv",header = TRUE,encoding = "UTF8")

Split the dates into different columns (easier to use):

Df2=data.frame(MDY=as.character(Df1$date))
Df3=separate(Df2, MDY, c("MD", "Y"), sep=",")
Df4=separate(Df3, MD, c("M", "D"), sep=" ")

Create a second month column (with numbers instead of names) and define the order of the month names

nM1=Df4$M

month_order=c("January","February","March","April","May","June","July","August","September","October","November","December")

# match() returns the position of each month name in month_order (NA if it is not found)
nM=match(nM1, month_order)

Convert the fatalities and passengers to numbers

Df5=data.frame(passangers=Df1$aboard)
Df6=separate(Df5, passangers, c("passangers", "other"), sep=" Â")

Df7=data.frame(fatalities=Df1$fatalities)
Df8=separate(Df7, fatalities, c("fatalities", "other"), sep=" Â")
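
The conversion above depends on the exact separator characters that appear in the raw file. As an alternative, a minimal sketch that keeps only the leading run of digits could look as follows. The helper name `parse_count` and the example strings are hypothetical; the real `aboard` and `fatalities` fields may be laid out differently.

```r
# Hypothetical helper: extract the leading integer from strings such as
# "2 (passengers:1 crew:1)". Entries without leading digits become NA.
parse_count <- function(x) {
  digits <- sub("^[[:space:]]*([0-9]+).*$", "\\1", x)
  # Strings with no leading digits are returned unchanged by sub(),
  # so as.integer() turns them into NA (the warning is suppressed on purpose).
  suppressWarnings(as.integer(digits))
}

# Made-up inputs, only to illustrate the behaviour
example_counts <- parse_count(c("2 (passengers:1 crew:1)", "115", "?"))
print(example_counts)
```

With a helper of this kind, the passengers and fatalities columns could be converted in one step each, without the intermediate data frames.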

Create the data.frame that we will use in the rest of the project

PC=data.frame(year=as.numeric(Df4$Y), month=Df4$M, Nmonth=as.numeric(nM), day=as.numeric(Df4$D), passangers=as.numeric(Df6$passangers), fatalities=as.numeric(Df8$fatalities), operator=as.character(Df1$operator), aircraft=as.character(Df1$ac_type), description=Df1$summary) 

Part 1

Exercise 1.1: Table with all the crashes classified by month and year

E11=PC %>% group_by(year, Nmonth) %>% 
  summarise(accidents=n()) %>%
  spread(key=Nmonth, value=accidents)
tail(E11,10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "yellow")
year 1 2 3 4 5 6 7 8 9 10 11 12
2009 4 7 4 6 3 5 4 5 3 3 7 1
2010 5 2 NA 4 5 3 4 10 4 5 4 2
2011 2 4 1 3 3 1 6 6 9 4 4 3
2012 2 NA 2 3 2 5 NA 2 2 2 3 6
2013 3 1 4 2 1 1 2 2 NA 5 9 3
2014 2 4 3 3 2 1 5 5 2 1 NA 5
2015 1 1 2 1 NA 2 2 2 2 4 3 2
2016 1 3 4 3 1 NA 2 2 NA 1 2 5
2017 1 1 2 NA 3 2 1 NA NA 1 1 2
2018 1 3 4 2 2 1 1 1 1 1 1 NA

In this table we can easily see all the accidents of our dataset classified by month and year of occurrence. This table is useful not only for seeing the trend or evolution of accidents (which we will compute later), but also for getting a view of a significant part of the dataset with which we will work throughout the project. Note that NA means that no accidents were recorded in that year and month.
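
If zeros are preferred to NA for the months without accidents, one option is to count with a factor that has all twelve month levels. The following is a minimal base R sketch on a made-up data frame `toy` (not the real data); all names in it are hypothetical.

```r
# Made-up accidents: two in January 2009, one in February 2010, two in April 2010
toy <- data.frame(year = c(2009, 2009, 2010, 2010, 2010),
                  Nmonth = c(1, 1, 2, 4, 4))

# A factor with levels 1..12 makes table() keep the months with no accidents
month_factor <- factor(toy$Nmonth, levels = 1:12)
counts <- table(year = toy$year, month = month_factor)
print(counts)
```

Months without any accident then appear as 0 instead of NA, which can be more convenient for sums and averages.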

Exercise 1.2: What is the number of crashes that happened in every month? How do crashes evolve over the years?

E12M=PC %>% group_by(Nmonth) %>% summarise(accidents=n())
E12M %>% kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive")) %>%
  column_spec(1,bold=T,border_right = T,background = "yellow")
Nmonth accidents
1 536
2 432
3 491
4 415
5 402
6 419
7 475
8 525
9 508
10 500
11 508
12 572
TotalAM=sum(E12M$accidents)
meanAM=mean(E12M$accidents)
devAM=sd(E12M$accidents)

ggplot(PC,aes(x=Nmonth))+geom_histogram(binwidth = 0.5)+geom_hline(yintercept=meanAM)+geom_hline(yintercept=meanAM+devAM, linetype="dashed")+geom_hline(yintercept=meanAM-devAM, linetype="dashed")

E12Y=PC %>% group_by(year) %>% summarise(accidents=n())
head(E12Y,10) %>% kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "yellow")
year accidents
1908 1
1909 1
1912 1
1913 3
1915 2
1916 5
1917 7
1918 4
1919 8
1920 18
TotalAY=sum(E12Y$accidents)
meanAY=mean(E12Y$accidents)
devAY=sd(E12Y$accidents)


ggplot(PC,aes(x=year))+geom_bar()+geom_hline(yintercept=meanAY)+geom_hline(yintercept=meanAY+devAY, linetype="dashed")+geom_hline(yintercept=meanAY-devAY, linetype="dashed")

meanAM2=meanAM/111 #We have data from 111 years (1908-2018)
meanAY2=meanAY/12 

meanTotal=TotalAY/(111*12)

print(TotalAM) #The sum of nº of accidents of all the years = sum of all accidents of all months
## [1] 5783
print(meanTotal)
## [1] 4.341592
print(meanAM2)
## [1] 4.341592
print(meanAY2) #This mean is slightly bigger because the mean of accidents/year does not count the 3 years (1910, 1911 and 1914) in which we do not have any accident
## [1] 4.462191
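
The difference between the two means can be reproduced by hand from the totals printed above. This is only an illustrative sketch: the variable names are ours, and the count of 108 years with records is deduced from the three missing years mentioned in the comment.

```r
total_accidents <- 5783      # TotalAM printed above
years_in_range <- 111        # 1908 to 2018, both included
years_with_records <- 108    # 1910, 1911 and 1914 have no recorded accident

# Mean per month and year over the whole range (as meanTotal and meanAM2)
mean_all_years <- total_accidents / (years_in_range * 12)

# Mean per month and year over the years that appear in the data (as meanAY2)
mean_recorded_years <- total_accidents / (years_with_records * 12)

print(mean_all_years)
print(mean_recorded_years)
```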

ACCIDENTS per MONTH

As a conclusion we may say that most accidents apparently occur during December and January, maybe because the frequency of flights increases during these months.

Moreover, from August until the end of the year the number of accidents reaches or even surpasses 500 every month, which is also noticeable. This could be understandable in August, as in summer most of the population tends to travel, which could also increase the frequency of flights.

All in all, in the light of these results we may say that winter appears to be the most dangerous season of the year to travel, followed by August, in which the number of accidents is not low either. By the same measure, spring would be the safest season of the year to travel.

ACCIDENTS per YEAR

In this other graphic we can confirm what we expected.

The number of accidents in the first years of the dataset is low. This may be due not only to the fact that the frequency of flights was very low so many years ago, but also to the fact that not all the accidents may have been recorded.

In addition, in the last years the number of accidents has decreased noticeably, possibly as an effect of the incorporation of advanced technology in flights and in the aerospace sector in general.

Exercise 1.3: What is the number of crashes of every company? Which are the 5 operators with the most crashes?

E13=PC %>% group_by(operator) %>% summarise(accidents=n())%>%
  arrange(-accidents)

head(E13,10) %>% kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "yellow")
operator accidents
Aeroflot 260
Military - U.S. Air Force 177
Air France 72
Deutsche Lufthansa 64
United Air Lines 44
China National Aviation Corporation 43
Military - U.S. Army Air Forces 43
Pan American World Airways 41
American Airlines 37
Military - Royal Air Force 36
ggplot(E13[c(1:5),],aes(x=reorder(operator, -accidents), y=accidents))+geom_bar(stat="identity")

ggplot(E13[c(1:50),],aes(x=reorder(operator, -accidents), y=accidents))+geom_bar(stat="identity")

ggplot(E13 ,aes(x=reorder(operator, -accidents), y=accidents))+geom_bar(stat="identity")

Aeroflot, the main national operator in Russia, is the company with the highest number of accidents in the record, followed by the Military - U.S. Air Force. The third operator is Air France, followed by Deutsche Lufthansa, and in fifth position we find United Air Lines, a North American operator.

The second graph shows the 50 operators with the most accidents.

The third graph shows all the companies in the database. As we can see, the majority of these companies have 1 or 2 accidents, which may be true or may be because the database does not contain enough information about them.

The first (top 5) and the second (top 50) graphs are the most useful for a study, whereas the third one (all the companies) is not precise enough to draw conclusions.

Exercise 1.4: Nº of fatalities (with mean and standard deviation) of every operator

E14=PC %>% group_by(operator) %>% summarise(total_fatalities=sum(fatalities),mean_fatalities=mean(fatalities), deviance_fatalities=sd(fatalities)) %>% arrange(-1*total_fatalities)

head(E14,10) %>% kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "red")
operator total_fatalities mean_fatalities deviance_fatalities
Aeroflot 9048 34.80000 34.306117
Military - U.S. Air Force 3718 21.00565 23.008027
American Airlines 1422 38.43243 64.417708
Pan American World Airways 1303 31.78049 49.358136
Military - U.S. Army Air Forces 1070 24.88372 9.781443
United Air Lines 1019 23.15909 23.634763
AVIANCA 941 39.20833 45.066115
Turkish Airlines (THY) 891 63.64286 90.620601
Indian Airlines 861 25.32353 30.985550
China Airlines (Taiwan) 847 60.50000 92.945600
ggplot(PC,aes(as.factor(operator), fatalities))+geom_boxplot()

These are the box plots of the fatalities of all operators. Due to the large number of operators in our dataset, the different box plots do not give us relevant information, as we can see.
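
One way to make these box plots readable is to keep only the operators with the most accidents before plotting. The following base R sketch shows the idea on a made-up data frame; `toy`, `top_n_operators` and the operator names are hypothetical.

```r
# Made-up data: operator A has 3 accidents, B has 2, C and D have 1 each
toy <- data.frame(
  operator = c("A", "A", "A", "B", "B", "C", "D"),
  fatalities = c(10, 20, 30, 5, 15, 7, 2)
)

top_n_operators <- 2  # hypothetical cut-off; the report uses 5 in exercise 1.3

# Count the accidents per operator and keep the names of the most frequent ones
operator_counts <- sort(table(toy$operator), decreasing = TRUE)
top_operators <- names(operator_counts)[seq_len(top_n_operators)]

# Keep only the rows that belong to those operators
toy_top <- toy[toy$operator %in% top_operators, ]
print(toy_top)
```

The filtered data frame could then be passed to the same ggplot call, so that only a handful of boxes are drawn.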

Exercise 1.5: Nº of fatalities (with mean and standard deviation) of every type of aircraft

E15=PC %>% group_by(aircraft) %>% summarise(total_fatalities=sum(fatalities),mean_fatalities=mean(fatalities), deviance_fatalities=sd(fatalities)) %>% arrange(-1*total_fatalities)

head(E15,10) %>% kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "red")
aircraft total_fatalities mean_fatalities deviance_fatalities
Douglas DC-3 4793 14.055718 9.668314
Douglas DC-6B 1054 37.642857 29.581293
Antonov AN-26 1042 28.944444 17.456795
Ilyushin IL-18B 1008 67.200000 29.854887
McDonnell Douglas DC-9-32 951 50.052632 37.171485
Douglas DC-4 937 22.853659 23.367243
de Havilland Canada DHC-6 Twin Otter 300 848 9.860465 6.832729
Yakovlev YAK-40 828 22.378378 17.007109
Tupolev TU-134A 808 47.529412 28.670799
McDonnell Douglas DC-10-10 804 134.000000 143.634258
ggplot(PC,aes(as.factor(aircraft), fatalities))+geom_boxplot()

In this exercise we have come to the same conclusion as in the previous one. Due to the large number of different types of aircraft in the dataset, the different box plots of the fatalities do not give us relevant information, as we can see in the graph above.

Exercise 1.6: Relation between fatalities and nº of passengers (table and graphic representation):

E16T=PC %>% group_by(FxP=fatalities/passangers) %>% summarise(cases=n()) %>% arrange(-1*(FxP))

head(E16T,10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "red")
FxP cases
1.0000000 3810
0.9939024 1
0.9935484 1
0.9934641 1
0.9925926 1
0.9923664 1
0.9911504 1
0.9909910 1
0.9903846 1
0.9902913 1
E16R=PC %>% group_by(FxP=round(fatalities/passangers, 2)) %>% summarise(cases=n()) %>% arrange(-1*(FxP))

head(E16R,10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "red")
FxP cases
1.00 3810
0.99 16
0.98 21
0.97 24
0.96 33
0.95 32
0.94 35
0.93 31
0.92 22
0.91 25
ggplot(PC, aes(x=passangers, y=fatalities))+geom_point()+geom_abline(slope=1, linetype='dashed')

In this exercise we have studied the relation between the fatalities and the number of passengers in each of the accidents. Each of the points that we can see in this graph stands for an accident.

The diagonal line (x=y) represents the accidents where the number of fatalities is the same as the number of passengers (which is the same as saying that everyone on the flight died). Obviously, all of the points are on or below this line.
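
As a quick check of this reading of the graph, the ratio of fatalities to passengers can be computed directly. This is a minimal sketch on made-up values; the data frame `toy` is hypothetical and only mirrors the column names of PC.

```r
# Made-up accidents: everyone dies in the first, some in the second, nobody in the third
toy <- data.frame(passangers = c(10, 50, 3),
                  fatalities = c(10, 12, 0))
toy$FxP <- toy$fatalities / toy$passangers

# A ratio of 1 puts the point on the dashed line; smaller ratios fall below it
on_line <- sum(toy$FxP == 1)
below_line <- sum(toy$FxP < 1)
print(c(on_line = on_line, below_line = below_line))
```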

Exercise 1.7: What is the number of fatalities that happened in every month? How do fatalities evolve over the years?

E17M=PC %>% group_by(Nmonth) %>% summarise(fatalities=sum(fatalities, na.rm=TRUE))
E17M %>% kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive")) %>%
  column_spec(1,bold=T,border_right = T,background = "red")
Nmonth fatalities
1 9146
2 8672
3 9560
4 7739
5 8052
6 8714
7 10650
8 10806
9 10899
10 8864
11 10702
12 11318
TotalFM=sum(E17M$fatalities)
meanFM=mean(E17M$fatalities)
devFM=sd(E17M$fatalities)

ggplot(E17M,aes(x=Nmonth, y=fatalities))+geom_bar(stat="identity")+geom_hline(yintercept=meanFM)+geom_hline(yintercept=meanFM+devFM, linetype="dashed")+geom_hline(yintercept=meanFM-devFM, linetype="dashed")

E17Y=PC %>% group_by(year) %>% summarise(fatalities=sum(fatalities, na.rm=TRUE))
head(E17Y,10) %>% kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "red")
year fatalities
1908 1
1909 1
1912 5
1913 45
1915 40
1916 108
1917 138
1918 65
1919 20
1920 25
TotalFY=sum(E17Y$fatalities)
meanFY=mean(E17Y$fatalities)
devFY=sd(E17Y$fatalities)


ggplot(E17Y,aes(x=year, y=fatalities))+geom_bar(stat="identity")+geom_hline(yintercept=meanFY)+geom_hline(yintercept=meanFY+devFY, linetype="dashed")+geom_hline(yintercept=meanFY-devFY, linetype="dashed")

meanFM2=meanFM/111 #We have data from 111 years (1908-2018)
meanFY2=meanFY/12 

meanTotalF=TotalFY/(111*12)

print(TotalFM) #The sum of nº of fatalities of all the years = sum of all fatalities of all months
## [1] 115122
print(meanTotalF)
## [1] 86.42793
print(meanFM2)
## [1] 86.42793
print(meanFY2) #This mean is slightly bigger because the mean of fatalities/year does not count the 3 years (1910, 1911 and 1914) in which we do not have any accident
## [1] 88.8287

In these graphics we have studied the total number of fatalities per month and per year.

MONTHLY

As could be expected, considering that December is the month with the most accidents, the month with the most fatalities is also December. Surprisingly, from July to September the number of fatalities is also very high relative to the number of accidents in these months, which is not as high as in December. In addition, since we have seen that the number of accidents in January is high, we could expect its number of fatalities to be similar to December’s. Instead, it is noticeably lower. Therefore, accidents occurring in the second half of the year appear to cause more fatalities than those at the beginning of the year. As for January and December, although many accidents occur in both months, it seems safer to travel in January than in December.

THROUGHOUT THE YEARS

In the second graphic we find a distribution similar to the one in the graphic of accidents per year computed before. The reasons may be the same: possible missing information in the first years of the dataset, and the positive effects of new technologies in the last years, where the fatalities decrease significantly.

Part 2

Throughout this part, all the probabilities are conditional on an accident having occurred, that is, P(accident)=1.

Exercise 2.1: All the accidents, fatalities and survivors by month and year

E21A=PC %>% group_by(year, Nmonth) %>% 
  summarise(Accidents=n()) %>%
  spread(key=Nmonth, value=Accidents)
tail(E21A,10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "yellow")
year 1 2 3 4 5 6 7 8 9 10 11 12
2009 4 7 4 6 3 5 4 5 3 3 7 1
2010 5 2 NA 4 5 3 4 10 4 5 4 2
2011 2 4 1 3 3 1 6 6 9 4 4 3
2012 2 NA 2 3 2 5 NA 2 2 2 3 6
2013 3 1 4 2 1 1 2 2 NA 5 9 3
2014 2 4 3 3 2 1 5 5 2 1 NA 5
2015 1 1 2 1 NA 2 2 2 2 4 3 2
2016 1 3 4 3 1 NA 2 2 NA 1 2 5
2017 1 1 2 NA 3 2 1 NA NA 1 1 2
2018 1 3 4 2 2 1 1 1 1 1 1 NA
E21F=PC %>% group_by(year, Nmonth) %>% 
  summarise(Fatalities=sum(fatalities)) %>%
  spread(key=Nmonth, value=Fatalities)
tail(E21F,10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "red")
year 1 2 3 4 5 6 7 8 9 10 11 12
2009 23 107 44 65 120 397 226 46 14 11 36 6
2010 104 18 NA 114 318 26 173 117 38 51 104 24
2011 80 35 9 57 56 47 191 75 146 39 17 12
2012 7 NA 14 173 60 187 NA 35 29 18 18 55
2013 30 5 28 7 4 20 13 6 NA 92 121 15
2014 4 109 246 18 22 49 484 55 14 2 NA 186
2015 37 40 160 7 NA 131 12 55 10 249 63 10
2016 2 26 94 30 66 NA 45 5 NA 5 75 171
2017 4 5 13 NA 6 125 16 NA NA 4 11 13
2018 12 140 100 258 121 10 1 20 1 189 1 NA
E21S=PC %>% group_by(year, Nmonth) %>% 
  summarise(Survivors=sum(passangers)-sum(fatalities)) %>%
  spread(key=Nmonth, value=Survivors)
tail(E21S,10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "green")
year 1 2 3 4 5 6 7 8 9 10 11 12
2009 156 129 1 10 14 1 142 71 5 9 23 1
2010 8 0 NA 11 8 0 2 190 37 0 3 167
2011 149 6 0 6 0 5 68 7 15 17 3 0
2012 3 NA 0 10 6 4 NA 1 4 8 0 72
2013 0 40 3 108 0 0 304 2 NA 18 33 8
2014 5 4 0 0 1 0 11 11 5 2 NA 10
2015 0 18 0 0 NA 0 4 5 4 0 13 4
2016 0 80 1 0 0 NA 0 300 NA 0 6 1
2017 0 0 0 NA 1 0 0 NA NA 6 0 24
2018 0 4 21 148 1 0 18 0 46 0 127 NA

These tables, containing all the accidents, fatalities and survivors classified by month and year, are very useful for computing probabilities. Marginal probabilities are quickly deduced from them, and they are also a useful tool in the calculation of conditional probabilities.

Exercise 2.2: Given that it’s December, probability of more than 50 fatalities

AccidentsDecember <- E12M$accidents[E12M$Nmonth==12]
totalAccidents <- TotalAM
pDecember <- AccidentsDecember/totalAccidents

filtrarJoinedProb <- PC %>% filter(fatalities>50,Nmonth==12)
countJoinedProb <- count(filtrarJoinedProb) #number of accidents in December with more than 50 fatalities

JoinedProb <- countJoinedProb/totalAccidents

conditionalProb <- JoinedProb/pDecember

print(as.numeric(conditionalProb))
## [1] 0.0979021
percentage=(as.numeric(conditionalProb))*100
cat("\nThis conditional probability equals the", percentage,"%")
## 
## This conditional probability equals the 9.79021 %
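
The computation above follows the definition P(A|B) = P(A and B)/P(B). Since both probabilities share the same denominator, the result equals the share of December accidents with more than 50 fatalities. The sketch below checks this equivalence on made-up data without missing values; the data frame `toy` and its numbers are hypothetical.

```r
# Made-up accidents: four in December, two in January
toy <- data.frame(Nmonth = c(12, 12, 12, 12, 1, 1),
                  fatalities = c(60, 10, 5, 80, 70, 3))

# Definition: joint probability divided by the probability of the condition
p_december <- mean(toy$Nmonth == 12)
p_joint <- mean(toy$Nmonth == 12 & toy$fatalities > 50)
p_conditional <- p_joint / p_december

# Shortcut: proportion of December accidents with more than 50 fatalities
p_direct <- mean(toy$fatalities[toy$Nmonth == 12] > 50)

print(c(definition = p_conditional, shortcut = p_direct))
```

On the real data the shortcut would need some care with missing fatalities, which the filter used above drops silently.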

Exercise 2.3: Given that the operator is “Aeroflot”, probability of having used a specific aircraft

#AEROFLOT
E231 <- PC%>%filter(operator=='Aeroflot')%>%group_by(aircraft)%>%summarise(count_Aeroflot=n())%>%arrange(-count_Aeroflot)
head(E231,10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "orange")
aircraft count_Aeroflot
Yakovlev YAK-40 19
Antonov AN-24 13
Ilyushin IL-12 13
Ilyushin IL-14P 11
Tupolev TU-104B 10
Tupolev TU-134A 10
Ilyushin IL-18B 9
Li-2 8
Tupolev TU-124 8
Antonov An-24B 6
#MILITARY - U.S AIRFORCE
E232 <- PC%>%filter(operator=='Military - U.S. Air Force')%>%group_by(aircraft)%>%summarise(count_Military_US_AirForce=n())%>%arrange(-count_Military_US_AirForce)
head(E232,10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "orange")
aircraft count_Military_US_AirForce
Boeing KC-135A 15
Lockheed C-130E Hercules 12
Lockheed C-130H Hercules 6
Lockheed C-130A Hercules 5
Douglas C-47D 4
Fairchild C-123K 4
Lockheed AC-130A Hercules 4
Lockheed C-130H 4
Boeing B-29 3
Douglas C-124A Globemaster 3
#AIRFRANCE
E233 <- PC%>%filter(operator=='Air France')%>%group_by(aircraft)%>%summarise(count_AirFrance=n())%>%arrange(-count_AirFrance)
head(E233,10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "orange")
aircraft count_AirFrance
Dewoitine D-338 4
Douglas DC-3 4
Boeing B-707-328 2
Douglas DC-3D 2
Douglas DC-4 2
Douglas DC-4-1009 2
Junkers JU-52/3m 2
Potez 621 2
Aerospatiale BAe Concorde 101 1
Airbus A-340 1
#DEUTSCHE LUFTHANSA
E234 <- PC%>%filter(operator=='Deutsche Lufthansa')%>%group_by(aircraft)%>%summarise(count_DeutscheLufthansa=n())%>%arrange(-count_DeutscheLufthansa)
head(E234,10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "orange")
aircraft count_DeutscheLufthansa
Junkers JU-52/3m 16
Junkers F-13 7
Dornier Merkur 3
Focke-Wulf FW 200 3
Fokker FG III 3
Junkers JU-52 3
Douglas DC-3 2
Heinkel He-70 2
AEGK 1
Arado V1 1
#UNITED AIR LINES
E235 <- PC%>%filter(operator=='United Air Lines')%>%group_by(aircraft)%>%summarise(count_UnitedAirLines=n())%>%arrange(-count_UnitedAirLines)
head(E235,10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "orange")
aircraft count_UnitedAirLines
Douglas DC-3 3
Douglas DC-3A 3
Douglas DC-6 3
Douglas DC-6B 3
Boeing 247 2
Boeing B-727-22 2
Boeing B-747-122 2
Douglas DC-4 2
Douglas DST-A-207A 2
Vickers Viscount 745D 2

Each of these tables shows, for one of the five top operators (the five with the most crashes), the different aircraft types involved in its accidents and the number of accidents of each type.

Exercise 2.3.1. Given that we have had an accident with Aeroflot, what is the probability that the aircraft was a Yakovlev YAK-40? And a Douglas C-47?

total_Aeroflot <- sum(E231$count_Aeroflot)
Conditional_Yakovlev<-  (E231%>%filter(aircraft=="Yakovlev YAK-40")%>%select(count_Aeroflot))/total_Aeroflot
Conditional_Yakovlev <- as.numeric(Conditional_Yakovlev)
cat("Probability that in these conditions the aircraft is a Yakovlev:",Conditional_Yakovlev,",which is a",round((Conditional_Yakovlev*100),2),"%")
## Probability that in these conditions the aircraft is a Yakovlev: 0.07307692 ,which is a 7.31 %
Conditional_Douglas<-  (E231%>%filter(aircraft=="Douglas C-47")%>%select(count_Aeroflot))/total_Aeroflot 
Conditional_Douglas <- as.numeric(Conditional_Douglas)
cat("\nProbability that in these conditions the aircraft is a Douglas C-47:",Conditional_Douglas,",which is a",round((Conditional_Douglas*100),2),"%")
## 
## Probability that in these conditions the aircraft is a Douglas C-47: 0.007692308 ,which is a 0.77 %

Exercise 2.3.2. Given that we have had an accident with Military - U.S. Air Force, what is the probability that the aircraft was a Boeing KC-135A? And a Fairchild C-119C?

total_Military <- sum(E232$count_Military_US_AirForce)

Conditional_Military<-  (E232%>%filter(aircraft=="Boeing KC-135A")%>%select(count_Military_US_AirForce))/total_Military
Conditional_Military <- as.numeric(Conditional_Military)
cat("Probability that in these conditions the aircraft is a Boeing KC-135A:",Conditional_Military,",which is a",round((Conditional_Military*100),2),"%")
## Probability that in these conditions the aircraft is a Boeing KC-135A: 0.08474576 ,which is a 8.47 %
Conditional_Fairchild <-  (E232%>%filter(aircraft=="Fairchild C-119C")%>%select(count_Military_US_AirForce))/total_Military 
Conditional_Fairchild <- as.numeric(Conditional_Fairchild)
cat("\nProbability that in these conditions the aircraft is a Fairchild C-119C:",Conditional_Fairchild,",which is a",round((Conditional_Fairchild*100),2),"%")
## 
## Probability that in these conditions the aircraft is a Fairchild C-119C: 0.01694915 ,which is a 1.69 %
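
The same conditional probabilities can be obtained for every aircraft type at once by normalising the counts. This is a minimal base R sketch, using a made-up vector `toy_aircraft` instead of the real rows of an operator; the variable names are hypothetical.

```r
# Made-up aircraft types for the accidents of a single operator
toy_aircraft <- c("Yakovlev YAK-40", "Yakovlev YAK-40",
                  "Antonov AN-24", "Li-2")

# table() counts each type; prop.table() divides every count by the total
aircraft_probs <- prop.table(table(toy_aircraft))
print(aircraft_probs)

# Probability of one specific type, given the operator
p_yak <- as.numeric(aircraft_probs["Yakovlev YAK-40"])
print(p_yak)
```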

Part 3

Exercise 3.1: Simulate (with a Poisson distribution) the number of accidents of every month (sum of 111 years). How accurate is it?

SimAccidents1=rpois(12, meanAM)
E31=data.frame(Month=1:12,SimAccidents1,RealAccidents=E12M$accidents, Error=abs(SimAccidents1-E12M$accidents), RelError=abs(SimAccidents1-E12M$accidents)/E12M$accidents) 

E31 %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "yellow")
Month SimAccidents1 RealAccidents Error RelError
1 502 536 34 0.0634328
2 460 432 28 0.0648148
3 501 491 10 0.0203666
4 446 415 31 0.0746988
5 485 402 83 0.2064677
6 451 419 32 0.0763723
7 523 475 48 0.1010526
8 457 525 68 0.1295238
9 437 508 71 0.1397638
10 486 500 14 0.0280000
11 478 508 30 0.0590551
12 446 572 126 0.2202797
RelErrorSA=mean(E31$RelError)
print(RelErrorSA)
## [1] 0.09865234

It must be taken into account that this is a simulation, so the average of the error may vary every time that we simulate it.
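
If a reproducible table is wanted, the random number generator can be seeded before simulating. The sketch below is only an illustration: the seed value is arbitrary and the rate is recomputed from the total printed in exercise 1.2 rather than taken from meanAM.

```r
lambda <- 5783 / 12   # mean number of accidents per calendar month (as meanAM)

set.seed(2018)        # arbitrary seed, chosen only for this example
first_run <- rpois(12, lambda)

set.seed(2018)        # same seed, so the same draws are produced
second_run <- rpois(12, lambda)

print(identical(first_run, second_run))
```

Without a seed, each run of the report produces a different table and a different average error.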

Exercise 3.2: Simulate (with a Poisson distribution) the number of accidents of every month in every year

E32=data.frame(SimYear=c(1:111))
for (i in 1:12){
  E32[i+1]=rpois(111, meanAM2) 
}
head(E32,10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "yellow")
SimYear V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
1 2 6 4 6 5 2 7 2 4 2 7 4
2 11 4 2 7 2 3 5 2 7 10 8 7
3 6 4 5 4 3 8 4 4 1 1 5 6
4 3 5 2 7 5 5 7 10 9 2 1 9
5 1 6 2 4 0 8 8 7 0 4 6 5
6 1 9 6 5 0 6 3 6 3 4 4 6
7 6 6 5 4 6 6 1 4 9 4 6 6
8 3 4 3 5 3 5 1 1 9 3 4 7
9 6 5 3 3 6 2 6 4 1 8 5 7
10 2 7 8 7 4 1 2 6 4 5 2 6
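
The table above is not compared with the real data. As a possible first check, the whole grid can be simulated in one call and its average compared with the rate used. This is a sketch under our own assumptions: the seed is arbitrary, and the rate is recomputed from the totals of exercise 1.2 instead of using meanAM2.

```r
lambda <- 5783 / (111 * 12)   # accidents per month and year (as meanAM2)

set.seed(1908)                # arbitrary seed for reproducibility
# One row per simulated year, one column per month
sim <- matrix(rpois(111 * 12, lambda), nrow = 111, ncol = 12)

print(dim(sim))
print(mean(sim))              # should be close to lambda
```

A fuller comparison would set this grid against the real table of exercise 1.1, with its NA cells read as zero accidents.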

Exercise 3.3: Simulate (with a Poisson distribution) the number of fatalities of every month (sum of 111 years). How accurate is it (error between simulation and reality)?

SimFatalities1=rpois(12, meanFM)
E33=data.frame(Month=1:12,SimFatalities1,RealFatalities=E17M$fatalities, Error=abs(SimFatalities1-E17M$fatalities), RelError=abs(SimFatalities1-E17M$fatalities)/E17M$fatalities) 

E33 %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "red")
Month SimFatalities1 RealFatalities Error RelError
1 9553 9146 407 0.0445003
2 9673 8672 1001 0.1154290
3 9662 9560 102 0.0106695
4 9524 7739 1785 0.2306500
5 9753 8052 1701 0.2112519
6 9686 8714 972 0.1115446
7 9577 10650 1073 0.1007512
8 9584 10806 1222 0.1130853
9 9585 10899 1314 0.1205615
10 9552 8864 688 0.0776173
11 9453 10702 1249 0.1167072
12 9688 11318 1630 0.1440184
RelErrorSF=mean(E33$RelError)
print(RelErrorSF)
## [1] 0.1163988

In this exercise, we must not forget either that this is a simulation, so the average of the error may vary every time that we simulate it.
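
One way to judge whether a single Poisson rate is a reasonable model for the monthly totals is to compare their variance with their mean, since both are equal for a Poisson variable. The sketch below uses the real monthly fatalities from the table in exercise 1.7; the variable names are ours, and with only twelve values the check is indicative at most.

```r
# Monthly fatalities (January to December) from the table in exercise 1.7
real_fatalities <- c(9146, 8672, 9560, 7739, 8052, 8714,
                     10650, 10806, 10899, 8864, 10702, 11318)

mean_fatalities <- mean(real_fatalities)
var_fatalities <- var(real_fatalities)

# For a Poisson variable this ratio is expected to be close to 1
dispersion_ratio <- var_fatalities / mean_fatalities
print(dispersion_ratio)
```

A ratio far above 1 would suggest that the real totals vary more from month to month than a single Poisson rate can reproduce.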

Conclusions

The results that we have observed with our analysis

After having studied several graphics and tables with this dataset about historical plane crashes, we have reached some conclusions:

PART 1

  1. To start with, we have observed that the peak of accidents and the peak of fatalities are both in December. Furthermore, the second half of the year is generally more dangerous, because the accidents that occurred then caused more fatalities.

  2. When analysing the accidents per operator and per type of aircraft, especially when producing graphics, we have seen that, due to the large number of operators and aircraft types, these graphics turn out to be uninformative: we cannot learn much from them unless we limit ourselves to the most frequent ones (the top 5 operators, as can be seen in exercise 1.3).

  3. In exercise 1.6 we have analysed the relation between the number of passengers in the flights and the fatalities of the accident. The table shows how many accidents share the same ratio of fatalities to passengers. In the graphic, the diagonal line (x=y) represents fatalities = passengers (everyone dies). All of the points, which represent the accidents, are on or below this line.

  4. In the last part of the first chapter of our project, as well as analysing the aforementioned trend of accidents depending on the month, we have studied the evolution of the number of accidents throughout the 111 years of our dataset.

4.1. On the one hand, from the graphic obtained we have concluded that in the first years of our dataset the number of accidents is low for two reasons. The first one is the number of flights in those years, which was undoubtedly lower than in recent decades: there were not as many aeroplanes, and flying abroad was not as frequent as it is today. The second reason is that so many years ago some accidents may not have been detected or recorded, so the number of accidents that we see for those years might not be fully reliable.

4.2. On the other hand, the incorporation of new technologies in the aerospace sector, and in daily life in general, is probably the cause of the decrease in the number of accidents in the last years of our dataset, as we can see in the graphic.

PART 2

  1. After having computed all the accidents, fatalities and survivors by month and year, we have analysed several conditional probabilities. We have seen that the probability of more than 50 fatalities in an accident, given that it occurred in December, is close to 10%. Although December is the month with the most accidents and, linked to that, the most fatalities (in total throughout the month), the filtered table shows that most accidents happening in December do not have a very high number of fatalities considering the capacity of the aircraft involved, so 10% makes sense as an answer to the question proposed.

  2. We have also analysed the probability of having used a specific aircraft given that the operator is “Aeroflot” or “Military - U.S. Air Force”, based on a table containing all the accidents of each operator classified by the type of aircraft used.

PART 3

  1. With a Poisson distribution we have simulated the number of accidents of every month, taking into consideration the 111 years of our dataset. In order to analyse its accuracy, we have calculated the error of every measure, as can be seen in the table computed, and its average, which is printed on the screen every time that we run the simulation. Because it is a simulation, the error varies every time that we run it. Running the program several times, we have obtained average relative errors that vary from 0.087 to 0.13, which is not bad.

  2. Also using a Poisson distribution, we have simulated the number of accidents of every month in every year, which can be seen in the table built for this purpose in exercise 3.2.

  3. Finally, we have simulated (with a Poisson distribution again) the number of fatalities of every month (sum of 111 years). To check its accuracy, we have computed the error between this simulation and reality, which is also printed on the screen every time this exercise is simulated. As can be seen, this last simulation is not inaccurate either.