Historical Plane Crashes

Introduction

In this project we work with a dataset of historical plane crashes, covering aerial accidents that occurred between 1908 and 2018. For most accidents the dataset records the date, time, location, operator and type of aircraft, as well as the flight number and registration, the route, the number of people aboard and the number of fatalities. It also includes a brief summary of the cause of each accident.

Our first objective is to convert each variable to an appropriate data type, since the raw file does not store the passengers and fatalities as integers, nor the other fields in the form they ought to have.

After that, our aim is to answer a number of questions that we have set ourselves for the project. The work is divided into three parts:

Part 1. Exploratory Data Analysis

Question 1. What is the number of crashes every month? What is the evolution of crashes throughout the years?

Question 2. Which operators in our dataset have the most crashes?

Question 3. Can we find the number of fatalities of every operator?

Question 4. What is the relation between the fatalities and the number of passengers?

Question 5. Can we compute the total number of fatalities that happened every month? What is its evolution throughout the years?

Part 2. Probability

Question 1. Can we compute a table where all the accidents, survivors and fatalities appear classified by month and year? This will enable us to work with almost any probability that we come across.

Question 2. Given that it is December, what is the probability of more than 50 fatalities in an accident?

Question 3. Given that the operator of the flight is “Aeroflot”, what is the probability of having used a specific type of aircraft? Can we compute the same for the “Military U.S. Air Force”?

Part 3. Random Variable. Simulations.

Question 1. Can we simulate the number of accidents of every month with a Poisson distribution? Can we determine the accuracy of this approximation?

Question 2. Can we simulate the number of accidents of every month in every year with this same distribution, and determine the accuracy of the approximation too?

Question 3. Can we simulate the total number of monthly fatalities? Can its accuracy be determined?

Bearing these questions in mind, we can now start our project. Because some of the tables are large, we will only show a few rows of each of them (its first or last rows).

We first load the libraries that we will need for the project:

library(ggplot2)
library(tidyr)
library(dplyr)
library(lubridate)
library(leaflet)
library(kableExtra)

Preparation of our data

Read the information extracted from the internet

Df1=read.csv("data/planeCrash.csv",header = TRUE,encoding = "UTF8")

Split the date into separate columns (easier to use):

Df2=data.frame(MDY=as.character(Df1$date))
Df3=separate(Df2, MDY, c("MD", "Y"), sep=",")
Df4=separate(Df3, MD, c("M", "D"), sep=" ")

Create a second month column (with numbers instead of names) and define the order of the month names:

nM1=Df4$M

month_order=c("January","February","March","April","May","June","July","August","September","October","November","December")

# match() gives the position of each month name in month_order (NA if the name is not found)
nM=match(nM1, month_order)

Convert the fatalities and passengers to numbers:

Df5=data.frame(passangers=Df1$aboard)
Df6=separate(Df5, passangers, c("passangers", "other"), sep=" Â")

Df7=data.frame(fatalities=Df1$fatalities)
Df8=separate(Df7, fatalities, c("fatalities", "other"), sep=" Â")
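
The split on " Â" depends on how the file happened to be decoded when it was read. As a hedged alternative, the leading number could be extracted with a regular expression, which does not depend on the characters that follow it. This is only a sketch: `leading_count` is a hypothetical helper, and the sample strings are assumed examples of the raw format, not rows taken from the dataset.

```r
# Hypothetical helper: keep only the digits at the start of each string.
leading_count <- function(x) {
  x <- as.character(x)
  has_number <- grepl("^[[:space:]]*[0-9]+", x)
  digits <- sub("^[[:space:]]*([0-9]+).*$", "\\1", x)
  digits[!has_number] <- NA_character_   # no leading number -> NA
  as.numeric(digits)
}

# Assumed sample values; the real column may look different.
raw <- c("2 (passengers:1 crew:1)", "44", "? (passengers:? crew:?)", NA)
leading_count(raw)
```

With the real data the helper would be applied to `Df1$aboard` and `Df1$fatalities` in place of the two `separate()` calls.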

Create the data.frame that we will use in the rest of the project

PC=data.frame(year=as.numeric(Df4$Y), month=Df4$M, Nmonth=as.numeric(nM), day=as.numeric(Df4$D), passangers=as.numeric(Df6$passangers), fatalities=as.numeric(Df8$fatalities), operator=as.character(Df1$operator), aircraft=as.character(Df1$ac_type), description=Df1$summary)

Part 1

Exercise 1.1: Table with all the crashes classified by month and year

E11=PC %>% group_by(year, Nmonth) %>% 
  summarise(accidents=n()) %>%
  spread(key=Nmonth, value=accidents)
tail(E11,10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "yellow")

year	1	2	3	4	5	6	7	8	9	10	11	12
2009	4	7	4	6	3	5	4	5	3	3	7	1
2010	5	2	NA	4	5	3	4	10	4	5	4	2
2011	2	4	1	3	3	1	6	6	9	4	4	3
2012	2	NA	2	3	2	5	NA	2	2	2	3	6
2013	3	1	4	2	1	1	2	2	NA	5	9	3
2014	2	4	3	3	2	1	5	5	2	1	NA	5
2015	1	1	2	1	NA	2	2	2	2	4	3	2
2016	1	3	4	3	1	NA	2	2	NA	1	2	5
2017	1	1	2	NA	3	2	1	NA	NA	1	1	2
2018	1	3	4	2	2	1	1	1	1	1	1	NA

In this table we can easily see all the accidents of our dataset classified by month and year of occurrence. The table is useful not only for seeing the trend of accidents over time (which we compute later), but also because it summarises a significant part of the dataset that we work with throughout the project. Note that NA means that no accidents were recorded in that year and month.
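
If zeros are preferred to NA for months without accidents, a contingency table produces them directly. A minimal sketch on a toy data frame (the `toy` values are invented for illustration; with the real data one would pass `PC` instead):

```r
# Toy stand-in for PC with only the two columns that matter here.
toy <- data.frame(year   = c(2016, 2016, 2016, 2017, 2017),
                  Nmonth = c(1, 1, 3, 2, 3))

# Make Nmonth a factor with all 12 levels so that empty months are kept.
toy$Nmonth <- factor(toy$Nmonth, levels = 1:12)

# Cross-tabulate: rows are years, columns are months, empty cells are 0.
counts <- xtabs(~ year + Nmonth, data = toy)
counts
```

Within the pipeline used above, `spread()` also accepts a `fill` argument, so `spread(key=Nmonth, value=accidents, fill=0)` would be another option.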

Exercise 1.2: What is the number of crashes that happened in every month? What is the evolution of crashes/years?

E12M=PC %>% group_by(Nmonth) %>% summarise(accidents=n())
E12M %>% kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive")) %>%
  column_spec(1,bold=T,border_right = T,background = "yellow")

Nmonth	accidents
1	536
2	432
3	491
4	415
5	402
6	419
7	475
8	525
9	508
10	500
11	508
12	572

TotalAM=sum(E12M$accidents)
meanAM=mean(E12M$accidents)
devAM=sd(E12M$accidents)

ggplot(PC,aes(x=Nmonth))+geom_histogram(binwidth = 0.5)+geom_hline(yintercept=meanAM)+geom_hline(yintercept=meanAM+devAM, linetype="dashed")+geom_hline(yintercept=meanAM-devAM, linetype="dashed")

E12Y=PC %>% group_by(year) %>% summarise(accidents=n())
head(E12Y,10) %>% kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "yellow")

year	accidents
1908	1
1909	1
1912	1
1913	3
1915	2
1916	5
1917	7
1918	4
1919	8
1920	18

TotalAY=sum(E12Y$accidents)
meanAY=mean(E12Y$accidents)
devAY=sd(E12Y$accidents)


ggplot(PC,aes(x=year))+geom_bar()+geom_hline(yintercept=meanAY)+geom_hline(yintercept=meanAY+devAY, linetype="dashed")+geom_hline(yintercept=meanAY-devAY, linetype="dashed")

meanAM2=meanAM/111 #We have data from 111 years (1908-2018)
meanAY2=meanAY/12 

meanTotal=TotalAY/(111*12)

print(TotalAM) #The sum of nº of accidents of all the years = sum of all accidents of all months

## [1] 5783

print(meanTotal)

## [1] 4.341592

print(meanAM2)

## [1] 4.341592

print(meanAY2) #This mean is slightly bigger because the mean of accidents/year does not count the 3 years (1910, 1911 and 1914) for which we have no accidents

## [1] 4.462191
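
The gap between the two means can be checked by hand from the totals printed above. The sketch below only uses numbers that appear in this section; the 108 years with at least one accident are the 111 years minus the 3 missing ones.

```r
total_accidents <- 5783      # printed above as TotalAM
years_span      <- 111       # 1908-2018
years_with_data <- 111 - 3   # 1910, 1911 and 1914 have no rows

# Mean per month-year cell when the empty years count as zeros
mean_all_years <- total_accidents / (years_span * 12)

# Mean per month-year cell when the empty years are left out
mean_observed_years <- total_accidents / (years_with_data * 12)

round(c(mean_all_years, mean_observed_years), 6)
```

The first value corresponds to `meanTotal` and `meanAM2`, the second to `meanAY2`.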

ACCIDENTS per MONTH

In conclusion, most accidents apparently occur during December and January, perhaps because the frequency of flights increases during these months.

Nevertheless, from August until the end of the year the number of accidents reaches or even surpasses 500 every month, which is also noticeable. This is understandable in August, since much of the population travels in summer, which could also increase the frequency of flights.

All in all, in the light of these results winter appears to be the most dangerous season of the year to travel, followed by August, in which the number of accidents is not low either. Spring, by contrast, would be the safest season. These are raw counts, however: without the number of flights per month they cannot be turned into accident rates.

ACCIDENTS per YEAR

This second plot confirms what we expected.

The number of accidents in the first years of the dataset is low. This may be due not only to the very low frequency of flights at the time, but also to the fact that not every accident may have been recorded.

In addition, the number of accidents has decreased noticeably in recent years, plausibly as an effect of the incorporation of advanced technology in aircraft and in the aerospace sector in general.

Exercise 1.3: What is the number of crashes of every company? What are the 5 operators with the most crashes?

E13=PC %>% group_by(operator) %>% summarise(accidents=n())%>%
  arrange(-accidents)

head(E13,10) %>% kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "yellow")

operator	accidents
Aeroflot	260
Military - U.S. Air Force	177
Air France	72
Deutsche Lufthansa	64
United Air Lines	44
China National Aviation Corporation	43
Military - U.S. Army Air Forces	43
Pan American World Airways	41
American Airlines	37
Military - Royal Air Force	36

ggplot(E13[c(1:5),],aes(x=reorder(operator, -accidents), y=accidents))+geom_bar(stat="identity")

ggplot(E13[c(1:50),],aes(x=reorder(operator, -accidents), y=accidents))+geom_bar(stat="identity")

ggplot(E13 ,aes(x=reorder(operator, -accidents), y=accidents))+geom_bar(stat="identity")

Aeroflot, the main national operator in Russia, is the operator with the highest number of accidents on record, followed by the Military - U.S. Air Force. The third operator is Air France, followed by Deutsche Lufthansa, and in fifth position we find United Air Lines, a North American operator.

The second graph shows the 50 operators with the highest number of accidents.

The third graph shows all the operators in the dataset. As we can see, the majority of them have only 1 or 2 accidents, either because that is really the case or because the dataset does not contain enough information about them.

The first (top 5) and second (top 50) graphs are the most useful for a study; the third one (all the operators) is too crowded to be read accurately.

Exercise 1.4: Nº of fatalities (with mean and standard deviation) of every operator

E14=PC %>% group_by(operator) %>% summarise(total_fatalities=sum(fatalities),mean_fatalities=mean(fatalities), deviance_fatalities=sd(fatalities)) %>% arrange(-1*total_fatalities)

head(E14,10) %>% kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "red")

operator	total_fatalities	mean_fatalities	deviance_fatalities
Aeroflot	9048	34.80000	34.306117
Military - U.S. Air Force	3718	21.00565	23.008027
American Airlines	1422	38.43243	64.417708
Pan American World Airways	1303	31.78049	49.358136
Military - U.S. Army Air Forces	1070	24.88372	9.781443
United Air Lines	1019	23.15909	23.634763
AVIANCA	941	39.20833	45.066115
Turkish Airlines (THY)	891	63.64286	90.620601
Indian Airlines	861	25.32353	30.985550
China Airlines (Taiwan)	847	60.50000	92.945600

ggplot(PC,aes(as.factor(operator), fatalities))+geom_boxplot()

These are the box plots of the fatalities for all operators. Because of the large number of operators in our dataset, the individual box plots are too crowded to give us relevant information, as we can see.
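
One way to make such a plot readable is to keep only the operators with the most accidents before plotting. A minimal sketch, where `top_levels` is a hypothetical helper and `toy` is invented data standing in for `PC`:

```r
# Hypothetical helper: the n most frequent values of x, most frequent first.
top_levels <- function(x, n) {
  freq <- sort(table(x), decreasing = TRUE)
  names(freq)[seq_len(min(n, length(freq)))]
}

toy <- data.frame(
  operator   = c("A", "A", "A", "B", "B", "C"),
  fatalities = c(10, 20, 30, 5, 15, 50)
)

keep  <- top_levels(toy$operator, 2)
small <- toy[toy$operator %in% keep, ]

# The reduced data frame can then be plotted as before, for example:
# ggplot(small, aes(as.factor(operator), fatalities)) + geom_boxplot()
small
```

The same filter would apply to the aircraft types, using the `aircraft` column instead of `operator`.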

Exercise 1.5: Nº of fatalities (with mean and standard deviation) of every type of aircraft

E15=PC %>% group_by(aircraft) %>% summarise(total_fatalities=sum(fatalities),mean_fatalities=mean(fatalities), deviance_fatalities=sd(fatalities)) %>% arrange(-1*total_fatalities)

head(E15,10) %>% kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "red")

aircraft	total_fatalities	mean_fatalities	deviance_fatalities
Douglas DC-3	4793	14.055718	9.668314
Douglas DC-6B	1054	37.642857	29.581293
Antonov AN-26	1042	28.944444	17.456795
Ilyushin IL-18B	1008	67.200000	29.854887
McDonnell Douglas DC-9-32	951	50.052632	37.171485
Douglas DC-4	937	22.853659	23.367243
de Havilland Canada DHC-6 Twin Otter 300	848	9.860465	6.832729
Yakovlev YAK-40	828	22.378378	17.007109
Tupolev TU-134A	808	47.529412	28.670799
McDonnell Douglas DC-10-10	804	134.000000	143.634258

ggplot(PC,aes(as.factor(aircraft), fatalities))+geom_boxplot()

In this exercise we come to the same conclusion as in the previous one. Because of the large number of different aircraft types in the dataset, the individual box plots do not give us relevant information, as we can see in the graph above.

Exercise 1.6: Relation between fatalities and nº of passengers (table and graphic representation):

E16T=PC %>% group_by(FxP=fatalities/passangers) %>% summarise(cases=n()) %>% arrange(-1*(FxP))

head(E16T,10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "red")

FxP	cases
1.0000000	3810
0.9939024	1
0.9935484	1
0.9934641	1
0.9925926	1
0.9923664	1
0.9911504	1
0.9909910	1
0.9903846	1
0.9902913	1

E16R=PC %>% group_by(FxP=round(fatalities/passangers, 2)) %>% summarise(cases=n()) %>% arrange(-1*(FxP))

head(E16R,10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "red")

FxP	cases
1.00	3810
0.99	16
0.98	21
0.97	24
0.96	33
0.95	32
0.94	35
0.93	31
0.92	22
0.91	25

ggplot(PC, aes(x=passangers, y=fatalities))+geom_point()+geom_abline(slope=1, linetype='dashed')

In this exercise we have studied the relation between the fatalities and the number of passengers in each accident. Each point in the graph stands for one accident.

The diagonal line (x=y) represents the accidents where the number of fatalities equals the number of passengers (that is, everyone on the flight died). As expected, all of the points lie on or below this line.
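
This property can also be checked numerically instead of by eye. A small sketch on invented data (`toy` stands in for `PC`; the column name `passangers` is kept as spelled in the code above):

```r
toy <- data.frame(passangers = c(10, 50, 120, NA),
                  fatalities = c(10, 12, 120, 3))

# Fraction of the people aboard who died in each accident
ratio <- toy$fatalities / toy$passangers

# TRUE if some accident reports more fatalities than people aboard
above_diagonal <- any(ratio > 1, na.rm = TRUE)

# Share of accidents (with known counts) in which nobody survived
no_survivors <- mean(ratio == 1, na.rm = TRUE)

list(above_diagonal = above_diagonal, no_survivors = no_survivors)
```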

Exercise 1.7: What is the number of fatalities that happened in every month? What is the evolution of fatalities/year?

E17M=PC %>% group_by(Nmonth) %>% summarise(fatalities=sum(fatalities, na.rm=TRUE))
E17M %>% kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive")) %>%
  column_spec(1,bold=T,border_right = T,background = "red")

Nmonth	fatalities
1	9146
2	8672
3	9560
4	7739
5	8052
6	8714
7	10650
8	10806
9	10899
10	8864
11	10702
12	11318

TotalFM=sum(E17M$fatalities)
meanFM=mean(E17M$fatalities)
devFM=sd(E17M$fatalities)

ggplot(E17M,aes(x=Nmonth, y=fatalities))+geom_bar(stat="identity")+geom_hline(yintercept=meanFM)+geom_hline(yintercept=meanFM+devFM, linetype="dashed")+geom_hline(yintercept=meanFM-devFM, linetype="dashed")

E17Y=PC %>% group_by(year) %>% summarise(fatalities=sum(fatalities, na.rm=TRUE))
head(E17Y,10) %>% kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "red")

year	fatalities
1908	1
1909	1
1912	5
1913	45
1915	40
1916	108
1917	138
1918	65
1919	20
1920	25

TotalFY=sum(E17Y$fatalities)
meanFY=mean(E17Y$fatalities)
devFY=sd(E17Y$fatalities)


ggplot(E17Y,aes(x=year, y=fatalities))+geom_bar(stat="identity")+geom_hline(yintercept=meanFY)+geom_hline(yintercept=meanFY+devFY, linetype="dashed")+geom_hline(yintercept=meanFY-devFY, linetype="dashed")

meanFM2=meanFM/111 #We have data from 111 years (1908-2018)
meanFY2=meanFY/12 

meanTotalF=TotalFY/(111*12)

print(TotalFM) #The sum of nº of fatalities of all the years = sum of all fatalities of all months

## [1] 115122

print(meanTotalF)

## [1] 86.42793

print(meanFM2)

## [1] 86.42793

print(meanFY2) #This mean is slightly bigger because the mean of fatalities/year does not count the 3 years (1910, 1911 and 1914) for which we have no accidents

## [1] 88.8287

In these plots we have studied the total number of fatalities per month and per year.

MONTHLY

As could be expected given that December is the month with the most accidents, December is also the month with the most fatalities. Surprisingly, from July to September the number of fatalities is also very high relative to the number of accidents in those months, which is not as high as in December. January, on the other hand, has a high number of accidents, so we might expect its fatalities to be similar to December's; instead, they are noticeably lower (9146 against 11318). Accidents occurring in the second half of the year therefore appear to cause more fatalities than those at the beginning of the year. As for January and December, although both have many accidents, it seems safer to travel in January than in December.

THROUGHOUT THE YEARS

In the second plot we find a distribution similar to that of the accidents per year computed before. The reasons may be the same: information is possibly missing for the first years of the dataset, and new technologies have had a positive effect in recent years, in which the fatalities decrease significantly.

Part 2

We will consider that in our exercise all the probabilities are conditional probabilities given that there has been an accident; P(accident)=1.

Exercise 2.1: All the accidents, fatalities and survivors by month and year

E21A=PC %>% group_by(year, Nmonth) %>% 
  summarise(Accidents=n()) %>%
  spread(key=Nmonth, value=Accidents)
tail(E21A,10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "yellow")

year	1	2	3	4	5	6	7	8	9	10	11	12
2009	4	7	4	6	3	5	4	5	3	3	7	1
2010	5	2	NA	4	5	3	4	10	4	5	4	2
2011	2	4	1	3	3	1	6	6	9	4	4	3
2012	2	NA	2	3	2	5	NA	2	2	2	3	6
2013	3	1	4	2	1	1	2	2	NA	5	9	3
2014	2	4	3	3	2	1	5	5	2	1	NA	5
2015	1	1	2	1	NA	2	2	2	2	4	3	2
2016	1	3	4	3	1	NA	2	2	NA	1	2	5
2017	1	1	2	NA	3	2	1	NA	NA	1	1	2
2018	1	3	4	2	2	1	1	1	1	1	1	NA

E21F=PC %>% group_by(year, Nmonth) %>% 
  summarise(Fatalities=sum(fatalities)) %>%
  spread(key=Nmonth, value=Fatalities)
tail(E21F,10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "red")

year	1	2	3	4	5	6	7	8	9	10	11	12
2009	23	107	44	65	120	397	226	46	14	11	36	6
2010	104	18	NA	114	318	26	173	117	38	51	104	24
2011	80	35	9	57	56	47	191	75	146	39	17	12
2012	7	NA	14	173	60	187	NA	35	29	18	18	55
2013	30	5	28	7	4	20	13	6	NA	92	121	15
2014	4	109	246	18	22	49	484	55	14	2	NA	186
2015	37	40	160	7	NA	131	12	55	10	249	63	10
2016	2	26	94	30	66	NA	45	5	NA	5	75	171
2017	4	5	13	NA	6	125	16	NA	NA	4	11	13
2018	12	140	100	258	121	10	1	20	1	189	1	NA

E21S=PC %>% group_by(year, Nmonth) %>% 
  summarise(Survivors=sum(passangers)-sum(fatalities)) %>%
  spread(key=Nmonth, value=Survivors)
tail(E21S,10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "green")

year	1	2	3	4	5	6	7	8	9	10	11	12
2009	156	129	1	10	14	1	142	71	5	9	23	1
2010	8	0	NA	11	8	0	2	190	37	0	3	167
2011	149	6	0	6	0	5	68	7	15	17	3	0
2012	3	NA	0	10	6	4	NA	1	4	8	0	72
2013	0	40	3	108	0	0	304	2	NA	18	33	8
2014	5	4	0	0	1	0	11	11	5	2	NA	10
2015	0	18	0	0	NA	0	4	5	4	0	13	4
2016	0	80	1	0	0	NA	0	300	NA	0	6	1
2017	0	0	0	NA	1	0	0	NA	NA	6	0	24
2018	0	4	21	148	1	0	18	0	46	0	127	NA

These tables of accidents, fatalities and survivors classified by month and year are very useful for computing probabilities. Marginal probabilities are quickly deduced from them, and they are also a useful tool in the calculation of conditional probabilities.

Exercise 2.2: Given that it is December, probability of more than 50 fatalities

AccidentsDecember <- E12M$accidents[E12M$Nmonth==12]
totalAccidents <- TotalAM
pDecember <- AccidentsDecember/totalAccidents

filtrarJoinedProb <- PC %>% filter(fatalities>50,Nmonth==12)
countJoinedProb <- count(filtrarJoinedProb) #number of accidents in December with more than 50 fatalities

JoinedProb <- countJoinedProb/totalAccidents

conditionalProb <- JoinedProb/pDecember

print(as.numeric(conditionalProb))

## [1] 0.0979021

percentage=(as.numeric(conditionalProb))*100
cat("\nThis conditional probability equals the", percentage,"%")

## 
## This conditional probability equals the 9.79021 %
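
Since the joint and the marginal probability share the same denominator (the total number of accidents), it cancels out, and the conditional probability is simply the share of December accidents with more than 50 fatalities. A minimal sketch on invented data, with `toy` standing in for `PC`:

```r
toy <- data.frame(Nmonth     = c(12, 12, 12, 12, 1, 1),
                  fatalities = c(60, 10, 3, 75, 100, 2))

# Long route, as in the exercise: P(A and B) / P(B)
p_december <- mean(toy$Nmonth == 12)
p_joint    <- mean(toy$fatalities > 50 & toy$Nmonth == 12)
p_long     <- p_joint / p_december

# Short route: restrict to December, then take the share with > 50 fatalities
december <- toy[toy$Nmonth == 12, ]
p_short  <- mean(december$fatalities > 50)

c(long = p_long, short = p_short)
```

Both routes give the same value; the short one avoids carrying the total number of accidents around.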

Exercise 2.3: Given that the operator is “Aeroflot”, probability of having used a specific aircraft

#AEROFLOT
E231 <- PC%>%filter(operator=='Aeroflot')%>%group_by(aircraft)%>%summarise(count_Aeroflot=n())%>%arrange(-count_Aeroflot)
head(E231,10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "orange")

aircraft	count_Aeroflot
Yakovlev YAK-40	19
Antonov AN-24	13
Ilyushin IL-12	13
Ilyushin IL-14P	11
Tupolev TU-104B	10
Tupolev TU-134A	10
Ilyushin IL-18B	9
Li-2	8
Tupolev TU-124	8
Antonov An-24B	6

#MILITARY - U.S AIRFORCE
E232 <- PC%>%filter(operator=='Military - U.S. Air Force')%>%group_by(aircraft)%>%summarise(count_Military_US_AirForce=n())%>%arrange(-count_Military_US_AirForce)
head(E232,10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "orange")

aircraft	count_Military_US_AirForce
Boeing KC-135A	15
Lockheed C-130E Hercules	12
Lockheed C-130H Hercules	6
Lockheed C-130A Hercules	5
Douglas C-47D	4
Fairchild C-123K	4
Lockheed AC-130A Hercules	4
Lockheed C-130H	4
Boeing B-29	3
Douglas C-124A Globemaster	3

#AIRFRANCE
E233 <- PC%>%filter(operator=='Air France')%>%group_by(aircraft)%>%summarise(count_AirFrance=n())%>%arrange(-count_AirFrance)
head(E233,10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "orange")

aircraft	count_AirFrance
Dewoitine D-338	4
Douglas DC-3	4
Boeing B-707-328	2
Douglas DC-3D	2
Douglas DC-4	2
Douglas DC-4-1009	2
Junkers JU-52/3m	2
Potez 621	2
Aerospatiale BAe Concorde 101	1
Airbus A-340	1

#DEUTSCHE LUFTHANSA
E234 <- PC%>%filter(operator=='Deutsche Lufthansa')%>%group_by(aircraft)%>%summarise(count_DeutscheLufthansa=n())%>%arrange(-count_DeutscheLufthansa)
head(E234,10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "orange")

aircraft	count_DeutscheLufthansa
Junkers JU-52/3m	16
Junkers F-13	7
Dornier Merkur	3
Focke-Wulf FW 200	3
Fokker FG III	3
Junkers JU-52	3
Douglas DC-3	2
Heinkel He-70	2
AEGK	1
Arado V1	1

#UNITED AIR LINES
E235 <- PC%>%filter(operator=='United Air Lines')%>%group_by(aircraft)%>%summarise(count_UnitedAirLines=n())%>%arrange(-count_UnitedAirLines)
head(E235,10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "orange")

aircraft	count_UnitedAirLines
Douglas DC-3	3
Douglas DC-3A	3
Douglas DC-6	3
Douglas DC-6B	3
Boeing 247	2
Boeing B-727-22	2
Boeing B-747-122	2
Douglas DC-4	2
Douglas DST-A-207A	2
Vickers Viscount 745D	2

Each of these tables corresponds to one of the top five operators (the five with the most crashes) and lists the aircraft types involved in its accidents, together with the number of accidents for each type.

Exercise 2.3.1. Given that we have had an accident with Aeroflot, what is the probability that the aircraft was a Yakovlev YAK-40 or a Douglas C-47?

total_Aeroflot <- sum(E231$count_Aeroflot)
Conditional_Yakovlev<-  (E231%>%filter(aircraft=="Yakovlev YAK-40")%>%select(count_Aeroflot))/total_Aeroflot
Conditional_Yakovlev <- as.numeric(Conditional_Yakovlev)
cat("Probability that in these conditions the aircraft is a Yakovlev:",Conditional_Yakovlev,",which is a",round((Conditional_Yakovlev*100),2),"%")

## Probability that in these conditions the aircraft is a Yakovlev: 0.07307692 ,which is a 7.31 %

Conditional_Douglas<-  (E231%>%filter(aircraft=="Douglas C-47")%>%select(count_Aeroflot))/total_Aeroflot 
Conditional_Douglas <- as.numeric(Conditional_Douglas)
cat("\nProbability that in these conditions the aircraft is a Douglas C-47:",Conditional_Douglas,",which is a",round((Conditional_Douglas*100),2),"%")

## 
## Probability that in these conditions the aircraft is a Douglas C-47: 0.007692308 ,which is a 0.77 %

Exercise 2.3.2. Given that we have had an accident with Military - U.S. Air Force, what is the probability that the aircraft was a Boeing KC-135A or a Fairchild C-119C?

total_Military <- sum(E232$count_Military_US_AirForce)

Conditional_Military<-  (E232%>%filter(aircraft=="Boeing KC-135A")%>%select(count_Military_US_AirForce))/total_Military
Conditional_Military <- as.numeric(Conditional_Military)
cat("Probability that in these conditions the aircraft is a Boeing KC-135A:",Conditional_Military,",which is a",round((Conditional_Military*100),2),"%")

## Probability that in these conditions the aircraft is a Boeing KC-135A: 0.08474576 ,which is a 8.47 %

Conditional_Fairchild <-  (E232%>%filter(aircraft=="Fairchild C-119C")%>%select(count_Military_US_AirForce))/total_Military 
Conditional_Fairchild <- as.numeric(Conditional_Fairchild)
cat("\nProbability that in these conditions the aircraft is a Fairchild C-119C:",Conditional_Fairchild,",which is a",round((Conditional_Fairchild*100),2),"%")

## 
## Probability that in these conditions the aircraft is a Fairchild C-119C: 0.01694915 ,which is a 1.69 %
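
Instead of computing one aircraft at a time, the whole conditional distribution for an operator can be obtained in one step. A hedged sketch on invented data (`toy` stands in for `PC`, and `aircraft_given_operator` is a hypothetical helper):

```r
toy <- data.frame(
  operator = c("Aeroflot", "Aeroflot", "Aeroflot", "Aeroflot", "Air France"),
  aircraft = c("Yakovlev YAK-40", "Yakovlev YAK-40", "Antonov AN-24",
               "Li-2", "Douglas DC-3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: P(aircraft | operator) for every aircraft of one operator.
aircraft_given_operator <- function(data, op) {
  used <- data$aircraft[data$operator == op]
  sort(prop.table(table(used)), decreasing = TRUE)
}

probs <- aircraft_given_operator(toy, "Aeroflot")
probs
```

Each entry is the count for that aircraft divided by the operator's total, which is the same ratio computed in the two exercises above.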

Part 3

Exercise 3.1: Simulate (with a Poisson distribution) the number of accidents of every month (sum of 111 years). How accurate is it?

SimAccidents1=rpois(12, meanAM)
E31=data.frame(Month=1:12,SimAccidents1,RealAccidents=E12M$accidents, Error=abs(SimAccidents1-E12M$accidents), RelError=abs(SimAccidents1-E12M$accidents)/E12M$accidents) 

E31 %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "yellow")

Month	SimAccidents1	RealAccidents	Error	RelError
1	502	536	34	0.0634328
2	460	432	28	0.0648148
3	501	491	10	0.0203666
4	446	415	31	0.0746988
5	485	402	83	0.2064677
6	451	419	32	0.0763723
7	523	475	48	0.1010526
8	457	525	68	0.1295238
9	437	508	71	0.1397638
10	486	500	14	0.0280000
11	478	508	30	0.0590551
12	446	572	126	0.2202797

RelErrorSA=mean(E31$RelError)
print(RelErrorSA)

## [1] 0.09865234

It must be taken into account that this is a simulation, so the average error changes every time we run it.
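
To see how much the error varies between runs without re-running the document by hand, the simulation can be repeated many times with a fixed seed. A minimal sketch: the monthly counts are copied from the table in exercise 1.2, while the seed and the number of repetitions are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

real <- c(536, 432, 491, 415, 402, 419, 475, 525, 508, 500, 508, 572)
lambda <- mean(real)  # plays the role of meanAM

# One run: simulate 12 months and return the mean relative error
one_run <- function() {
  sim <- rpois(length(real), lambda)
  mean(abs(sim - real) / real)
}

errors <- replicate(1000, one_run())
summary(errors)
```

The summary gives the range of mean relative errors that a single run of the exercise could produce.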

Exercise 3.2: Simulate (with a Poisson distribution) the number of accidents of every month in every year

E32=data.frame(SimYear=c(1:111))
for (i in 1:12){
  E32[i+1]=rpois(111, meanAM2) 
}
head(E32,10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "yellow")

SimYear	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11	V12	V13
1	2	6	4	6	5	2	7	2	4	2	7	4
2	11	4	2	7	2	3	5	2	7	10	8	7
3	6	4	5	4	3	8	4	4	1	1	5	6
4	3	5	2	7	5	5	7	10	9	2	1	9
5	1	6	2	4	0	8	8	7	0	4	6	5
6	1	9	6	5	0	6	3	6	3	4	4	6
7	6	6	5	4	6	6	1	4	9	4	6	6
8	3	4	3	5	3	5	1	1	9	3	4	7
9	6	5	3	3	6	2	6	4	1	8	5	7
10	2	7	8	7	4	1	2	6	4	5	2	6
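
The default column names V2 to V13 hide which column is which month. One possible variant, shown only as a sketch, fills a matrix in a single call and labels its columns. The value of `lambda` is the `meanAM2` printed earlier, the seed is arbitrary, and `E32_named` is a name introduced here for illustration.

```r
set.seed(1)          # arbitrary seed for reproducibility
lambda <- 4.341592   # meanAM2 as printed in exercise 1.2

# 111 simulated years x 12 months, all drawn from the same Poisson mean
sim <- matrix(rpois(111 * 12, lambda), nrow = 111, ncol = 12,
              dimnames = list(NULL, month.name))

E32_named <- data.frame(SimYear = 1:111, sim, check.names = FALSE)
head(E32_named, 3)
```

`month.name` is the built-in vector of English month names, so no extra package is needed.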

Exercise 3.3: Simulate (with a Poisson distribution) the number of fatalities of every month (sum of 111 years). How accurate is it (error between simulation and reality)?

SimFatalities1=rpois(12, meanFM)
E33=data.frame(Month=1:12,SimFatalities1,RealFatalities=E17M$fatalities, Error=abs(SimFatalities1-E17M$fatalities), RelError=abs(SimFatalities1-E17M$fatalities)/E17M$fatalities) 

E33 %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
  column_spec(1,bold=T,border_right = T,background = "red")

Month	SimFatalities1	RealFatalities	Error	RelError
1	9553	9146	407	0.0445003
2	9673	8672	1001	0.1154290
3	9662	9560	102	0.0106695
4	9524	7739	1785	0.2306500
5	9753	8052	1701	0.2112519
6	9686	8714	972	0.1115446
7	9577	10650	1073	0.1007512
8	9584	10806	1222	0.1130853
9	9585	10899	1314	0.1205615
10	9552	8864	688	0.0776173
11	9453	10702	1249	0.1167072
12	9688	11318	1630	0.1440184

RelErrorSF=mean(E33$RelError)
print(RelErrorSF)

## [1] 0.1163988

In this exercise too, we must not forget that this is a simulation, so the average error changes every time we run it.
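
A Poisson variable has a variance equal to its mean, which gives a quick, informal check on how well the model fits. The sketch below compares the spread of the observed monthly fatalities (copied from the table in exercise 1.7) with the spread that a Poisson model with the same mean would imply. It is an illustrative addition, not a formal goodness-of-fit test.

```r
real <- c(9146, 8672, 9560, 7739, 8052, 8714,
          10650, 10806, 10899, 8864, 10702, 11318)

lambda <- mean(real)          # plays the role of meanFM

observed_sd <- sd(real)       # spread of the real monthly totals
poisson_sd  <- sqrt(lambda)   # spread implied by Poisson(lambda)

c(observed = observed_sd, poisson = poisson_sd)
```

When the observed spread is much larger than the Poisson one, the real totals vary more between months than the simulation can reproduce, which would be consistent with the simulated values in the table above staying so close to the mean.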

Conclusions

The results that we have observed with our analysis

After studying several plots and tables built from this dataset of historical plane crashes, we have reached the following conclusions:

PART 1

1. To start with, we have observed that the peak of accidents and the peak of fatalities both fall in December. Furthermore, the second half of the year is generally more problematic, because the accidents that occur then cause more fatalities.
2. When analysing the accidents per operator and the fatalities per type of aircraft, especially in the plots, we have seen that the large number of operators and aircraft types makes the plots uninformative: we cannot learn much from them unless we restrict ourselves to the most frequent ones (the top 5 operators, as can be seen in exercise 1.3).
3. In exercise 1.6 we have analysed the relation between the number of passengers on each flight and the fatalities in the accident. The table shows how many accidents share the same ratio of fatalities to passengers. In the plot, the diagonal line (x=y) represents fatalities = passengers (everyone dies). As expected, all of the points, which represent the accidents, lie on or below this line.
4. In the last part of the first chapter of our project, as well as analysing the aforementioned monthly pattern of accidents, we have studied the evolution of the number of accidents throughout the 111 years of our dataset.

4.1. On the one hand, from the plot we have concluded that the number of accidents in the first years of our dataset is low for two reasons. The first is the number of flights in those years, which was undoubtedly lower than in recent decades: there were not as many aeroplanes, and flying was not as frequent as it is today. The second is that, so many years ago, some accidents may not have been detected or recorded, so the number of accidents that our dataset shows for those years might not be fully reliable.

4.2. On the other hand, the incorporation of new technologies in the aerospace sector, and in daily life in general, is likely to be the cause of the decrease in the number of accidents in the last years of our dataset, as we can see in the plot.

PART 2

After computing all the accidents, fatalities and survivors by month and year, we have analysed several conditional probabilities. We have seen that the probability of more than 50 fatalities in an accident, given that it occurred in December, is close to 10%. Although December is the month with the most accidents and, linked to that, the most fatalities in total, the filtered table shows that most accidents in December do not have a very high number of fatalities considering the capacity of the aircraft involved, so 10% makes sense as an answer to the question posed.
We have also analysed the probability of a specific aircraft having been used given that the operator is “Aeroflot” or “Military - U.S. Air Force”, based on tables that classify the accidents of each operator by type of aircraft.

PART 3

With a Poisson distribution we have simulated the number of accidents of every month, taking into consideration the 111 years of our dataset. To analyse its accuracy, we have calculated the error of every value, as can be seen in the table, and its average, which is printed every time we run the simulation. Because it is a simulation, the error varies from run to run. Running the program several times we have obtained mean relative errors between 0.087 and 0.13, which is not bad.
Also using a Poisson distribution, we have simulated the number of accidents of every month in every year, which can be seen in the table built for this purpose in exercise 3.2.
Finally, we have simulated (again with a Poisson distribution) the number of fatalities of every month (sum of 111 years). To check its accuracy, we have computed the error between this simulation and reality, which is also printed every time the exercise is run. As can be seen, this last simulation is not inaccurate either.