In this project we work with a dataset of historical plane crashes, covering aerial accidents that occurred between 1908 and 2018. For most accidents the dataset records the date, time, location, operator and type of aircraft, as well as the flight number and registration, the route, the number of passengers aboard and the number of fatalities. It also includes a brief summary of the cause of each accident.
To start with, our first objective is to convert each variable to an appropriate data type, because the raw dataset does not read the passengers and fatalities as integers, and other fields are likewise not stored as the types they ought to be.
After having done so, our aim is to analyse and be able to give an answer to some questions that we have asked ourselves for the project. This will be divided into three different parts, each of which tries to compute:
Question 1. What is the number of crashes every month? What is the evolution of crashes throughout the years?
Question 2. Which operators in our dataset have the most crashes?
Question 3. Can we find the number of fatalities of every operator?
Question 4. What is the relation between the fatalities and the number of passengers?
Question 5. Can we compute the total number of fatalities that happened every month? What is its evolution throughout the years?
Question 1. Can we compute a table where all the accidents, survivors and fatalities appear classified by month and year? This will enable us to work with almost any probability that we come across.
Question 2. Given that it is December, what is the probability of more than 50 fatalities in an accident?
Question 3. Given that the operator of the flight is “Aeroflot”, what is the probability of having used a specific type of aircraft? Can we compute the same for the “Military - U.S. Air Force”?
Question 1. Can we simulate the number of accidents of every month with a Poisson distribution? Can we determine the accuracy of this approximation?
Question 2. Can we simulate the number of accidents of every month in every year with this same distribution, and determine the accuracy of the approximation too?
Question 3. Can we simulate the total number of monthly fatalities? Can its accuracy be determined?
Bearing these questions in mind, we can now start our project. Because some of the tables are large, we will only show a few rows of each of them (its first or last rows).
library(ggplot2)
library(tidyr)
library(dplyr)
library(lubridate)
library(leaflet)
library(kableExtra)
Df1=read.csv("data/planeCrash.csv",header = TRUE,encoding = "UTF8")
Df2=data.frame(MDY=as.character(Df1$date))
Df3=separate(Df2, MDY, c("MD", "Y"), sep=",")
Df4=separate(Df3, MD, c("M", "D"), sep=" ")
nM1=Df4$M
month_order=c("January","February","March","April","May","June","July","August","September","October","November","December")
# Map each month name to its number (1-12); names that do not match become NA
nM=match(nM1, month_order)
Df5=data.frame(passangers=Df1$aboard)
Df6=separate(Df5, passangers, c("passangers", "other"), sep=" Â")
Df7=data.frame(fatalities=Df1$fatalities)
Df8=separate(Df7, fatalities, c("fatalities", "other"), sep=" Â")
PC=data.frame(year=as.numeric(Df4$Y), month=Df4$M, Nmonth=as.numeric(nM), day=as.numeric(Df4$D), passangers=as.numeric(Df6$passangers), fatalities=as.numeric(Df8$fatalities), operator=as.character(Df1$operator), aircraft=as.character(Df1$ac_type), description=Df1$summary)
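The cleaning steps above can also be sketched with base R only. This is a minimal sketch, not the method used in this report: the helper names `parse_crash_date` and `leading_count` are hypothetical, and the input formats (dates written like "September 17, 1908", counts that start with an integer followed by other text) are assumptions inferred from the `separate()` calls.

```r
# Hypothetical helpers; the input formats are assumptions, not checked against the dataset.

# Split a date such as "September 17, 1908" into year, month name, month number and day.
parse_crash_date <- function(x) {
  parts <- regmatches(x, regexec("^ *([A-Za-z]+) +([0-9]+), *([0-9]{4}) *$", x))
  # Entries that do not match the pattern come back empty; map them to NA.
  parts <- lapply(parts, function(p) if (length(p) == 4) p else rep(NA_character_, 4))
  m <- do.call(rbind, parts)
  data.frame(
    year   = as.numeric(m[, 4]),
    month  = m[, 2],
    Nmonth = match(m[, 2], month.name),  # month.name is a base R constant
    day    = as.numeric(m[, 3]),
    stringsAsFactors = FALSE
  )
}

# Keep only the leading integer of a count field; anything else becomes NA.
leading_count <- function(x) {
  out <- rep(NA_real_, length(x))
  ok <- grepl("^ *[0-9]+", x)
  out[ok] <- as.numeric(sub("^ *([0-9]+).*$", "\\1", x[ok]))
  out
}

parse_crash_date(c("September 17, 1908", "December 3, 2018"))
leading_count(c("2", "44 (passengers: 40, crew: 4)", "?"))
```

Extracting the leading digits avoids depending on the exact characters that follow the number, which may change with the file encoding.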
E11=PC %>% group_by(year, Nmonth) %>%
summarise(accidents=n()) %>%
spread(key=Nmonth, value=accidents)
tail(E11,10) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
column_spec(1,bold=T,border_right = T,background = "yellow")
| year | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2009 | 4 | 7 | 4 | 6 | 3 | 5 | 4 | 5 | 3 | 3 | 7 | 1 |
| 2010 | 5 | 2 | NA | 4 | 5 | 3 | 4 | 10 | 4 | 5 | 4 | 2 |
| 2011 | 2 | 4 | 1 | 3 | 3 | 1 | 6 | 6 | 9 | 4 | 4 | 3 |
| 2012 | 2 | NA | 2 | 3 | 2 | 5 | NA | 2 | 2 | 2 | 3 | 6 |
| 2013 | 3 | 1 | 4 | 2 | 1 | 1 | 2 | 2 | NA | 5 | 9 | 3 |
| 2014 | 2 | 4 | 3 | 3 | 2 | 1 | 5 | 5 | 2 | 1 | NA | 5 |
| 2015 | 1 | 1 | 2 | 1 | NA | 2 | 2 | 2 | 2 | 4 | 3 | 2 |
| 2016 | 1 | 3 | 4 | 3 | 1 | NA | 2 | 2 | NA | 1 | 2 | 5 |
| 2017 | 1 | 1 | 2 | NA | 3 | 2 | 1 | NA | NA | 1 | 1 | 2 |
| 2018 | 1 | 3 | 4 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | NA |
This table shows all the accidents in our dataset classified by month and year of occurrence. It is useful not only for seeing the trend or evolution of accidents (which we compute later), but also because it summarises a substantial part of the data that we will work with throughout the project. Note that NA means that no accidents were recorded in that year and month.
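If zeros are preferred to NA for the empty cells, one option is to build the cross-table with base R `table()`, which reports empty combinations as 0. The sketch below uses toy records invented purely for illustration. With the pipeline above, passing `fill=0` to `spread()` should have a similar effect.

```r
# Toy records, invented purely for illustration (not taken from the dataset).
toy <- data.frame(year   = c(2017, 2017, 2018, 2018, 2018),
                  Nmonth = c(1, 1, 3, 12, 12))

# Fixing the factor levels to 1:12 keeps every month as a column,
# and table() reports empty year/month cells as 0 instead of NA.
counts <- table(year = toy$year, Nmonth = factor(toy$Nmonth, levels = 1:12))
counts
```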
E12M=PC %>% group_by(Nmonth) %>% summarise(accidents=n())
E12M %>% kable() %>%
kable_styling(bootstrap_options = c("striped","hover","responsive")) %>%
column_spec(1,bold=T,border_right = T,background = "yellow")
| Nmonth | accidents |
|---|---|
| 1 | 536 |
| 2 | 432 |
| 3 | 491 |
| 4 | 415 |
| 5 | 402 |
| 6 | 419 |
| 7 | 475 |
| 8 | 525 |
| 9 | 508 |
| 10 | 500 |
| 11 | 508 |
| 12 | 572 |
TotalAM=sum(E12M$accidents)
meanAM=mean(E12M$accidents)
devAM=sd(E12M$accidents)
ggplot(PC,aes(x=Nmonth))+geom_histogram(binwidth = 0.5)+geom_hline(yintercept=meanAM)+geom_hline(yintercept=meanAM+devAM, linetype="dashed")+geom_hline(yintercept=meanAM-devAM, linetype="dashed")
E12Y=PC %>% group_by(year) %>% summarise(accidents=n())
head(E12Y,10) %>% kable() %>%
kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
column_spec(1,bold=T,border_right = T,background = "yellow")
| year | accidents |
|---|---|
| 1908 | 1 |
| 1909 | 1 |
| 1912 | 1 |
| 1913 | 3 |
| 1915 | 2 |
| 1916 | 5 |
| 1917 | 7 |
| 1918 | 4 |
| 1919 | 8 |
| 1920 | 18 |
TotalAY=sum(E12Y$accidents)
meanAY=mean(E12Y$accidents)
devAY=sd(E12Y$accidents)
ggplot(PC,aes(x=year))+geom_bar()+geom_hline(yintercept=meanAY)+geom_hline(yintercept=meanAY+devAY, linetype="dashed")+geom_hline(yintercept=meanAY-devAY, linetype="dashed")
meanAM2=meanAM/111 #We have data from 111 years (1908-2018)
meanAY2=meanAY/12
meanTotal=TotalAY/(111*12)
print(TotalAM) #The total of accidents summed over the months equals the total summed over the years
## [1] 5783
print(meanTotal)
## [1] 4.341592
print(meanAM2)
## [1] 4.341592
print(meanAY2) #This mean is slightly bigger because the mean of accidents/year does not count the 3 years (1910, 1911 and 1914) in which we have no accidents
## [1] 4.462191
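The difference between the two averages can be reproduced by hand from the printed total. In this short check the figure of 108 years with data is derived from the comment above (111 calendar years minus 1910, 1911 and 1914):

```r
total_accidents <- 5783   # printed above as TotalAM
span_years      <- 111    # calendar years 1908-2018
years_with_data <- 108    # 111 minus the years 1910, 1911 and 1914

total_accidents / (span_years * 12)       # 4.341592, as meanTotal and meanAM2
total_accidents / (years_with_data * 12)  # 4.462191, as meanAY2
```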
ACCIDENTS per MONTH
In conclusion, most accidents in the dataset occurred in December and January, perhaps because the frequency of flights increases during these months.
Moreover, from August until the end of the year the monthly count reaches or exceeds 500, which is also noticeable. This is understandable for August, since many people travel in summer, which could likewise increase the frequency of flights.
All in all, in the light of these results winter shows the most accidents, followed by August, in which the number of accidents is not low either, while spring shows the fewest. Because these are raw counts and we have no data on the number of flights per month, they suggest, but cannot confirm, that winter is the most dangerous season to travel and spring the safest.
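One way to check whether the monthly differences are larger than chance alone would produce is a chi-square goodness-of-fit test against the hypothesis that accidents are spread evenly over the days of the year. The sketch below is an illustration, not part of the original analysis: it uses the monthly counts from the table above and assumes an average February of 28.25 days. It still says nothing about risk per flight, since the number of flights is unknown.

```r
# Monthly accident counts, copied from the table above (January to December).
accidents <- c(536, 432, 491, 415, 402, 419, 475, 525, 508, 500, 508, 572)

# Expected share of each month if accidents were spread evenly over the days
# of the year (February taken as 28.25 days on average; an assumption).
days <- c(31, 28.25, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

fit <- chisq.test(accidents, p = days / sum(days))
fit$statistic  # chi-square statistic, 11 degrees of freedom
fit$p.value    # a small value means the months differ more than chance would explain
```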
ACCIDENTS per YEAR
This second graphic confirms what we had expected.
The number of accidents in the first years of the dataset is low. This may be due not only to the very low frequency of flights at that time, but also to the fact that not all accidents may have been recorded.
In addition, in recent years the number of accidents has decreased noticeably, probably as an effect of the incorporation of advanced technology in flights and in the aerospace sector in general.
E13=PC %>% group_by(operator) %>% summarise(accidents=n())%>%
arrange(-accidents)
head(E13,10) %>% kable() %>%
kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
column_spec(1,bold=T,border_right = T,background = "yellow")
| operator | accidents |
|---|---|
| Aeroflot | 260 |
| Military - U.S. Air Force | 177 |
| Air France | 72 |
| Deutsche Lufthansa | 64 |
| United Air Lines | 44 |
| China National Aviation Corporation | 43 |
| Military - U.S. Army Air Forces | 43 |
| Pan American World Airways | 41 |
| American Airlines | 37 |
| Military - Royal Air Force | 36 |
ggplot(E13[c(1:5),],aes(x=reorder(operator, -accidents), y=accidents))+geom_bar(stat="identity")
ggplot(E13[c(1:50),],aes(x=reorder(operator, -accidents), y=accidents))+geom_bar(stat="identity")
ggplot(E13 ,aes(x=reorder(operator, -accidents), y=accidents))+geom_bar(stat="identity")
Aeroflot, the main national operator in Russia, is the company with the highest number of accidents on record, followed by the Military - U.S. Air Force. The third operator is Air France, followed by Deutsche Lufthansa, and in fifth position we find United Air Lines, a North American operator.
The second graph shows the 50 operators with the highest number of accidents.
The third graph shows all the operators in the dataset. As we can see, the majority of them have 1 or 2 accidents, which may be accurate or may reflect incomplete information about these operators in the dataset.
The first (top 5) and second (top 50) graphs are more useful for a study; the third one (all the operators) is too crowded to be read accurately.
E14=PC %>% group_by(operator) %>% summarise(total_fatalities=sum(fatalities),mean_fatalities=mean(fatalities), deviance_fatalities=sd(fatalities)) %>% arrange(-1*total_fatalities)
head(E14,10) %>% kable() %>%
kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
column_spec(1,bold=T,border_right = T,background = "red")
| operator | total_fatalities | mean_fatalities | deviance_fatalities |
|---|---|---|---|
| Aeroflot | 9048 | 34.80000 | 34.306117 |
| Military - U.S. Air Force | 3718 | 21.00565 | 23.008027 |
| American Airlines | 1422 | 38.43243 | 64.417708 |
| Pan American World Airways | 1303 | 31.78049 | 49.358136 |
| Military - U.S. Army Air Forces | 1070 | 24.88372 | 9.781443 |
| United Air Lines | 1019 | 23.15909 | 23.634763 |
| AVIANCA | 941 | 39.20833 | 45.066115 |
| Turkish Airlines (THY) | 891 | 63.64286 | 90.620601 |
| Indian Airlines | 861 | 25.32353 | 30.985550 |
| China Airlines (Taiwan) | 847 | 60.50000 | 92.945600 |
ggplot(PC,aes(as.factor(operator), fatalities))+geom_boxplot()
These are the box plots of the fatalities for every operator. Because of the large number of operators in our dataset, the box plots are too crowded to give us relevant information, as we can see.
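A possible remedy is to draw the box plots only for the most frequent operators. The sketch below shows the idea in base R on toy data invented for illustration; `top_levels` is a hypothetical helper, and with the real data the same filter could be applied to `PC` before plotting.

```r
# Toy data, invented for illustration only.
toy <- data.frame(
  operator   = c("A", "A", "A", "B", "B", "C", "D"),
  fatalities = c(10, 25, 40, 5, 60, 12, 3)
)

# Names of the n most frequent values of x.
top_levels <- function(x, n) {
  tab <- sort(table(x), decreasing = TRUE)
  names(tab)[seq_len(min(n, length(tab)))]
}

top <- top_levels(toy$operator, 2)
subset_top <- toy[toy$operator %in% top, ]

# Draw the box plots only when running interactively.
if (interactive()) {
  boxplot(fatalities ~ operator, data = subset_top)
}
```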
E15=PC %>% group_by(aircraft) %>% summarise(total_fatalities=sum(fatalities),mean_fatalities=mean(fatalities), deviance_fatalities=sd(fatalities)) %>% arrange(-1*total_fatalities)
head(E15,10) %>% kable() %>%
kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
column_spec(1,bold=T,border_right = T,background = "red")
| aircraft | total_fatalities | mean_fatalities | deviance_fatalities |
|---|---|---|---|
| Douglas DC-3 | 4793 | 14.055718 | 9.668314 |
| Douglas DC-6B | 1054 | 37.642857 | 29.581293 |
| Antonov AN-26 | 1042 | 28.944444 | 17.456795 |
| Ilyushin IL-18B | 1008 | 67.200000 | 29.854887 |
| McDonnell Douglas DC-9-32 | 951 | 50.052632 | 37.171485 |
| Douglas DC-4 | 937 | 22.853659 | 23.367243 |
| de Havilland Canada DHC-6 Twin Otter 300 | 848 | 9.860465 | 6.832729 |
| Yakovlev YAK-40 | 828 | 22.378378 | 17.007109 |
| Tupolev TU-134A | 808 | 47.529412 | 28.670799 |
| McDonnell Douglas DC-10-10 | 804 | 134.000000 | 143.634258 |
ggplot(PC,aes(as.factor(aircraft), fatalities))+geom_boxplot()
In this exercise we have come to the same conclusion as in the previous one. Because of the large number of different aircraft types in the dataset, the box plots of the fatalities are too crowded to give us relevant information, as we can see in the graph above.
E16T=PC %>% group_by(FxP=fatalities/passangers) %>% summarise(cases=n()) %>% arrange(-1*(FxP))
head(E16T,10) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
column_spec(1,bold=T,border_right = T,background = "red")
| FxP | cases |
|---|---|
| 1.0000000 | 3810 |
| 0.9939024 | 1 |
| 0.9935484 | 1 |
| 0.9934641 | 1 |
| 0.9925926 | 1 |
| 0.9923664 | 1 |
| 0.9911504 | 1 |
| 0.9909910 | 1 |
| 0.9903846 | 1 |
| 0.9902913 | 1 |
E16R=PC %>% group_by(FxP=round(fatalities/passangers, 2)) %>% summarise(cases=n()) %>% arrange(-1*(FxP))
head(E16R,10) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
column_spec(1,bold=T,border_right = T,background = "red")
| FxP | cases |
|---|---|
| 1.00 | 3810 |
| 0.99 | 16 |
| 0.98 | 21 |
| 0.97 | 24 |
| 0.96 | 33 |
| 0.95 | 32 |
| 0.94 | 35 |
| 0.93 | 31 |
| 0.92 | 22 |
| 0.91 | 25 |
ggplot(PC, aes(x=passangers, y=fatalities))+geom_point()+geom_abline(slope=1, linetype='dashed')
In this exercise we have studied the relation between the fatalities and the number of passengers in each of the accidents. Each point in this graph stands for one accident.
The diagonal line (x=y) represents the accidents where the number of fatalities equals the number of passengers (that is, everyone on the flight died). As expected, all of the points lie on or below this line.
E17M=PC %>% group_by(Nmonth) %>% summarise(fatalities=sum(fatalities, na.rm=TRUE))
E17M %>% kable() %>%
kable_styling(bootstrap_options = c("striped","hover","responsive")) %>%
column_spec(1,bold=T,border_right = T,background = "red")
| Nmonth | fatalities |
|---|---|
| 1 | 9146 |
| 2 | 8672 |
| 3 | 9560 |
| 4 | 7739 |
| 5 | 8052 |
| 6 | 8714 |
| 7 | 10650 |
| 8 | 10806 |
| 9 | 10899 |
| 10 | 8864 |
| 11 | 10702 |
| 12 | 11318 |
TotalFM=sum(E17M$fatalities)
meanFM=mean(E17M$fatalities)
devFM=sd(E17M$fatalities)
ggplot(E17M,aes(x=Nmonth, y=fatalities))+geom_bar(stat="identity")+geom_hline(yintercept=meanFM)+geom_hline(yintercept=meanFM+devFM, linetype="dashed")+geom_hline(yintercept=meanFM-devFM, linetype="dashed")
E17Y=PC %>% group_by(year) %>% summarise(fatalities=sum(fatalities, na.rm=TRUE))
head(E17Y,10) %>% kable() %>%
kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
column_spec(1,bold=T,border_right = T,background = "red")
| year | fatalities |
|---|---|
| 1908 | 1 |
| 1909 | 1 |
| 1912 | 5 |
| 1913 | 45 |
| 1915 | 40 |
| 1916 | 108 |
| 1917 | 138 |
| 1918 | 65 |
| 1919 | 20 |
| 1920 | 25 |
TotalFY=sum(E17Y$fatalities)
meanFY=mean(E17Y$fatalities)
devFY=sd(E17Y$fatalities)
ggplot(E17Y,aes(x=year, y=fatalities))+geom_bar(stat="identity")+geom_hline(yintercept=meanFY)+geom_hline(yintercept=meanFY+devFY, linetype="dashed")+geom_hline(yintercept=meanFY-devFY, linetype="dashed")
meanFM2=meanFM/111 #We have data from 111 years (1908-2018)
meanFY2=meanFY/12
meanTotalF=TotalFY/(111*12)
print(TotalFM) #The total of fatalities summed over the months equals the total summed over the years
## [1] 115122
print(meanTotalF)
## [1] 86.42793
print(meanFM2)
## [1] 86.42793
print(meanFY2) #This mean is slightly bigger because the mean of fatalities/year does not count the 3 years (1910, 1911 and 1914) in which we have no accidents
## [1] 88.8287
In these graphics we have studied the total number of fatalities per month and per year.
MONTHLY
As could be expected, given that December is the month with the most accidents, it is also the month with the most fatalities. Surprisingly, from July to September the number of fatalities is also very high relative to the number of accidents in these months, which is not as high as in December. For January, where we have seen that the number of accidents is high, we might expect a number of fatalities similar to December's. Instead, it is considerably lower (9146 against 11318). Therefore, accidents occurring in the second half of the year appear to cause more fatalities than those at the beginning of the year. As for January and December, although many accidents occur in both months, it seems safer to travel in January than in December.
THROUGHOUT THE YEARS
In the second graphic we find a distribution similar to that of the accidents per year computed earlier. The reasons may be the same: possibly missing information in the first years of the dataset, and the positive effects of new technologies in recent years, during which the fatalities decrease significantly.
We will consider that in our exercise all the probabilities are conditional probabilities given that there has been an accident; P(accident)=1.
E21A=PC %>% group_by(year, Nmonth) %>%
summarise(Accidents=n()) %>%
spread(key=Nmonth, value=Accidents)
tail(E21A,10) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
column_spec(1,bold=T,border_right = T,background = "yellow")
| year | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2009 | 4 | 7 | 4 | 6 | 3 | 5 | 4 | 5 | 3 | 3 | 7 | 1 |
| 2010 | 5 | 2 | NA | 4 | 5 | 3 | 4 | 10 | 4 | 5 | 4 | 2 |
| 2011 | 2 | 4 | 1 | 3 | 3 | 1 | 6 | 6 | 9 | 4 | 4 | 3 |
| 2012 | 2 | NA | 2 | 3 | 2 | 5 | NA | 2 | 2 | 2 | 3 | 6 |
| 2013 | 3 | 1 | 4 | 2 | 1 | 1 | 2 | 2 | NA | 5 | 9 | 3 |
| 2014 | 2 | 4 | 3 | 3 | 2 | 1 | 5 | 5 | 2 | 1 | NA | 5 |
| 2015 | 1 | 1 | 2 | 1 | NA | 2 | 2 | 2 | 2 | 4 | 3 | 2 |
| 2016 | 1 | 3 | 4 | 3 | 1 | NA | 2 | 2 | NA | 1 | 2 | 5 |
| 2017 | 1 | 1 | 2 | NA | 3 | 2 | 1 | NA | NA | 1 | 1 | 2 |
| 2018 | 1 | 3 | 4 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | NA |
E21F=PC %>% group_by(year, Nmonth) %>%
summarise(Fatalities=sum(fatalities)) %>%
spread(key=Nmonth, value=Fatalities)
tail(E21F,10) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
column_spec(1,bold=T,border_right = T,background = "red")
| year | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2009 | 23 | 107 | 44 | 65 | 120 | 397 | 226 | 46 | 14 | 11 | 36 | 6 |
| 2010 | 104 | 18 | NA | 114 | 318 | 26 | 173 | 117 | 38 | 51 | 104 | 24 |
| 2011 | 80 | 35 | 9 | 57 | 56 | 47 | 191 | 75 | 146 | 39 | 17 | 12 |
| 2012 | 7 | NA | 14 | 173 | 60 | 187 | NA | 35 | 29 | 18 | 18 | 55 |
| 2013 | 30 | 5 | 28 | 7 | 4 | 20 | 13 | 6 | NA | 92 | 121 | 15 |
| 2014 | 4 | 109 | 246 | 18 | 22 | 49 | 484 | 55 | 14 | 2 | NA | 186 |
| 2015 | 37 | 40 | 160 | 7 | NA | 131 | 12 | 55 | 10 | 249 | 63 | 10 |
| 2016 | 2 | 26 | 94 | 30 | 66 | NA | 45 | 5 | NA | 5 | 75 | 171 |
| 2017 | 4 | 5 | 13 | NA | 6 | 125 | 16 | NA | NA | 4 | 11 | 13 |
| 2018 | 12 | 140 | 100 | 258 | 121 | 10 | 1 | 20 | 1 | 189 | 1 | NA |
E21S=PC %>% group_by(year, Nmonth) %>%
summarise(Survivors=sum(passangers)-sum(fatalities)) %>%
spread(key=Nmonth, value=Survivors)
tail(E21S,10) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
column_spec(1,bold=T,border_right = T,background = "green")
| year | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2009 | 156 | 129 | 1 | 10 | 14 | 1 | 142 | 71 | 5 | 9 | 23 | 1 |
| 2010 | 8 | 0 | NA | 11 | 8 | 0 | 2 | 190 | 37 | 0 | 3 | 167 |
| 2011 | 149 | 6 | 0 | 6 | 0 | 5 | 68 | 7 | 15 | 17 | 3 | 0 |
| 2012 | 3 | NA | 0 | 10 | 6 | 4 | NA | 1 | 4 | 8 | 0 | 72 |
| 2013 | 0 | 40 | 3 | 108 | 0 | 0 | 304 | 2 | NA | 18 | 33 | 8 |
| 2014 | 5 | 4 | 0 | 0 | 1 | 0 | 11 | 11 | 5 | 2 | NA | 10 |
| 2015 | 0 | 18 | 0 | 0 | NA | 0 | 4 | 5 | 4 | 0 | 13 | 4 |
| 2016 | 0 | 80 | 1 | 0 | 0 | NA | 0 | 300 | NA | 0 | 6 | 1 |
| 2017 | 0 | 0 | 0 | NA | 1 | 0 | 0 | NA | NA | 6 | 0 | 24 |
| 2018 | 0 | 4 | 21 | 148 | 1 | 0 | 18 | 0 | 46 | 0 | 127 | NA |
These tables, containing all the accidents, fatalities and survivors classified by month and year, are very useful for computing probabilities. Marginal probabilities are quickly deduced from them, and they are also a useful tool in the calculation of conditional probabilities.
AccidentsDecember <- E12M$accidents[E12M$Nmonth==12]
totalAccidents <- TotalAM
pDecember <- AccidentsDecember/totalAccidents
filtrarJoinedProb <- PC %>% filter(fatalities>50,Nmonth==12)
countJoinedProb <- count(filtrarJoinedProb) #number of accidents in December with more than 50 fatalities
JoinedProb <- countJoinedProb/totalAccidents
conditionalProb <- JoinedProb/pDecember
print(as.numeric(conditionalProb))
## [1] 0.0979021
percentage=(as.numeric(conditionalProb))*100
cat("\nThis conditional probability equals the", percentage,"%")
##
## This conditional probability equals the 9.79021 %
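The computation above applies the definition P(A | B) = P(A and B) / P(B), in which the total number of accidents cancels out. The check below uses the figures printed in this report. The joint count of 56 is not printed anywhere; it is inferred here from 0.0979021 × 572 and should be read as an assumption.

```r
total_accidents <- 5783  # TotalAM
n_december      <- 572   # December row of the monthly table
n_joint         <- 56    # inferred from 0.0979021 * 572, not printed in the report

p_december <- n_december / total_accidents
p_joint    <- n_joint / total_accidents

p_joint / p_december   # conditional probability via the definition
n_joint / n_december   # same value: the total cancels out
```

Note that `filter(fatalities>50, ...)` drops accidents whose fatalities are NA, while the December count in the denominator keeps them, so the result is a proportion of all December records rather than of those with known fatalities.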
#AEROFLOT
E231 <- PC%>%filter(operator=='Aeroflot')%>%group_by(aircraft)%>%summarise(count_Aeroflot=n())%>%arrange(-count_Aeroflot)
head(E231,10) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
column_spec(1,bold=T,border_right = T,background = "orange")
| aircraft | count_Aeroflot |
|---|---|
| Yakovlev YAK-40 | 19 |
| Antonov AN-24 | 13 |
| Ilyushin IL-12 | 13 |
| Ilyushin IL-14P | 11 |
| Tupolev TU-104B | 10 |
| Tupolev TU-134A | 10 |
| Ilyushin IL-18B | 9 |
| Li-2 | 8 |
| Tupolev TU-124 | 8 |
| Antonov An-24B | 6 |
#MILITARY - U.S AIRFORCE
E232 <- PC%>%filter(operator=='Military - U.S. Air Force')%>%group_by(aircraft)%>%summarise(count_Military_US_AirForce=n())%>%arrange(-count_Military_US_AirForce)
head(E232,10) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
column_spec(1,bold=T,border_right = T,background = "orange")
| aircraft | count_Military_US_AirForce |
|---|---|
| Boeing KC-135A | 15 |
| Lockheed C-130E Hercules | 12 |
| Lockheed C-130H Hercules | 6 |
| Lockheed C-130A Hercules | 5 |
| Douglas C-47D | 4 |
| Fairchild C-123K | 4 |
| Lockheed AC-130A Hercules | 4 |
| Lockheed C-130H | 4 |
| Boeing B-29 | 3 |
| Douglas C-124A Globemaster | 3 |
#AIRFRANCE
E233 <- PC%>%filter(operator=='Air France')%>%group_by(aircraft)%>%summarise(count_AirFrance=n())%>%arrange(-count_AirFrance)
head(E233,10) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
column_spec(1,bold=T,border_right = T,background = "orange")
| aircraft | count_AirFrance |
|---|---|
| Dewoitine D-338 | 4 |
| Douglas DC-3 | 4 |
| Boeing B-707-328 | 2 |
| Douglas DC-3D | 2 |
| Douglas DC-4 | 2 |
| Douglas DC-4-1009 | 2 |
| Junkers JU-52/3m | 2 |
| Potez 621 | 2 |
| Aerospatiale BAe Concorde 101 | 1 |
| Airbus A-340 | 1 |
#DEUTSCHE LUFTHANSA
E234 <- PC%>%filter(operator=='Deutsche Lufthansa')%>%group_by(aircraft)%>%summarise(count_DeutscheLufthansa=n())%>%arrange(-count_DeutscheLufthansa)
head(E234,10) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
column_spec(1,bold=T,border_right = T,background = "orange")
| aircraft | count_DeutscheLufthansa |
|---|---|
| Junkers JU-52/3m | 16 |
| Junkers F-13 | 7 |
| Dornier Merkur | 3 |
| Focke-Wulf FW 200 | 3 |
| Fokker FG III | 3 |
| Junkers JU-52 | 3 |
| Douglas DC-3 | 2 |
| Heinkel He-70 | 2 |
| AEGK | 1 |
| Arado V1 | 1 |
#UNITED AIR LINES
E235 <- PC%>%filter(operator=='United Air Lines')%>%group_by(aircraft)%>%summarise(count_UnitedAirLines=n())%>%arrange(-count_UnitedAirLines)
head(E235,10) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
column_spec(1,bold=T,border_right = T,background = "orange")
| aircraft | count_UnitedAirLines |
|---|---|
| Douglas DC-3 | 3 |
| Douglas DC-3A | 3 |
| Douglas DC-6 | 3 |
| Douglas DC-6B | 3 |
| Boeing 247 | 2 |
| Boeing B-727-22 | 2 |
| Boeing B-747-122 | 2 |
| Douglas DC-4 | 2 |
| Douglas DST-A-207A | 2 |
| Vickers Viscount 745D | 2 |
Each of these tables corresponds to one of the five top operators (the five with the most crashes) and lists the aircraft types involved in its accidents, together with the number of accidents for each type.
total_Aeroflot <- sum(E231$count_Aeroflot)
Conditional_Yakovlev<- (E231%>%filter(aircraft=="Yakovlev YAK-40")%>%select(count_Aeroflot))/total_Aeroflot
Conditional_Yakovlev <- as.numeric(Conditional_Yakovlev)
cat("Probability that in these conditions the aircraft is a Yakovlev:",Conditional_Yakovlev,",which is a",round((Conditional_Yakovlev*100),2),"%")
## Probability that in these conditions the aircraft is a Yakovlev: 0.07307692 ,which is a 7.31 %
Conditional_Douglas<- (E231%>%filter(aircraft=="Douglas C-47")%>%select(count_Aeroflot))/total_Aeroflot
Conditional_Douglas <- as.numeric(Conditional_Douglas)
cat("\nProbability that in these conditions the aircraft is a Douglas C-47:",Conditional_Douglas,",which is a",round((Conditional_Douglas*100),2),"%")
##
## Probability that in these conditions the aircraft is a Douglas C-47: 0.007692308 ,which is a 0.77 %
total_Military <- sum(E232$count_Military_US_AirForce)
Conditional_Military<- (E232%>%filter(aircraft=="Boeing KC-135A")%>%select(count_Military_US_AirForce))/total_Military
Conditional_Military <- as.numeric(Conditional_Military)
cat("Probability that in these conditions the aircraft is a Boeing KC-135A:",Conditional_Military,",which is a",round((Conditional_Military*100),2),"%")
## Probability that in these conditions the aircraft is a Boeing KC-135A: 0.08474576 ,which is a 8.47 %
Conditional_Fairchild <- (E232%>%filter(aircraft=="Fairchild C-119C")%>%select(count_Military_US_AirForce))/total_Military
Conditional_Fairchild <- as.numeric(Conditional_Fairchild)
cat("\nProbability that in these conditions the aircraft is a Fairchild C-119C:",Conditional_Fairchild,",which is a",round((Conditional_Fairchild*100),2),"%")
##
## Probability that in these conditions the aircraft is a Fairchild C-119C: 0.01694915 ,which is a 1.69 %
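All the conditional probabilities for one operator can also be obtained in a single step with `prop.table()`. This is a minimal sketch on toy records invented for illustration; `aircraft_given_operator` is a hypothetical helper, not a function used in this report.

```r
# Toy records, invented for illustration only.
toy <- data.frame(
  operator = c("Op1", "Op1", "Op1", "Op1", "Op2"),
  aircraft = c("X", "X", "Y", "Z", "X")
)

# P(aircraft | operator): relative frequency of each type within one operator.
aircraft_given_operator <- function(data, op) {
  rows <- !is.na(data$operator) & data$operator == op
  sort(prop.table(table(data$aircraft[rows])), decreasing = TRUE)
}

probs <- aircraft_given_operator(toy, "Op1")
probs
```

Applied to the real data, the entry for "Yakovlev YAK-40" under "Aeroflot" should match the 19/260 computed above.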
SimAccidents1=rpois(12, meanAM)
E31=data.frame(Month=1:12,SimAccidents1,RealAccidents=E12M$accidents, Error=abs(SimAccidents1-E12M$accidents), RelError=abs(SimAccidents1-E12M$accidents)/E12M$accidents)
E31 %>%
kable() %>%
kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
column_spec(1,bold=T,border_right = T,background = "yellow")
| Month | SimAccidents1 | RealAccidents | Error | RelError |
|---|---|---|---|---|
| 1 | 502 | 536 | 34 | 0.0634328 |
| 2 | 460 | 432 | 28 | 0.0648148 |
| 3 | 501 | 491 | 10 | 0.0203666 |
| 4 | 446 | 415 | 31 | 0.0746988 |
| 5 | 485 | 402 | 83 | 0.2064677 |
| 6 | 451 | 419 | 32 | 0.0763723 |
| 7 | 523 | 475 | 48 | 0.1010526 |
| 8 | 457 | 525 | 68 | 0.1295238 |
| 9 | 437 | 508 | 71 | 0.1397638 |
| 10 | 486 | 500 | 14 | 0.0280000 |
| 11 | 478 | 508 | 30 | 0.0590551 |
| 12 | 446 | 572 | 126 | 0.2202797 |
RelErrorSA=mean(E31$RelError)
print(RelErrorSA)
## [1] 0.09865234
Note that this is a simulation run without a fixed random seed, so the average error varies every time we run it.
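To make the error estimate reproducible and less dependent on a single draw, the simulation can be seeded and repeated many times. The sketch below is one possible way to do this, using the monthly counts from the table above; the seed value and the 1000 repetitions are arbitrary choices, and `mean_rel_error` is a hypothetical helper.

```r
# Real monthly accident counts, copied from the table above.
real <- c(536, 432, 491, 415, 402, 419, 475, 525, 508, 500, 508, 572)
lambda <- mean(real)

# Mean relative error of one Poisson simulation of the 12 months.
mean_rel_error <- function(real, lambda) {
  sim <- rpois(length(real), lambda)
  mean(abs(sim - real) / real)
}

set.seed(2018)  # arbitrary seed, so that the results can be reproduced
errors <- replicate(1000, mean_rel_error(real, lambda))

mean(errors)                        # average error over the repetitions
quantile(errors, c(0.025, 0.975))   # range covering most simulated errors
```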
E32=data.frame(SimYear=c(1:111))
for (i in 1:12){
E32[i+1]=rpois(111, meanAM2)
}
head(E32,10) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
column_spec(1,bold=T,border_right = T,background = "yellow")
| SimYear | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 6 | 4 | 6 | 5 | 2 | 7 | 2 | 4 | 2 | 7 | 4 |
| 2 | 11 | 4 | 2 | 7 | 2 | 3 | 5 | 2 | 7 | 10 | 8 | 7 |
| 3 | 6 | 4 | 5 | 4 | 3 | 8 | 4 | 4 | 1 | 1 | 5 | 6 |
| 4 | 3 | 5 | 2 | 7 | 5 | 5 | 7 | 10 | 9 | 2 | 1 | 9 |
| 5 | 1 | 6 | 2 | 4 | 0 | 8 | 8 | 7 | 0 | 4 | 6 | 5 |
| 6 | 1 | 9 | 6 | 5 | 0 | 6 | 3 | 6 | 3 | 4 | 4 | 6 |
| 7 | 6 | 6 | 5 | 4 | 6 | 6 | 1 | 4 | 9 | 4 | 6 | 6 |
| 8 | 3 | 4 | 3 | 5 | 3 | 5 | 1 | 1 | 9 | 3 | 4 | 7 |
| 9 | 6 | 5 | 3 | 3 | 6 | 2 | 6 | 4 | 1 | 8 | 5 | 7 |
| 10 | 2 | 7 | 8 | 7 | 4 | 1 | 2 | 6 | 4 | 5 | 2 | 6 |
SimFatalities1=rpois(12, meanFM)
E33=data.frame(Month=1:12,SimFatalities1,RealFatalities=E17M$fatalities, Error=abs(SimFatalities1-E17M$fatalities), RelError=abs(SimFatalities1-E17M$fatalities)/E17M$fatalities)
E33 %>%
kable() %>%
kable_styling(bootstrap_options = c("striped","hover","responsive"))%>%
column_spec(1,bold=T,border_right = T,background = "red")
| Month | SimFatalities1 | RealFatalities | Error | RelError |
|---|---|---|---|---|
| 1 | 9553 | 9146 | 407 | 0.0445003 |
| 2 | 9673 | 8672 | 1001 | 0.1154290 |
| 3 | 9662 | 9560 | 102 | 0.0106695 |
| 4 | 9524 | 7739 | 1785 | 0.2306500 |
| 5 | 9753 | 8052 | 1701 | 0.2112519 |
| 6 | 9686 | 8714 | 972 | 0.1115446 |
| 7 | 9577 | 10650 | 1073 | 0.1007512 |
| 8 | 9584 | 10806 | 1222 | 0.1130853 |
| 9 | 9585 | 10899 | 1314 | 0.1205615 |
| 10 | 9552 | 8864 | 688 | 0.0776173 |
| 11 | 9453 | 10702 | 1249 | 0.1167072 |
| 12 | 9688 | 11318 | 1630 | 0.1440184 |
RelErrorSF=mean(E33$RelError)
print(RelErrorSF)
## [1] 0.1163988
In this exercise too, this is a simulation run without a fixed random seed, so the average error varies every time we run it.
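A complementary check on the Poisson assumption is to compare the spread of the real monthly totals with the spread a Poisson distribution would have, since for a Poisson variable the variance equals the mean. The sketch below uses the monthly fatalities from the table above; reading a ratio far above 1 as a sign of overdispersion is a standard rule of thumb, not a result of the original analysis.

```r
# Real monthly fatalities, copied from the table above.
fatalities <- c(9146, 8672, 9560, 7739, 8052, 8714,
                10650, 10806, 10899, 8864, 10702, 11318)
lambda <- mean(fatalities)

sd(fatalities)            # observed spread between months
sqrt(lambda)              # spread of a Poisson variable with the same mean
var(fatalities) / lambda  # dispersion index: close to 1 for Poisson data
```

This may explain why the simulated values in the table stay within a narrow band around the mean while the real totals range from 7739 to 11318.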
After having studied several graphics and tables built from this dataset of historical plane crashes, we have reached the following conclusions:
To start with, we have observed that the peak of accidents and the peak of fatalities both fall in December. Furthermore, the second half of the year is generally worse, since the accidents that occur then cause more fatalities.
When analysing the accidents per operator and per type of aircraft, especially in the graphics, we have seen that the large number of operators and aircraft types makes these graphics ineffective: we cannot learn much from them unless we restrict them to the most frequent ones (the top 5 operators, as can be seen in exercise 1.3).
In exercise 1.6 we have analysed the relation between the number of passengers on the flights and the fatalities after the accident. The table shows how many flights share the same proportion of fatalities with respect to the number of passengers. In the graphic, the diagonal line (x=y) represents the case fatalities = passengers (everyone dies). As expected, all of the points, which represent the accidents, lie on or below this line.
In the last part of the first chapter of our project, as well as analysing the aforementioned tendency of accidents depending on the month, we have studied the evolution of the number of accidents throughout the 111 years of our dataset.
4.1. On the one hand, from the graphic that we have obtained, we conclude that in the first years of our dataset the number of accidents is low for two reasons. The first one is the number of flights in those years, which was undoubtedly lower than in recent decades: there were not as many aeroplanes and flying abroad was not as frequent as it is today. The second reason is that so many years ago some accidents may not have been detected or recorded, so the number of accidents shown for those years might not be fully reliable.
4.2. On the other hand, the incorporation of new technologies in the aerospace sector, and in daily life in general, is probably the cause of the decrease in the number of accidents in the last years of our dataset, as we can see in the graphic.
After having computed all the accidents, fatalities and survivors by month and year, we have analysed several conditional probabilities. We have seen that the probability of more than 50 fatalities in an accident, given that it occurred in December, is close to 10%. Although December is the month with the most accidents and, linked to that, the most fatalities (in total over the month), a look at the filtered table shows that most accidents in December do not have a very high number of fatalities considering the capacity of the aircraft involved, so 10% makes sense as an answer to the question proposed.
We have also analysed the probability of having used a specific aircraft given that the operator is “Aeroflot” or “Military - U.S. Air Force”, using a table that classifies the accidents of each operator by the type of aircraft used.
With a Poisson distribution we have simulated the number of accidents of every month, taking into consideration the 111 years of our dataset. In order to analyse its accuracy, we have calculated the error of every measure, as can be seen in the table, and its average is printed on the screen every time that we run the program. Because it is a simulation, the error varies every time that we run it. Running the program several times we have obtained errors that vary from 0.087 to 0.13, which is not bad.
Using a Poisson distribution too, we have simulated the number of accidents of every month in every year, which can be seen in the table built for this purpose in exercise 3.2.
Finally, we have simulated (with a Poisson distribution again) the number of fatalities of every month (sum of 111 years). To check its accuracy, we have computed the error between this simulation and reality, which is also printed on the screen every time this exercise is run. As can be seen, this last simulation is not inaccurate either.