Introduction

The purpose of this study is to carry out an analysis of air accidents. The dataset was taken from kaggle . A data cleaning was not necessary, the dataframe was ready to use. From time to time, when I reduced the dataframe to make it more manageable and better visualisable, I cleaned the data, mainly by removing empty spaces. I would rename the blanks ‘NA’ and delete them. I displayed the datasets with the ‘Head’ command whenever they were too long so as not to stress the computer too much unnecessarily.

Importing libreries

library ('ggpmisc')
library("ggpubr")
library(tidyverse)
library(readr)
library(lubridate)
library(dplyr)
library(tibble)
library(scales)
library(qpcR)
library(stringr)
library(data.table)
library(ggplot2)

Importing Dataset

dataset<-read.csv("C:/Users/acer/OneDrive/Desktop/data analysis/flight_crashes/Airplane_Crashes_and_Fatalities_Since_1908.csv")

Check the Dataset

View (dataset)

knitr::kable(head(dataset[, 1:6]),"pipe")

index	Date	Time	Location	Operator	Flight..
0	09/17/1908	17:18	Fort Myer, Virginia	Military - U.S. Army
1	07/12/1912	06:30	AtlantiCity, New Jersey	Military - U.S. Navy
2	08/06/1913		Victoria, British Columbia, Canada	Private	-
3	09/09/1913	18:30	Over the North Sea	Military - German Navy
4	10/17/1913	10:30	Near Johannisthal, Germany	Military - German Navy
5	03/05/1915	01:00	Tienen, Belgium	Military - German Navy

Change location column, splitting it in 2: city and state

separate(dataset, col=Location, into=c('City', 'State'), sep=',')
dataset_ <- dataset %>% separate( col=Location, into=c('City', 'State'), sep=',', fill = "left")

knitr::kable(head(dataset_[, 1:6]),"pipe")

index	Date	Time	City	State	Operator
0	09/17/1908	17:18	Fort Myer	Virginia	Military - U.S. Army
1	07/12/1912	06:30	AtlantiCity	New Jersey	Military - U.S. Navy
2	08/06/1913		Victoria	British Columbia	Private
3	09/09/1913	18:30	NA	Over the North Sea	Military - German Navy
4	10/17/1913	10:30	Near Johannisthal	Germany	Military - German Navy
5	03/05/1915	01:00	Tienen	Belgium	Military - German Navy

Change format of date

str(dataset_)

## 'data.frame':    5268 obs. of  15 variables:
##  $ index       : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Date        : chr  "09/17/1908" "07/12/1912" "08/06/1913" "09/09/1913" ...
##  $ Time        : chr  "17:18" "06:30" "" "18:30" ...
##  $ City        : chr  "Fort Myer" "AtlantiCity" "Victoria" NA ...
##  $ State       : chr  " Virginia" " New Jersey" " British Columbia" "Over the North Sea" ...
##  $ Operator    : chr  "Military - U.S. Army" "Military - U.S. Navy" "Private" "Military - German Navy" ...
##  $ Flight..    : chr  "" "" "-" "" ...
##  $ Route       : chr  "Demonstration" "Test flight" "" "" ...
##  $ Type        : chr  "Wright Flyer III" "Dirigible" "Curtiss seaplane" "Zeppelin L-1 (airship)" ...
##  $ Registration: chr  "" "" "" "" ...
##  $ cn.In       : chr  "1" "" "" "" ...
##  $ Aboard      : num  2 5 1 20 30 41 19 20 22 19 ...
##  $ Fatalities  : num  1 5 1 14 30 21 19 20 22 19 ...
##  $ Ground      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Summary     : chr  "During a demonstration flight, a U.S. Army flyer flown by Orville Wright nose-dived into the ground from a heig"| __truncated__ "First U.S. dirigible Akron exploded just offshore at an altitude of 1,000 ft. during a test flight." "The first fatal airplane accident in Canada occurred when American barnstormer, John M. Bryant, California aviator was killed." "The airship flew into a thunderstorm and encountered a severe downdraft crashing 20 miles north of Helgoland Is"| __truncated__ ...

Date_v1 <- mdy (dataset_$Date)
dataset_1 <- cbind (dataset_,Date_v1)
dataset_2 <- dataset_1 %>% dplyr::select (- Date) %>% relocate(Date_v1, .before =Time)
str(dataset_2)

## 'data.frame':    5268 obs. of  15 variables:
##  $ index       : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Date_v1     : Date, format: "1908-09-17" "1912-07-12" ...
##  $ Time        : chr  "17:18" "06:30" "" "18:30" ...
##  $ City        : chr  "Fort Myer" "AtlantiCity" "Victoria" NA ...
##  $ State       : chr  " Virginia" " New Jersey" " British Columbia" "Over the North Sea" ...
##  $ Operator    : chr  "Military - U.S. Army" "Military - U.S. Navy" "Private" "Military - German Navy" ...
##  $ Flight..    : chr  "" "" "-" "" ...
##  $ Route       : chr  "Demonstration" "Test flight" "" "" ...
##  $ Type        : chr  "Wright Flyer III" "Dirigible" "Curtiss seaplane" "Zeppelin L-1 (airship)" ...
##  $ Registration: chr  "" "" "" "" ...
##  $ cn.In       : chr  "1" "" "" "" ...
##  $ Aboard      : num  2 5 1 20 30 41 19 20 22 19 ...
##  $ Fatalities  : num  1 5 1 14 30 21 19 20 22 19 ...
##  $ Ground      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Summary     : chr  "During a demonstration flight, a U.S. Army flyer flown by Orville Wright nose-dived into the ground from a heig"| __truncated__ "First U.S. dirigible Akron exploded just offshore at an altitude of 1,000 ft. during a test flight." "The first fatal airplane accident in Canada occurred when American barnstormer, John M. Bryant, California aviator was killed." "The airship flew into a thunderstorm and encountered a severe downdraft crashing 20 miles north of Helgoland Is"| __truncated__ ...

Analysis

Yearly crashes

yearly_crashes <- data.frame(table(as.numeric(format(dataset_2$Date_v1,'%Y'))))
yearly_crashes <- yearly_crashes %>%
  rename(year = Var1, num_of_crashes = Freq) %>%
  arrange(-num_of_crashes)

knitr::kable(yearly_crashes,"pipe")

year	num_of_crashes
1972	104
1968	96
1989	95
1967	91
1973	89
1979	89
1991	88
1970	87
1994	87
1976	86
1992	86
1988	83
1969	82
1974	82
1977	81
1996	81
1946	80
1995	79
1999	78
1978	77
1947	76
2000	76
1975	75
2002	75
1985	74
1987	74
1962	73
1990	72
1982	70
2001	70
1945	69
1951	69
1964	69
1965	69
1966	69
1998	69
1948	68
1997	68
1971	67
1993	67
1981	66
1980	65
1984	65
1986	64
1950	63
1953	62
1958	62
1960	62
2008	62
1952	61
1983	61
2003	61
2004	61
1949	60
1959	60
1954	59
1963	58
1957	57
2007	54
1955	52
1961	52
1938	51
1956	51
2005	51
2006	49
1944	47
1936	43
1943	39
1928	37
1929	37
1935	36
1931	32
1942	32
1934	30
1932	28
1937	27
1933	26
1939	26
1930	24
2009	24
1927	21
1941	21
1940	18
1920	17
1921	13
1923	12
1926	12
1922	11
1925	11
1924	7
1917	6
1919	6
1916	5
1918	4
1913	3
1915	2
1908	1
1912	1

ggplot(yearly_crashes, aes(x = year, y = num_of_crashes, group = 1)) + 
  geom_line(color = "black") +
  geom_point(color = "black") +
  labs(title = "Crashes Over The Years"
       , y = "Number of Crashes"
       , x = "Years") +
  annotate("rect", xmin=28, xmax= 34, ymin=0, ymax=Inf, alpha=0.2, fill="red")  + 
  scale_x_discrete(breaks = seq(1904, 2014, by = 2) ) +
  theme(text=element_text(size=16, 
                          family="Microsoft New Tai Lue Bold"),
        axis.text.x = element_text(angle = 90, size = 5))

Crashes per operator

crashes_per_operator <- data.frame(table(dataset_2$Operator))
crashes_per_operator <- crashes_per_operator %>% mutate(percentage = Freq/sum(Freq) * 100) %>% rename("num_of_crashes"=Freq, "Operator"= Var1) 
crashes_per_operator <- crashes_per_operator %>% arrange(-percentage)

knitr::kable(head(crashes_per_operator),"pipe")

Operator	num_of_crashes	percentage
Aeroflot	179	3.3978740
Military - U.S. Air Force	176	3.3409263
Air France	70	1.3287775
Deutsche Lufthansa	65	1.2338648
Air Taxi	44	0.8352316
China National Aviation Corporation	44	0.8352316

I’m going to subset it becouse there are too much data and it will be difficoul to plot it correctly. For this reason I’m going to subset it in a dataframe with data of operator wich have at least 25 crashes

crashes_per_operator25 <- crashes_per_operator %>% subset(num_of_crashes >24)

knitr::kable(head (crashes_per_operator25),"pipe")

Operator	num_of_crashes	percentage
Aeroflot	179	3.3978740
Military - U.S. Air Force	176	3.3409263
Air France	70	1.3287775
Deutsche Lufthansa	65	1.2338648
Air Taxi	44	0.8352316
China National Aviation Corporation	44	0.8352316

ggplot(crashes_per_operator25, aes (x = reorder (Operator, -num_of_crashes), y=num_of_crashes)) + 
  geom_col(aes(fill = Operator), width=0.4, position = position_dodge(width=0.5)) + 
  theme(axis.text.x= element_text(angle = 90, size = 7)) +
  labs(title = "Number of Crashes per Operator"
       , y = "Number of Crashes"
       , x = "Operator") +
  geom_text(aes(label = signif(num_of_crashes)), nudge_y = 5, size = 2) +
   theme(legend.text=element_text(size=5))

Crashes per state

Here I create a new dataframe, showing a crash per state dataframe. Again the dataframe it’s too big, so I’m going to reduce it as before.

state_crashes <- data.frame(table(dataset_2$State))
state_crashes <- state_crashes %>%
  rename(state = Var1, num_of_crashes = Freq) %>%
  arrange(-num_of_crashes)

state_crashes_70 <- state_crashes %>% subset(num_of_crashes > 70)

knitr::kable( state_crashes_70,"pipe")

state	num_of_crashes
Brazil	175
Alaska	173
Russia	166
Colombia	143
California	138
France	121
India	93
England	85
Mexico	80
Indonesia	79
China	78
Germany	78
Italy	71

ggplot(state_crashes_70, aes (x = reorder (state, -num_of_crashes), y=num_of_crashes)) + 
  geom_col(aes(fill = state), width=0.4, position = position_dodge(width=0.5)) + 
  theme(axis.text.x= element_text(angle = 90)) +
  labs(title = "Number of Crashes per State"
       , y = "Number of Crashes"
       , x = "State") + 
  geom_text(aes(label = signif(num_of_crashes)), nudge_y = 5)

Crashes per Type

Now, I create a new dataframe according to the type of airplane

dataset_types <- data.frame(table(dataset_2$Type))%>%
  rename(type = Var1, num_of_crashes = Freq) %>%
  arrange (-num_of_crashes)

knitr::kable(head (dataset_types),"pipe")

type	num_of_crashes
Douglas DC-3	334
de Havilland Canada DHC-6 Twin Otter 300	81
Douglas C-47A	74
Douglas C-47	62
Douglas DC-4	40
Yakovlev YAK-40	37

Again the dataframe is too big so I’m going to subset it, additionally the are a lot of blank values.I will only keep the data of aircraft types that have had at least 20 accidents.

dataset_types[ dataset_types=="" ] <- NA
dataset_types <- dataset_types %>% na.omit(dataset_types)
View (dataset_types)
dataset_type_20 <-dataset_types %>% subset (num_of_crashes>20)

knitr::kable(dataset_type_20,"pipe")

	type	num_of_crashes
1	Douglas DC-3	334
2	de Havilland Canada DHC-6 Twin Otter 300	81
3	Douglas C-47A	74
4	Douglas C-47	62
5	Douglas DC-4	40
6	Yakovlev YAK-40	37
7	Antonov AN-26	36
8	Junkers JU-52/3m	32
9	Douglas C-47B	29
10	De Havilland DH-4	28
12	Douglas DC-6B	27
13	Breguet 14	23
14	Curtiss C-46A	21

ggplot(dataset_type_20, aes (x = num_of_crashes, y=reorder (type, -num_of_crashes))) + 
  geom_col(aes(fill = type), width=0.4, position = position_dodge(width=0.2)) + 
  theme(axis.text.x= element_text( size = 8)) +
  theme(axis.text.y= element_text( size = 8)) +
  labs(title = "Number of Crashes per Type"
       , y = "Number of Crashes"
       , x = "Type") + 
  geom_text(aes(label = signif(num_of_crashes)), nudge_x = 8, size = 3) +
guides (theme(legend.text=element_text(size=1)))

Fatality

Let’s analyse the fatality. This time I didn’t subset the dataframe. I have just used the orginial one just using the “date” column and fatality” column

library(scales)

ggplot(data =dataset_2, aes(x = Date_v1, y = Fatalities))+ 
  geom_col(color = "#00B8E5", size = 0.5) +
  annotate(geom = "text", x = as.Date("1977-03-27"), y = 600, label = "Canary Island (583)", size = 2)+
  annotate(geom = "text", x = as.Date("1985-08-12"), y = 540, label = "Japan (520)", size = 2) + 
  annotate(geom = "text", x = as.Date("1996-11-12"), y = 360, label = "India (349)", size = 2) +
  geom_rect(aes(xmin = as.Date("1939-09-01"),xmax = as.Date("1945-09-02"),ymin = 0, ymax = Inf),
            fill="#00BDD0", 
            alpha = 0.002) + 
  scale_fill_brewer(palette = 'Dark2', name = 'Year')+
  theme_bw()+
  annotate(geom = "text", x = as.Date("1942-11-01"), y = 360, label = "WW2", size = 3)+
  labs (title = "Fatality per year", x = "Year", y= "Deaths") + 
  stat_smooth(fill="blue", colour="blue", size=1, alpha = 0.2, method = "gam")

Crashes per hours

Here I created a new dataset: crashes per hours. there were a lot of blank values and some wrong values. i corrected it and i deleted the balnk values.

dataset_hours <- data.frame(table(dataset_2$Time))%>%
  rename(Hour = Var1, num_of_crashes = Freq) %>%
  arrange (-num_of_crashes)
dataset_hours<-separate(dataset_hours, col=Hour, into=c('Hour', 'minutes'), sep=':')
dataset_hours24<-dataset_hours%>% group_by(Hour)
dataset_hours24<-dataset_hours24 %>%subset( select = c(Hour,num_of_crashes))
dataset_hours244<-dataset_hours24 %>% group_by(Hour) %>% summarise (num_of_crash = sum(num_of_crashes))
dataset_hours25<-dataset_hours244 %>% subset( Hour!="0943" & Hour!="1" & Hour!="114" & Hour!="12'20" & Hour!="18.40" & Hour!="2" & Hour!="22'08" & Hour!="c" & Hour!="c16" & Hour!="c14" & Hour!="8" & Hour!="9")
dataset_hours25[ dataset_hours25=="" ] <- NA
dataset_hours251 <- dataset_hours25 %>% na.omit(dataset_hours251)

ggplot(dataset_hours251, aes (x = Hour, y =num_of_crash))+
  geom_col (fill = "darkcyan")+
  labs(title = "Number of Crashes per Hours"
       , y = "Number of Crashes"
       , x = "Hours")

Crashes per route

Finally, crash by route, again i subsetted it (at least 3 crashes in the same route). At the very beginning I delated the NA values.

crashes_by_route <- data.frame(table(dataset_2$Route))
crashes_by_route <- crashes_by_route %>% mutate(percentage = Freq/sum(Freq) * 100) %>% rename("num_of_crashes"=Freq, "Route"= Var1) 
crashes_by_route <- crashes_by_route %>% arrange(-percentage)
crashes_by_route[ crashes_by_route=="" ] <- NA
crashes_by_route <- crashes_by_route %>% na.omit(crashes_by_route)

crashes_by_route3 <- crashes_by_route %>% subset(num_of_crashes > 2)

knitr::kable(crashes_by_route3,"pipe")

	Route	num_of_crashes	percentage
2	Training	81	1.5375854
3	Sightseeing	29	0.5504935
4	Test flight	17	0.3227031
5	Test	6	0.1138952
6	Sao Paulo - Rio de Janeiro	5	0.0949127
7	Bogota - Barranquilla	4	0.0759301
8	Saigon - Paris	4	0.0759301
9	Sao Paulo - Porto Alegre	4	0.0759301
10	Villavicencio - Mitu	4	0.0759301
11	Bangkok - Hong Kong	3	0.0569476
12	Barranquilla - Bogota	3	0.0569476
13	Buenos Aires - Santiago	3	0.0569476
14	Burbank - Oakland	3	0.0569476
15	Cleveland - Chicago	3	0.0569476
16	Davao - Manila	3	0.0569476
17	Demonstration	3	0.0569476
18	Guayaquil - Cuenca	3	0.0569476
19	Guayaquil - Quito	3	0.0569476
20	Huambo - Luanda	3	0.0569476
21	Hue - Da Nang	3	0.0569476
22	Istanbul - Ankara	3	0.0569476
23	Khartoum - Juba	3	0.0569476
24	Lagos - Abuja	3	0.0569476
25	London - Glasgow	3	0.0569476
26	Otis AFB	3	0.0569476
27	Paris - London	3	0.0569476
28	Positioning	3	0.0569476
29	Quito - Cuenca	3	0.0569476
30	Rio de Janeiro - Sao Paulo	3	0.0569476
31	Rome - Athens	3	0.0569476
32	Rome - Cairo	3	0.0569476
33	Santiago de Cuba - Havana	3	0.0569476
34	Villavicencio - Bogota	3	0.0569476

ggplot(crashes_by_route3, aes (y = reorder (Route, num_of_crashes), x =num_of_crashes))+
  geom_col (aes (fill = factor (num_of_crashes)))+
  theme(axis.text.x= element_text(angle = 90, size = 7))+
  labs(title = "Number of Crashes per Route"
       , y = "Route"
       , x = "Number of crashes") +
scale_fill_manual(values = c("#ED8141", "#BB9D00",
                             "#5BB300", "yellow",
                             "#00B8E5", "#CF78FF", "#FF65AE")) +
  guides(fill=guide_legend(title="Number of crashes"))

Conclusion

Some conclusions can be drawn from the analysis presented:

Since the beginning of the Second World War, aircraft accidents have increased, stabilising from 1945 to the present day.

The operators with the most accidents are Aeroflot (Russian) and U.S. Airforce (US). to be fair, looking at the data by ‘nationality’, most of the aircraft accidents come from US operators. This is also shown by the analysis of the type of aircraft that suffered the most accidents: Douglas DC-3 (334 accidents) Only at position number 6 can one find a Russian aircraft (Yakovlev YAK-40)

It is surprising how many collisions take place in south american territory: Brazil, Colombia, Mexico. of course we should not be surprised by the figures for North America and Russia. The routes also show an interesting figure: without considering the first three routes that are Training, Sightseeing, and Test flights that are ‘neutral’ routes, the sixth and seventh are South American routes. Sao Paulo - Rio and Bogota - Barranquilla

As far as fatality is concerned, the graph shows that it has increased over time. One can see the three peaks: the disaster in the Canary Islands, Japan and India. I have decided to highlight the period relating to the Second World War here as well.

Suggestions for future analysis

I believe that improvements can be made. In particular: - Re-do the calculations without calculating incidents in wars. - Re-aggregate the data on types: combine aircraft types that belong to the same category. - It would also be useful to perform an analysis without taking into account accidents occurring in Training processes.

Airplane Crashes Analysis with R

Matteo Pierangeli

2022-10-12