Introduction

The purpose of this study is to carry out an analysis of air accidents. The dataset was taken from kaggle . A data cleaning was not necessary, the dataframe was ready to use. From time to time, when I reduced the dataframe to make it more manageable and better visualisable, I cleaned the data, mainly by removing empty spaces. I would rename the blanks ‘NA’ and delete them. I displayed the datasets with the ‘Head’ command whenever they were too long so as not to stress the computer too much unnecessarily.

Importing libreries

library ('ggpmisc')
library("ggpubr")
library(tidyverse)
library(readr)
library(lubridate)
library(dplyr)
library(tibble)
library(scales)
library(qpcR)
library(stringr)
library(data.table)
library(ggplot2)

Importing Dataset

dataset<-read.csv("C:/Users/acer/OneDrive/Desktop/data analysis/flight_crashes/Airplane_Crashes_and_Fatalities_Since_1908.csv")

Check the Dataset

View (dataset)
knitr::kable(head(dataset[, 1:6]),"pipe")
index Date Time Location Operator Flight..
0 09/17/1908 17:18 Fort Myer, Virginia Military - U.S. Army
1 07/12/1912 06:30 AtlantiCity, New Jersey Military - U.S. Navy
2 08/06/1913 Victoria, British Columbia, Canada Private -
3 09/09/1913 18:30 Over the North Sea Military - German Navy
4 10/17/1913 10:30 Near Johannisthal, Germany Military - German Navy
5 03/05/1915 01:00 Tienen, Belgium Military - German Navy

Change location column, splitting it in 2: city and state

separate(dataset, col=Location, into=c('City', 'State'), sep=',')
dataset_ <- dataset %>% separate( col=Location, into=c('City', 'State'), sep=',', fill = "left")
knitr::kable(head(dataset_[, 1:6]),"pipe")
index Date Time City State Operator
0 09/17/1908 17:18 Fort Myer Virginia Military - U.S. Army
1 07/12/1912 06:30 AtlantiCity New Jersey Military - U.S. Navy
2 08/06/1913 Victoria British Columbia Private
3 09/09/1913 18:30 NA Over the North Sea Military - German Navy
4 10/17/1913 10:30 Near Johannisthal Germany Military - German Navy
5 03/05/1915 01:00 Tienen Belgium Military - German Navy

Change format of date

str(dataset_)
## 'data.frame':    5268 obs. of  15 variables:
##  $ index       : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Date        : chr  "09/17/1908" "07/12/1912" "08/06/1913" "09/09/1913" ...
##  $ Time        : chr  "17:18" "06:30" "" "18:30" ...
##  $ City        : chr  "Fort Myer" "AtlantiCity" "Victoria" NA ...
##  $ State       : chr  " Virginia" " New Jersey" " British Columbia" "Over the North Sea" ...
##  $ Operator    : chr  "Military - U.S. Army" "Military - U.S. Navy" "Private" "Military - German Navy" ...
##  $ Flight..    : chr  "" "" "-" "" ...
##  $ Route       : chr  "Demonstration" "Test flight" "" "" ...
##  $ Type        : chr  "Wright Flyer III" "Dirigible" "Curtiss seaplane" "Zeppelin L-1 (airship)" ...
##  $ Registration: chr  "" "" "" "" ...
##  $ cn.In       : chr  "1" "" "" "" ...
##  $ Aboard      : num  2 5 1 20 30 41 19 20 22 19 ...
##  $ Fatalities  : num  1 5 1 14 30 21 19 20 22 19 ...
##  $ Ground      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Summary     : chr  "During a demonstration flight, a U.S. Army flyer flown by Orville Wright nose-dived into the ground from a heig"| __truncated__ "First U.S. dirigible Akron exploded just offshore at an altitude of 1,000 ft. during a test flight." "The first fatal airplane accident in Canada occurred when American barnstormer, John M. Bryant, California aviator was killed." "The airship flew into a thunderstorm and encountered a severe downdraft crashing 20 miles north of Helgoland Is"| __truncated__ ...
Date_v1 <- mdy (dataset_$Date)
dataset_1 <- cbind (dataset_,Date_v1)
dataset_2 <- dataset_1 %>% dplyr::select (- Date) %>% relocate(Date_v1, .before =Time)
str(dataset_2)
## 'data.frame':    5268 obs. of  15 variables:
##  $ index       : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Date_v1     : Date, format: "1908-09-17" "1912-07-12" ...
##  $ Time        : chr  "17:18" "06:30" "" "18:30" ...
##  $ City        : chr  "Fort Myer" "AtlantiCity" "Victoria" NA ...
##  $ State       : chr  " Virginia" " New Jersey" " British Columbia" "Over the North Sea" ...
##  $ Operator    : chr  "Military - U.S. Army" "Military - U.S. Navy" "Private" "Military - German Navy" ...
##  $ Flight..    : chr  "" "" "-" "" ...
##  $ Route       : chr  "Demonstration" "Test flight" "" "" ...
##  $ Type        : chr  "Wright Flyer III" "Dirigible" "Curtiss seaplane" "Zeppelin L-1 (airship)" ...
##  $ Registration: chr  "" "" "" "" ...
##  $ cn.In       : chr  "1" "" "" "" ...
##  $ Aboard      : num  2 5 1 20 30 41 19 20 22 19 ...
##  $ Fatalities  : num  1 5 1 14 30 21 19 20 22 19 ...
##  $ Ground      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Summary     : chr  "During a demonstration flight, a U.S. Army flyer flown by Orville Wright nose-dived into the ground from a heig"| __truncated__ "First U.S. dirigible Akron exploded just offshore at an altitude of 1,000 ft. during a test flight." "The first fatal airplane accident in Canada occurred when American barnstormer, John M. Bryant, California aviator was killed." "The airship flew into a thunderstorm and encountered a severe downdraft crashing 20 miles north of Helgoland Is"| __truncated__ ...

Analysis

Yearly crashes

yearly_crashes <- data.frame(table(as.numeric(format(dataset_2$Date_v1,'%Y'))))
yearly_crashes <- yearly_crashes %>%
  rename(year = Var1, num_of_crashes = Freq) %>%
  arrange(-num_of_crashes)
knitr::kable(yearly_crashes,"pipe")
year num_of_crashes
1972 104
1968 96
1989 95
1967 91
1973 89
1979 89
1991 88
1970 87
1994 87
1976 86
1992 86
1988 83
1969 82
1974 82
1977 81
1996 81
1946 80
1995 79
1999 78
1978 77
1947 76
2000 76
1975 75
2002 75
1985 74
1987 74
1962 73
1990 72
1982 70
2001 70
1945 69
1951 69
1964 69
1965 69
1966 69
1998 69
1948 68
1997 68
1971 67
1993 67
1981 66
1980 65
1984 65
1986 64
1950 63
1953 62
1958 62
1960 62
2008 62
1952 61
1983 61
2003 61
2004 61
1949 60
1959 60
1954 59
1963 58
1957 57
2007 54
1955 52
1961 52
1938 51
1956 51
2005 51
2006 49
1944 47
1936 43
1943 39
1928 37
1929 37
1935 36
1931 32
1942 32
1934 30
1932 28
1937 27
1933 26
1939 26
1930 24
2009 24
1927 21
1941 21
1940 18
1920 17
1921 13
1923 12
1926 12
1922 11
1925 11
1924 7
1917 6
1919 6
1916 5
1918 4
1913 3
1915 2
1908 1
1912 1
ggplot(yearly_crashes, aes(x = year, y = num_of_crashes, group = 1)) + 
  geom_line(color = "black") +
  geom_point(color = "black") +
  labs(title = "Crashes Over The Years"
       , y = "Number of Crashes"
       , x = "Years") +
  annotate("rect", xmin=28, xmax= 34, ymin=0, ymax=Inf, alpha=0.2, fill="red")  + 
  scale_x_discrete(breaks = seq(1904, 2014, by = 2) ) +
  theme(text=element_text(size=16, 
                          family="Microsoft New Tai Lue Bold"),
        axis.text.x = element_text(angle = 90, size = 5))

Crashes per operator

crashes_per_operator <- data.frame(table(dataset_2$Operator))
crashes_per_operator <- crashes_per_operator %>% mutate(percentage = Freq/sum(Freq) * 100) %>% rename("num_of_crashes"=Freq, "Operator"= Var1) 
crashes_per_operator <- crashes_per_operator %>% arrange(-percentage)
knitr::kable(head(crashes_per_operator),"pipe")
Operator num_of_crashes percentage
Aeroflot 179 3.3978740
Military - U.S. Air Force 176 3.3409263
Air France 70 1.3287775
Deutsche Lufthansa 65 1.2338648
Air Taxi 44 0.8352316
China National Aviation Corporation 44 0.8352316

I’m going to subset it becouse there are too much data and it will be difficoul to plot it correctly. For this reason I’m going to subset it in a dataframe with data of operator wich have at least 25 crashes

crashes_per_operator25 <- crashes_per_operator %>% subset(num_of_crashes >24)
knitr::kable(head (crashes_per_operator25),"pipe")
Operator num_of_crashes percentage
Aeroflot 179 3.3978740
Military - U.S. Air Force 176 3.3409263
Air France 70 1.3287775
Deutsche Lufthansa 65 1.2338648
Air Taxi 44 0.8352316
China National Aviation Corporation 44 0.8352316
ggplot(crashes_per_operator25, aes (x = reorder (Operator, -num_of_crashes), y=num_of_crashes)) + 
  geom_col(aes(fill = Operator), width=0.4, position = position_dodge(width=0.5)) + 
  theme(axis.text.x= element_text(angle = 90, size = 7)) +
  labs(title = "Number of Crashes per Operator"
       , y = "Number of Crashes"
       , x = "Operator") +
  geom_text(aes(label = signif(num_of_crashes)), nudge_y = 5, size = 2) +
   theme(legend.text=element_text(size=5))

Crashes per state

Here I create a new dataframe, showing a crash per state dataframe. Again the dataframe it’s too big, so I’m going to reduce it as before.

state_crashes <- data.frame(table(dataset_2$State))
state_crashes <- state_crashes %>%
  rename(state = Var1, num_of_crashes = Freq) %>%
  arrange(-num_of_crashes)
state_crashes_70 <- state_crashes %>% subset(num_of_crashes > 70)
knitr::kable( state_crashes_70,"pipe")
state num_of_crashes
Brazil 175
Alaska 173
Russia 166
Colombia 143
California 138
France 121
India 93
England 85
Mexico 80
Indonesia 79
China 78
Germany 78
Italy 71
ggplot(state_crashes_70, aes (x = reorder (state, -num_of_crashes), y=num_of_crashes)) + 
  geom_col(aes(fill = state), width=0.4, position = position_dodge(width=0.5)) + 
  theme(axis.text.x= element_text(angle = 90)) +
  labs(title = "Number of Crashes per State"
       , y = "Number of Crashes"
       , x = "State") + 
  geom_text(aes(label = signif(num_of_crashes)), nudge_y = 5)

Crashes per Type

Now, I create a new dataframe according to the type of airplane

dataset_types <- data.frame(table(dataset_2$Type))%>%
  rename(type = Var1, num_of_crashes = Freq) %>%
  arrange (-num_of_crashes)
knitr::kable(head (dataset_types),"pipe")
type num_of_crashes
Douglas DC-3 334
de Havilland Canada DHC-6 Twin Otter 300 81
Douglas C-47A 74
Douglas C-47 62
Douglas DC-4 40
Yakovlev YAK-40 37

Again the dataframe is too big so I’m going to subset it, additionally the are a lot of blank values.I will only keep the data of aircraft types that have had at least 20 accidents.

dataset_types[ dataset_types=="" ] <- NA
dataset_types <- dataset_types %>% na.omit(dataset_types)
View (dataset_types)
dataset_type_20 <-dataset_types %>% subset (num_of_crashes>20)
knitr::kable(dataset_type_20,"pipe")
type num_of_crashes
1 Douglas DC-3 334
2 de Havilland Canada DHC-6 Twin Otter 300 81
3 Douglas C-47A 74
4 Douglas C-47 62
5 Douglas DC-4 40
6 Yakovlev YAK-40 37
7 Antonov AN-26 36
8 Junkers JU-52/3m 32
9 Douglas C-47B 29
10 De Havilland DH-4 28
12 Douglas DC-6B 27
13 Breguet 14 23
14 Curtiss C-46A 21
ggplot(dataset_type_20, aes (x = num_of_crashes, y=reorder (type, -num_of_crashes))) + 
  geom_col(aes(fill = type), width=0.4, position = position_dodge(width=0.2)) + 
  theme(axis.text.x= element_text( size = 8)) +
  theme(axis.text.y= element_text( size = 8)) +
  labs(title = "Number of Crashes per Type"
       , y = "Number of Crashes"
       , x = "Type") + 
  geom_text(aes(label = signif(num_of_crashes)), nudge_x = 8, size = 3) +
guides (theme(legend.text=element_text(size=1)))

Fatality

Let’s analyse the fatality. This time I didn’t subset the dataframe. I have just used the orginial one just using the “date” column and fatality” column

library(scales)

ggplot(data =dataset_2, aes(x = Date_v1, y = Fatalities))+ 
  geom_col(color = "#00B8E5", size = 0.5) +
  annotate(geom = "text", x = as.Date("1977-03-27"), y = 600, label = "Canary Island (583)", size = 2)+
  annotate(geom = "text", x = as.Date("1985-08-12"), y = 540, label = "Japan (520)", size = 2) + 
  annotate(geom = "text", x = as.Date("1996-11-12"), y = 360, label = "India (349)", size = 2) +
  geom_rect(aes(xmin = as.Date("1939-09-01"),xmax = as.Date("1945-09-02"),ymin = 0, ymax = Inf),
            fill="#00BDD0", 
            alpha = 0.002) + 
  scale_fill_brewer(palette = 'Dark2', name = 'Year')+
  theme_bw()+
  annotate(geom = "text", x = as.Date("1942-11-01"), y = 360, label = "WW2", size = 3)+
  labs (title = "Fatality per year", x = "Year", y= "Deaths") + 
  stat_smooth(fill="blue", colour="blue", size=1, alpha = 0.2, method = "gam")

Crashes per hours

Here I created a new dataset: crashes per hours. there were a lot of blank values and some wrong values. i corrected it and i deleted the balnk values.

dataset_hours <- data.frame(table(dataset_2$Time))%>%
  rename(Hour = Var1, num_of_crashes = Freq) %>%
  arrange (-num_of_crashes)
dataset_hours<-separate(dataset_hours, col=Hour, into=c('Hour', 'minutes'), sep=':')
dataset_hours24<-dataset_hours%>% group_by(Hour)
dataset_hours24<-dataset_hours24 %>%subset( select = c(Hour,num_of_crashes))
dataset_hours244<-dataset_hours24 %>% group_by(Hour) %>% summarise (num_of_crash = sum(num_of_crashes))
dataset_hours25<-dataset_hours244 %>% subset( Hour!="0943" & Hour!="1" & Hour!="114" & Hour!="12'20" & Hour!="18.40" & Hour!="2" & Hour!="22'08" & Hour!="c" & Hour!="c16" & Hour!="c14" & Hour!="8" & Hour!="9")
dataset_hours25[ dataset_hours25=="" ] <- NA
dataset_hours251 <- dataset_hours25 %>% na.omit(dataset_hours251)
ggplot(dataset_hours251, aes (x = Hour, y =num_of_crash))+
  geom_col (fill = "darkcyan")+
  labs(title = "Number of Crashes per Hours"
       , y = "Number of Crashes"
       , x = "Hours") 

Crashes per route

Finally, crash by route, again i subsetted it (at least 3 crashes in the same route). At the very beginning I delated the NA values.

crashes_by_route <- data.frame(table(dataset_2$Route))
crashes_by_route <- crashes_by_route %>% mutate(percentage = Freq/sum(Freq) * 100) %>% rename("num_of_crashes"=Freq, "Route"= Var1) 
crashes_by_route <- crashes_by_route %>% arrange(-percentage)
crashes_by_route[ crashes_by_route=="" ] <- NA
crashes_by_route <- crashes_by_route %>% na.omit(crashes_by_route)

crashes_by_route3 <- crashes_by_route %>% subset(num_of_crashes > 2)
knitr::kable(crashes_by_route3,"pipe")
Route num_of_crashes percentage
2 Training 81 1.5375854
3 Sightseeing 29 0.5504935
4 Test flight 17 0.3227031
5 Test 6 0.1138952
6 Sao Paulo - Rio de Janeiro 5 0.0949127
7 Bogota - Barranquilla 4 0.0759301
8 Saigon - Paris 4 0.0759301
9 Sao Paulo - Porto Alegre 4 0.0759301
10 Villavicencio - Mitu 4 0.0759301
11 Bangkok - Hong Kong 3 0.0569476
12 Barranquilla - Bogota 3 0.0569476
13 Buenos Aires - Santiago 3 0.0569476
14 Burbank - Oakland 3 0.0569476
15 Cleveland - Chicago 3 0.0569476
16 Davao - Manila 3 0.0569476
17 Demonstration 3 0.0569476
18 Guayaquil - Cuenca 3 0.0569476
19 Guayaquil - Quito 3 0.0569476
20 Huambo - Luanda 3 0.0569476
21 Hue - Da Nang 3 0.0569476
22 Istanbul - Ankara 3 0.0569476
23 Khartoum - Juba 3 0.0569476
24 Lagos - Abuja 3 0.0569476
25 London - Glasgow 3 0.0569476
26 Otis AFB 3 0.0569476
27 Paris - London 3 0.0569476
28 Positioning 3 0.0569476
29 Quito - Cuenca 3 0.0569476
30 Rio de Janeiro - Sao Paulo 3 0.0569476
31 Rome - Athens 3 0.0569476
32 Rome - Cairo 3 0.0569476
33 Santiago de Cuba - Havana 3 0.0569476
34 Villavicencio - Bogota 3 0.0569476
ggplot(crashes_by_route3, aes (y = reorder (Route, num_of_crashes), x =num_of_crashes))+
  geom_col (aes (fill = factor (num_of_crashes)))+
  theme(axis.text.x= element_text(angle = 90, size = 7))+
  labs(title = "Number of Crashes per Route"
       , y = "Route"
       , x = "Number of crashes") +
scale_fill_manual(values = c("#ED8141", "#BB9D00",
                             "#5BB300", "yellow",
                             "#00B8E5", "#CF78FF", "#FF65AE")) +
  guides(fill=guide_legend(title="Number of crashes"))

Conclusion

Some conclusions can be drawn from the analysis presented:

Since the beginning of the Second World War, aircraft accidents have increased, stabilising from 1945 to the present day.

The operators with the most accidents are Aeroflot (Russian) and U.S. Airforce (US). to be fair, looking at the data by ‘nationality’, most of the aircraft accidents come from US operators. This is also shown by the analysis of the type of aircraft that suffered the most accidents: Douglas DC-3 (334 accidents) Only at position number 6 can one find a Russian aircraft (Yakovlev YAK-40)

It is surprising how many collisions take place in south american territory: Brazil, Colombia, Mexico. of course we should not be surprised by the figures for North America and Russia. The routes also show an interesting figure: without considering the first three routes that are Training, Sightseeing, and Test flights that are ‘neutral’ routes, the sixth and seventh are South American routes. Sao Paulo - Rio and Bogota - Barranquilla

As far as fatality is concerned, the graph shows that it has increased over time. One can see the three peaks: the disaster in the Canary Islands, Japan and India. I have decided to highlight the period relating to the Second World War here as well.

Suggestions for future analysis

I believe that improvements can be made. In particular: - Re-do the calculations without calculating incidents in wars. - Re-aggregate the data on types: combine aircraft types that belong to the same category. - It would also be useful to perform an analysis without taking into account accidents occurring in Training processes.