The purpose of this study is to carry out an analysis of air accidents. The dataset was taken from kaggle . A data cleaning was not necessary, the dataframe was ready to use. From time to time, when I reduced the dataframe to make it more manageable and better visualisable, I cleaned the data, mainly by removing empty spaces. I would rename the blanks ‘NA’ and delete them. I displayed the datasets with the ‘Head’ command whenever they were too long so as not to stress the computer too much unnecessarily.
library ('ggpmisc')
library("ggpubr")
library(tidyverse)
library(readr)
library(lubridate)
library(dplyr)
library(tibble)
library(scales)
library(qpcR)
library(stringr)
library(data.table)
library(ggplot2)
dataset<-read.csv("C:/Users/acer/OneDrive/Desktop/data analysis/flight_crashes/Airplane_Crashes_and_Fatalities_Since_1908.csv")
View (dataset)
knitr::kable(head(dataset[, 1:6]),"pipe")
| index | Date | Time | Location | Operator | Flight.. |
|---|---|---|---|---|---|
| 0 | 09/17/1908 | 17:18 | Fort Myer, Virginia | Military - U.S. Army | |
| 1 | 07/12/1912 | 06:30 | AtlantiCity, New Jersey | Military - U.S. Navy | |
| 2 | 08/06/1913 | Victoria, British Columbia, Canada | Private | - | |
| 3 | 09/09/1913 | 18:30 | Over the North Sea | Military - German Navy | |
| 4 | 10/17/1913 | 10:30 | Near Johannisthal, Germany | Military - German Navy | |
| 5 | 03/05/1915 | 01:00 | Tienen, Belgium | Military - German Navy |
separate(dataset, col=Location, into=c('City', 'State'), sep=',')
dataset_ <- dataset %>% separate( col=Location, into=c('City', 'State'), sep=',', fill = "left")
knitr::kable(head(dataset_[, 1:6]),"pipe")
| index | Date | Time | City | State | Operator |
|---|---|---|---|---|---|
| 0 | 09/17/1908 | 17:18 | Fort Myer | Virginia | Military - U.S. Army |
| 1 | 07/12/1912 | 06:30 | AtlantiCity | New Jersey | Military - U.S. Navy |
| 2 | 08/06/1913 | Victoria | British Columbia | Private | |
| 3 | 09/09/1913 | 18:30 | NA | Over the North Sea | Military - German Navy |
| 4 | 10/17/1913 | 10:30 | Near Johannisthal | Germany | Military - German Navy |
| 5 | 03/05/1915 | 01:00 | Tienen | Belgium | Military - German Navy |
str(dataset_)
## 'data.frame': 5268 obs. of 15 variables:
## $ index : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Date : chr "09/17/1908" "07/12/1912" "08/06/1913" "09/09/1913" ...
## $ Time : chr "17:18" "06:30" "" "18:30" ...
## $ City : chr "Fort Myer" "AtlantiCity" "Victoria" NA ...
## $ State : chr " Virginia" " New Jersey" " British Columbia" "Over the North Sea" ...
## $ Operator : chr "Military - U.S. Army" "Military - U.S. Navy" "Private" "Military - German Navy" ...
## $ Flight.. : chr "" "" "-" "" ...
## $ Route : chr "Demonstration" "Test flight" "" "" ...
## $ Type : chr "Wright Flyer III" "Dirigible" "Curtiss seaplane" "Zeppelin L-1 (airship)" ...
## $ Registration: chr "" "" "" "" ...
## $ cn.In : chr "1" "" "" "" ...
## $ Aboard : num 2 5 1 20 30 41 19 20 22 19 ...
## $ Fatalities : num 1 5 1 14 30 21 19 20 22 19 ...
## $ Ground : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Summary : chr "During a demonstration flight, a U.S. Army flyer flown by Orville Wright nose-dived into the ground from a heig"| __truncated__ "First U.S. dirigible Akron exploded just offshore at an altitude of 1,000 ft. during a test flight." "The first fatal airplane accident in Canada occurred when American barnstormer, John M. Bryant, California aviator was killed." "The airship flew into a thunderstorm and encountered a severe downdraft crashing 20 miles north of Helgoland Is"| __truncated__ ...
Date_v1 <- mdy (dataset_$Date)
dataset_1 <- cbind (dataset_,Date_v1)
dataset_2 <- dataset_1 %>% dplyr::select (- Date) %>% relocate(Date_v1, .before =Time)
str(dataset_2)
## 'data.frame': 5268 obs. of 15 variables:
## $ index : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Date_v1 : Date, format: "1908-09-17" "1912-07-12" ...
## $ Time : chr "17:18" "06:30" "" "18:30" ...
## $ City : chr "Fort Myer" "AtlantiCity" "Victoria" NA ...
## $ State : chr " Virginia" " New Jersey" " British Columbia" "Over the North Sea" ...
## $ Operator : chr "Military - U.S. Army" "Military - U.S. Navy" "Private" "Military - German Navy" ...
## $ Flight.. : chr "" "" "-" "" ...
## $ Route : chr "Demonstration" "Test flight" "" "" ...
## $ Type : chr "Wright Flyer III" "Dirigible" "Curtiss seaplane" "Zeppelin L-1 (airship)" ...
## $ Registration: chr "" "" "" "" ...
## $ cn.In : chr "1" "" "" "" ...
## $ Aboard : num 2 5 1 20 30 41 19 20 22 19 ...
## $ Fatalities : num 1 5 1 14 30 21 19 20 22 19 ...
## $ Ground : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Summary : chr "During a demonstration flight, a U.S. Army flyer flown by Orville Wright nose-dived into the ground from a heig"| __truncated__ "First U.S. dirigible Akron exploded just offshore at an altitude of 1,000 ft. during a test flight." "The first fatal airplane accident in Canada occurred when American barnstormer, John M. Bryant, California aviator was killed." "The airship flew into a thunderstorm and encountered a severe downdraft crashing 20 miles north of Helgoland Is"| __truncated__ ...
yearly_crashes <- data.frame(table(as.numeric(format(dataset_2$Date_v1,'%Y'))))
yearly_crashes <- yearly_crashes %>%
rename(year = Var1, num_of_crashes = Freq) %>%
arrange(-num_of_crashes)
knitr::kable(yearly_crashes,"pipe")
| year | num_of_crashes |
|---|---|
| 1972 | 104 |
| 1968 | 96 |
| 1989 | 95 |
| 1967 | 91 |
| 1973 | 89 |
| 1979 | 89 |
| 1991 | 88 |
| 1970 | 87 |
| 1994 | 87 |
| 1976 | 86 |
| 1992 | 86 |
| 1988 | 83 |
| 1969 | 82 |
| 1974 | 82 |
| 1977 | 81 |
| 1996 | 81 |
| 1946 | 80 |
| 1995 | 79 |
| 1999 | 78 |
| 1978 | 77 |
| 1947 | 76 |
| 2000 | 76 |
| 1975 | 75 |
| 2002 | 75 |
| 1985 | 74 |
| 1987 | 74 |
| 1962 | 73 |
| 1990 | 72 |
| 1982 | 70 |
| 2001 | 70 |
| 1945 | 69 |
| 1951 | 69 |
| 1964 | 69 |
| 1965 | 69 |
| 1966 | 69 |
| 1998 | 69 |
| 1948 | 68 |
| 1997 | 68 |
| 1971 | 67 |
| 1993 | 67 |
| 1981 | 66 |
| 1980 | 65 |
| 1984 | 65 |
| 1986 | 64 |
| 1950 | 63 |
| 1953 | 62 |
| 1958 | 62 |
| 1960 | 62 |
| 2008 | 62 |
| 1952 | 61 |
| 1983 | 61 |
| 2003 | 61 |
| 2004 | 61 |
| 1949 | 60 |
| 1959 | 60 |
| 1954 | 59 |
| 1963 | 58 |
| 1957 | 57 |
| 2007 | 54 |
| 1955 | 52 |
| 1961 | 52 |
| 1938 | 51 |
| 1956 | 51 |
| 2005 | 51 |
| 2006 | 49 |
| 1944 | 47 |
| 1936 | 43 |
| 1943 | 39 |
| 1928 | 37 |
| 1929 | 37 |
| 1935 | 36 |
| 1931 | 32 |
| 1942 | 32 |
| 1934 | 30 |
| 1932 | 28 |
| 1937 | 27 |
| 1933 | 26 |
| 1939 | 26 |
| 1930 | 24 |
| 2009 | 24 |
| 1927 | 21 |
| 1941 | 21 |
| 1940 | 18 |
| 1920 | 17 |
| 1921 | 13 |
| 1923 | 12 |
| 1926 | 12 |
| 1922 | 11 |
| 1925 | 11 |
| 1924 | 7 |
| 1917 | 6 |
| 1919 | 6 |
| 1916 | 5 |
| 1918 | 4 |
| 1913 | 3 |
| 1915 | 2 |
| 1908 | 1 |
| 1912 | 1 |
ggplot(yearly_crashes, aes(x = year, y = num_of_crashes, group = 1)) +
geom_line(color = "black") +
geom_point(color = "black") +
labs(title = "Crashes Over The Years"
, y = "Number of Crashes"
, x = "Years") +
annotate("rect", xmin=28, xmax= 34, ymin=0, ymax=Inf, alpha=0.2, fill="red") +
scale_x_discrete(breaks = seq(1904, 2014, by = 2) ) +
theme(text=element_text(size=16,
family="Microsoft New Tai Lue Bold"),
axis.text.x = element_text(angle = 90, size = 5))
crashes_per_operator <- data.frame(table(dataset_2$Operator))
crashes_per_operator <- crashes_per_operator %>% mutate(percentage = Freq/sum(Freq) * 100) %>% rename("num_of_crashes"=Freq, "Operator"= Var1)
crashes_per_operator <- crashes_per_operator %>% arrange(-percentage)
knitr::kable(head(crashes_per_operator),"pipe")
| Operator | num_of_crashes | percentage |
|---|---|---|
| Aeroflot | 179 | 3.3978740 |
| Military - U.S. Air Force | 176 | 3.3409263 |
| Air France | 70 | 1.3287775 |
| Deutsche Lufthansa | 65 | 1.2338648 |
| Air Taxi | 44 | 0.8352316 |
| China National Aviation Corporation | 44 | 0.8352316 |
I’m going to subset it becouse there are too much data and it will be difficoul to plot it correctly. For this reason I’m going to subset it in a dataframe with data of operator wich have at least 25 crashes
crashes_per_operator25 <- crashes_per_operator %>% subset(num_of_crashes >24)
knitr::kable(head (crashes_per_operator25),"pipe")
| Operator | num_of_crashes | percentage |
|---|---|---|
| Aeroflot | 179 | 3.3978740 |
| Military - U.S. Air Force | 176 | 3.3409263 |
| Air France | 70 | 1.3287775 |
| Deutsche Lufthansa | 65 | 1.2338648 |
| Air Taxi | 44 | 0.8352316 |
| China National Aviation Corporation | 44 | 0.8352316 |
ggplot(crashes_per_operator25, aes (x = reorder (Operator, -num_of_crashes), y=num_of_crashes)) +
geom_col(aes(fill = Operator), width=0.4, position = position_dodge(width=0.5)) +
theme(axis.text.x= element_text(angle = 90, size = 7)) +
labs(title = "Number of Crashes per Operator"
, y = "Number of Crashes"
, x = "Operator") +
geom_text(aes(label = signif(num_of_crashes)), nudge_y = 5, size = 2) +
theme(legend.text=element_text(size=5))
Here I create a new dataframe, showing a crash per state dataframe. Again the dataframe it’s too big, so I’m going to reduce it as before.
state_crashes <- data.frame(table(dataset_2$State))
state_crashes <- state_crashes %>%
rename(state = Var1, num_of_crashes = Freq) %>%
arrange(-num_of_crashes)
state_crashes_70 <- state_crashes %>% subset(num_of_crashes > 70)
knitr::kable( state_crashes_70,"pipe")
| state | num_of_crashes |
|---|---|
| Brazil | 175 |
| Alaska | 173 |
| Russia | 166 |
| Colombia | 143 |
| California | 138 |
| France | 121 |
| India | 93 |
| England | 85 |
| Mexico | 80 |
| Indonesia | 79 |
| China | 78 |
| Germany | 78 |
| Italy | 71 |
ggplot(state_crashes_70, aes (x = reorder (state, -num_of_crashes), y=num_of_crashes)) +
geom_col(aes(fill = state), width=0.4, position = position_dodge(width=0.5)) +
theme(axis.text.x= element_text(angle = 90)) +
labs(title = "Number of Crashes per State"
, y = "Number of Crashes"
, x = "State") +
geom_text(aes(label = signif(num_of_crashes)), nudge_y = 5)
Now, I create a new dataframe according to the type of airplane
dataset_types <- data.frame(table(dataset_2$Type))%>%
rename(type = Var1, num_of_crashes = Freq) %>%
arrange (-num_of_crashes)
knitr::kable(head (dataset_types),"pipe")
| type | num_of_crashes |
|---|---|
| Douglas DC-3 | 334 |
| de Havilland Canada DHC-6 Twin Otter 300 | 81 |
| Douglas C-47A | 74 |
| Douglas C-47 | 62 |
| Douglas DC-4 | 40 |
| Yakovlev YAK-40 | 37 |
Again the dataframe is too big so I’m going to subset it, additionally the are a lot of blank values.I will only keep the data of aircraft types that have had at least 20 accidents.
dataset_types[ dataset_types=="" ] <- NA
dataset_types <- dataset_types %>% na.omit(dataset_types)
View (dataset_types)
dataset_type_20 <-dataset_types %>% subset (num_of_crashes>20)
knitr::kable(dataset_type_20,"pipe")
| type | num_of_crashes | |
|---|---|---|
| 1 | Douglas DC-3 | 334 |
| 2 | de Havilland Canada DHC-6 Twin Otter 300 | 81 |
| 3 | Douglas C-47A | 74 |
| 4 | Douglas C-47 | 62 |
| 5 | Douglas DC-4 | 40 |
| 6 | Yakovlev YAK-40 | 37 |
| 7 | Antonov AN-26 | 36 |
| 8 | Junkers JU-52/3m | 32 |
| 9 | Douglas C-47B | 29 |
| 10 | De Havilland DH-4 | 28 |
| 12 | Douglas DC-6B | 27 |
| 13 | Breguet 14 | 23 |
| 14 | Curtiss C-46A | 21 |
ggplot(dataset_type_20, aes (x = num_of_crashes, y=reorder (type, -num_of_crashes))) +
geom_col(aes(fill = type), width=0.4, position = position_dodge(width=0.2)) +
theme(axis.text.x= element_text( size = 8)) +
theme(axis.text.y= element_text( size = 8)) +
labs(title = "Number of Crashes per Type"
, y = "Number of Crashes"
, x = "Type") +
geom_text(aes(label = signif(num_of_crashes)), nudge_x = 8, size = 3) +
guides (theme(legend.text=element_text(size=1)))
Let’s analyse the fatality. This time I didn’t subset the dataframe. I have just used the orginial one just using the “date” column and fatality” column
library(scales)
ggplot(data =dataset_2, aes(x = Date_v1, y = Fatalities))+
geom_col(color = "#00B8E5", size = 0.5) +
annotate(geom = "text", x = as.Date("1977-03-27"), y = 600, label = "Canary Island (583)", size = 2)+
annotate(geom = "text", x = as.Date("1985-08-12"), y = 540, label = "Japan (520)", size = 2) +
annotate(geom = "text", x = as.Date("1996-11-12"), y = 360, label = "India (349)", size = 2) +
geom_rect(aes(xmin = as.Date("1939-09-01"),xmax = as.Date("1945-09-02"),ymin = 0, ymax = Inf),
fill="#00BDD0",
alpha = 0.002) +
scale_fill_brewer(palette = 'Dark2', name = 'Year')+
theme_bw()+
annotate(geom = "text", x = as.Date("1942-11-01"), y = 360, label = "WW2", size = 3)+
labs (title = "Fatality per year", x = "Year", y= "Deaths") +
stat_smooth(fill="blue", colour="blue", size=1, alpha = 0.2, method = "gam")
Here I created a new dataset: crashes per hours. there were a lot of blank values and some wrong values. i corrected it and i deleted the balnk values.
dataset_hours <- data.frame(table(dataset_2$Time))%>%
rename(Hour = Var1, num_of_crashes = Freq) %>%
arrange (-num_of_crashes)
dataset_hours<-separate(dataset_hours, col=Hour, into=c('Hour', 'minutes'), sep=':')
dataset_hours24<-dataset_hours%>% group_by(Hour)
dataset_hours24<-dataset_hours24 %>%subset( select = c(Hour,num_of_crashes))
dataset_hours244<-dataset_hours24 %>% group_by(Hour) %>% summarise (num_of_crash = sum(num_of_crashes))
dataset_hours25<-dataset_hours244 %>% subset( Hour!="0943" & Hour!="1" & Hour!="114" & Hour!="12'20" & Hour!="18.40" & Hour!="2" & Hour!="22'08" & Hour!="c" & Hour!="c16" & Hour!="c14" & Hour!="8" & Hour!="9")
dataset_hours25[ dataset_hours25=="" ] <- NA
dataset_hours251 <- dataset_hours25 %>% na.omit(dataset_hours251)
ggplot(dataset_hours251, aes (x = Hour, y =num_of_crash))+
geom_col (fill = "darkcyan")+
labs(title = "Number of Crashes per Hours"
, y = "Number of Crashes"
, x = "Hours")
Finally, crash by route, again i subsetted it (at least 3 crashes in the same route). At the very beginning I delated the NA values.
crashes_by_route <- data.frame(table(dataset_2$Route))
crashes_by_route <- crashes_by_route %>% mutate(percentage = Freq/sum(Freq) * 100) %>% rename("num_of_crashes"=Freq, "Route"= Var1)
crashes_by_route <- crashes_by_route %>% arrange(-percentage)
crashes_by_route[ crashes_by_route=="" ] <- NA
crashes_by_route <- crashes_by_route %>% na.omit(crashes_by_route)
crashes_by_route3 <- crashes_by_route %>% subset(num_of_crashes > 2)
knitr::kable(crashes_by_route3,"pipe")
| Route | num_of_crashes | percentage | |
|---|---|---|---|
| 2 | Training | 81 | 1.5375854 |
| 3 | Sightseeing | 29 | 0.5504935 |
| 4 | Test flight | 17 | 0.3227031 |
| 5 | Test | 6 | 0.1138952 |
| 6 | Sao Paulo - Rio de Janeiro | 5 | 0.0949127 |
| 7 | Bogota - Barranquilla | 4 | 0.0759301 |
| 8 | Saigon - Paris | 4 | 0.0759301 |
| 9 | Sao Paulo - Porto Alegre | 4 | 0.0759301 |
| 10 | Villavicencio - Mitu | 4 | 0.0759301 |
| 11 | Bangkok - Hong Kong | 3 | 0.0569476 |
| 12 | Barranquilla - Bogota | 3 | 0.0569476 |
| 13 | Buenos Aires - Santiago | 3 | 0.0569476 |
| 14 | Burbank - Oakland | 3 | 0.0569476 |
| 15 | Cleveland - Chicago | 3 | 0.0569476 |
| 16 | Davao - Manila | 3 | 0.0569476 |
| 17 | Demonstration | 3 | 0.0569476 |
| 18 | Guayaquil - Cuenca | 3 | 0.0569476 |
| 19 | Guayaquil - Quito | 3 | 0.0569476 |
| 20 | Huambo - Luanda | 3 | 0.0569476 |
| 21 | Hue - Da Nang | 3 | 0.0569476 |
| 22 | Istanbul - Ankara | 3 | 0.0569476 |
| 23 | Khartoum - Juba | 3 | 0.0569476 |
| 24 | Lagos - Abuja | 3 | 0.0569476 |
| 25 | London - Glasgow | 3 | 0.0569476 |
| 26 | Otis AFB | 3 | 0.0569476 |
| 27 | Paris - London | 3 | 0.0569476 |
| 28 | Positioning | 3 | 0.0569476 |
| 29 | Quito - Cuenca | 3 | 0.0569476 |
| 30 | Rio de Janeiro - Sao Paulo | 3 | 0.0569476 |
| 31 | Rome - Athens | 3 | 0.0569476 |
| 32 | Rome - Cairo | 3 | 0.0569476 |
| 33 | Santiago de Cuba - Havana | 3 | 0.0569476 |
| 34 | Villavicencio - Bogota | 3 | 0.0569476 |
ggplot(crashes_by_route3, aes (y = reorder (Route, num_of_crashes), x =num_of_crashes))+
geom_col (aes (fill = factor (num_of_crashes)))+
theme(axis.text.x= element_text(angle = 90, size = 7))+
labs(title = "Number of Crashes per Route"
, y = "Route"
, x = "Number of crashes") +
scale_fill_manual(values = c("#ED8141", "#BB9D00",
"#5BB300", "yellow",
"#00B8E5", "#CF78FF", "#FF65AE")) +
guides(fill=guide_legend(title="Number of crashes"))
Some conclusions can be drawn from the analysis presented:
Since the beginning of the Second World War, aircraft accidents have increased, stabilising from 1945 to the present day.
The operators with the most accidents are Aeroflot (Russian) and U.S. Airforce (US). to be fair, looking at the data by ‘nationality’, most of the aircraft accidents come from US operators. This is also shown by the analysis of the type of aircraft that suffered the most accidents: Douglas DC-3 (334 accidents) Only at position number 6 can one find a Russian aircraft (Yakovlev YAK-40)
It is surprising how many collisions take place in south american territory: Brazil, Colombia, Mexico. of course we should not be surprised by the figures for North America and Russia. The routes also show an interesting figure: without considering the first three routes that are Training, Sightseeing, and Test flights that are ‘neutral’ routes, the sixth and seventh are South American routes. Sao Paulo - Rio and Bogota - Barranquilla
As far as fatality is concerned, the graph shows that it has increased over time. One can see the three peaks: the disaster in the Canary Islands, Japan and India. I have decided to highlight the period relating to the Second World War here as well.
I believe that improvements can be made. In particular: - Re-do the calculations without calculating incidents in wars. - Re-aggregate the data on types: combine aircraft types that belong to the same category. - It would also be useful to perform an analysis without taking into account accidents occurring in Training processes.