Introduction

While flying is the safest way to travel, aviation accidents still occur, whether due to human error, mechanical issues, or environmental factors. When people hear about a plane accident, they may picture a deadly crash, but as the findings of this report show, most accidents are non-fatal. This report analyzes aviation accidents, focusing on trends by month, year, location, aircraft category, and injury severity. Several visualizations, including a bar chart, pie charts, a heat map, a line plot, and an interactive map, are used to identify these trends and to consider how they relate to aviation safety standards and technological advancements.

Dataset

The dataset covers aviation accidents worldwide from 1948 to 2022 and contains 88,889 records and 31 variables, of which I used seven: ‘Event.Date’, ‘Country’, ‘Latitude’, ‘Longitude’, ‘Injury.Severity’, ‘Aircraft.Category’, and ‘Broad.phase.of.flight’. Many of the variables have missing values. The most recent record in the dataset is dated 12/29/2022.

Findings

Which Phase of a Flight Do Accidents Mainly Occur In?

This bar chart shows the number of accidents by phase of flight. Although a flight has many phases, most accidents occurred during landing, takeoff, and cruise, each of which accounted for more than 10,000 accidents. Together, these three phases represent 61.8% of the recorded accidents. Maneuvering and approach also contributed significantly, followed by a sharp drop for climb, taxi, and go-around. Landing and takeoff, the shortest phases of a flight, make up a disproportionate 45.2% of total accidents, which shows how critical these phases are.

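# Packages used by the code in this report
library(dplyr)      # %>%, filter(), count(), mutate()
library(ggplot2)    # plotting
library(scales)     # comma()
library(lubridate)  # ymd(), year()
library(leaflet)    # interactive map
library(ggrepel)    # geom_label_repel()

# `df` is assumed to already hold the accident dataset as a data frame.
# Count accidents per phase, dropping blank phases, sorted from most to fewest
phasecount <- df %>%
  filter(Broad.phase.of.flight != "") %>%
  count(Broad.phase.of.flight) %>%
  arrange(desc(n)) %>%
  data.frame()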

ggplot(phasecount, aes(x = reorder(Broad.phase.of.flight, -n), y = n)) +
  geom_bar(colour="black", fill="darkgreen", stat="identity") +
  geom_text(aes(label = comma(n)), vjust=-0.5, family="Times New Roman", size=6) +
  labs(title = "Number of Accidents by Phase of Flight", x = "Phase of Flight", y = "Number of Accidents") +
  scale_y_continuous(labels=comma) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face="bold", size=20), 
        text = element_text(family="Times New Roman"),
        axis.text.x = element_text(size = 14),  
        axis.text.y = element_text(size = 14),
        axis.title.x = element_text(size = 16, face="bold"), 
        axis.title.y = element_text(size = 16, face="bold"))
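
The percentages quoted for this chart can be derived from a count table shaped like `phasecount`. The following is a minimal sketch rather than the report's own calculation: the counts in `toy_phasecount` are made up for illustration, and it assumes each share is taken over the rows that have a non-blank phase.

```r
library(dplyr)

# Toy counts for illustration only; these are NOT the report's figures.
toy_phasecount <- data.frame(
  Broad.phase.of.flight = c("Landing", "Takeoff", "Cruise", "Taxi"),
  n = c(50, 30, 15, 5)
)

# Share of each phase and the running (cumulative) share, largest first
phase_share <- toy_phasecount %>%
  arrange(desc(n)) %>%
  mutate(percent = round(100 * n / sum(n), 1),
         cumulative_percent = round(100 * cumsum(n) / sum(n), 1))

print(phase_share)
```

Applied to the real `phasecount`, the `cumulative_percent` value in the third row would give the combined share of the three largest phases.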


Fatal and Non-Fatal Accidents Worldwide in 2002

This interactive map shows fatal and non-fatal accidents worldwide in 2002. The dataset contains more accidents for that year, but only those with recorded latitude and longitude appear on the map, and the majority are located in the United States. Most of the accidents shown occurred on or over land.

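# Keep 2002 accidents that have coordinates and a known fatal/non-fatal severity
global_accidents <- df %>%
  select(Latitude, Longitude, Injury.Severity, Event.Date) %>%
  mutate(year = year(ymd(Event.Date)),
         # Strip any parenthesized suffix so the labels compare exactly
         # (same pattern as the cleaning step used for the line plot)
         severity = gsub("\\s*\\([^\\)]+\\)", "", as.character(Injury.Severity))) %>%
  filter(!is.na(Latitude),
         !is.na(Longitude),
         Latitude != "",
         Longitude != "",
         !is.na(severity),
         !severity %in% c("", "Unknown", "Unavailable", "Incident"),
         year == 2002) %>%
  data.frame()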

global_accidents$Latitude <- as.numeric(global_accidents$Latitude)
global_accidents$Longitude <- as.numeric(global_accidents$Longitude)


# Split once so each layer reuses the same subset
fatal <- subset(global_accidents, severity == "Fatal")
non_fatal <- subset(global_accidents, severity == "Non-Fatal")

map <- leaflet() %>%
  addTiles() %>%
  setView(lat = 37.0902, lng = -95.7129, zoom = 3.5) %>%
  addCircles(lat = fatal$Latitude,
             lng = fatal$Longitude,
             opacity = 1,  # opacity ranges from 0 to 1
             color = "red",
             popup = "Fatal") %>%
  addCircles(lat = non_fatal$Latitude,
             lng = non_fatal$Longitude,
             opacity = 1,
             color = "darkgreen",
             popup = "Non-Fatal")
map
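
The map selects its points with exact comparisons such as `severity == "Fatal"`, so a label carrying any extra text would be dropped silently. The cleaning step this report uses for the line plot strips a parenthesized suffix with `gsub`. The sketch below shows what that pattern does; the raw labels are hypothetical, and the `(n)` suffix format is an assumption about how the raw values look.

```r
# Hypothetical raw labels; the "(n)" suffix format is an assumption.
raw_severity <- c("Fatal(2)", "Fatal (14)", "Non-Fatal", "Incident")

# Same pattern as the report's cleaning step: optional whitespace
# followed by a parenthesized group, replaced with nothing.
clean_severity <- gsub("\\s*\\([^\\)]+\\)", "", raw_severity)

print(clean_severity)

# Only the cleaned labels match the exact comparison used for the map
print(sum(raw_severity == "Fatal"))
print(sum(clean_severity == "Fatal"))
```

If the raw column does contain such suffixes, the cleaning has to run before the comparison, or the fatal layer of the map would come out empty.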


What Season has the Most Accidents?

This heat map shows the distribution of accidents by month from 2000 onward. Accidents tend to be more frequent in the summer months, specifically July and August, while the winter months show significantly fewer accidents. Over the years, the number of accidents has declined steadily, which is consistent with advances in technology and improvements in aviation safety standards.

months_df <- df %>%
  select(Event.Date) %>%
  mutate(months = months(ymd(Event.Date), abbreviate = TRUE),
         year = year(ymd(Event.Date))) %>%
  filter(year >= 2000) %>%
  group_by(year, months) %>%
  summarise(n = length(Event.Date), .groups='keep') %>%
  data.frame()

mylevels <- c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')
months_df$months <- factor(months_df$months, levels = mylevels)

breaks <- c(seq(0, max(months_df$n), by=50))

ggplot(months_df, aes(x = year, y = months, fill=n)) +
  geom_tile(color="white") + 
  geom_text(aes(label=comma(n)), family="Times New Roman") +
  coord_equal(ratio=1.5) +
  labs(title="Heatmap: Accidents by Months of the Year (2000-2022)",
       x = "Year",
       y = "Month",
       fill = "Accident Count") +
  theme_minimal() +
  theme(plot.title=element_text(hjust = 0.5, family="Times New Roman", size=20, face="bold"),
        text = element_text(family="Times New Roman"),
        axis.text.x = element_text(size = 14),  
        axis.text.y = element_text(size = 14),
        axis.title.x = element_text(size = 16, face="bold"), 
        axis.title.y = element_text(size = 16, face="bold"),
        legend.text = element_text(size=12),
        legend.title = element_text(size=14)) +
  scale_y_discrete(limits = rev(levels(months_df$months))) +
  scale_fill_continuous(low="lightyellow", high="red", breaks = breaks) +
  guides(fill = guide_legend(reverse=TRUE, override.aes=list(colour="black")))


Is the Decrease in Accidents Due to Technology?

This visualization shows the number of fatal and non-fatal accidents over time, grouped into five-year bins. From 1980 to 1985, both severities saw a spike in accidents before decreasing gradually and then dropping sharply after 2015. From 1985 to 2020, the total number of accidents fell by 90%, but fatal accidents remained roughly 20% of the total at both ends of that range. While the steady decline from 1990 to 2015 is likely due to technology and increased safety standards, the steep decline from 2015 to 2020 is likely due in part to the lack of flights under Covid-19 restrictions. Each point aggregates the five years starting at the labeled year, and because the data are filtered to end in 2020, the 2020 point covers a single year rather than five, which exaggerates the final drop.

# Strip any parenthesized suffix from the severity labels
df <- df %>%
  mutate(Injury.Severity = gsub("\\s*\\([^\\)]+\\)", "", as.character(Injury.Severity)))

days_df <- df %>%
  select(Event.Date, Injury.Severity) %>%
  mutate(year = year(ymd(Event.Date)),
         severity = Injury.Severity) %>%
  filter(year >= 1980 & year <= 2020) %>%
  filter(severity != "" & !is.na(severity) & severity != "Unknown", severity != "Unavailable", severity != "Incident") %>%
  # Five-year bin labeled by its first year (the column is named `decade`)
  mutate(decade = floor(year / 5) * 5) %>%
  group_by(decade, severity) %>%
  summarise(n = length(Event.Date), .groups='keep') %>%
  data.frame()

days_df$decade <- as.factor(days_df$decade)

fatal_hi_lo <- days_df %>%
  filter(severity == "Fatal") %>%
  filter(n == min(n) | n ==max(n)) %>%
  data.frame()

nonfatal_hi_lo <- days_df %>%
  filter(severity == "Non-Fatal") %>%
  filter(n == min(n) | n ==max(n)) %>%
  data.frame()

ggplot(days_df, aes(x = decade, y = n, group = severity)) +
  geom_line(aes(color=severity), linewidth=2) +  # `linewidth` replaces `size` for lines in ggplot2 3.4.0+
  labs(title = "Accidents by Year and Severity",
       x = "Year",
       y = "Accident Count") +
  theme_light() +
  theme(plot.title = element_text(hjust=0.5, family="Times New Roman", size=20, face="bold"),
        text = element_text(family="Times New Roman"),
        axis.text.x = element_text(size = 14),  
        axis.text.y = element_text(size = 14),
        axis.title.x = element_text(size = 16, face="bold"), 
        axis.title.y = element_text(size = 16, face="bold"),
        legend.text = element_text(size=12),
        legend.title = element_text(size=14, face = "bold")) +
  geom_point(shape=21, size=4, color="black", fill="white") +
  scale_y_continuous(labels=comma) +
  geom_point(data = nonfatal_hi_lo, aes(x = decade, y = n), shape=21, size=4, fill="darkblue", color="white") +
  geom_point(data = fatal_hi_lo, aes(x = decade, y = n), shape=21, size=4, fill="darkblue", color="white") +
  geom_label_repel(data = nonfatal_hi_lo, aes(label = ifelse(n == max(n) | n == min(n), scales::comma(n), "")), 
                   box.padding = 1, 
                   point.padding = 1, 
                   size=4,
                   family = "Times New Roman",
                   color="black", 
                   segment.color = "black") +
  geom_label_repel(data = fatal_hi_lo, aes(label = ifelse(n == max(n) | n == min(n), scales::comma(n), "")), 
                   box.padding = 1, 
                   point.padding = 1, 
                   size=4, 
                   family = "Times New Roman",
                   color="black", 
                   segment.color = "black") +
  scale_color_brewer(palette = "Paired",
                     name = "Severity",
                     guide = guide_legend(reverse = TRUE))
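
The percent decrease and the fatal share discussed for this plot both follow from the binned counts. The sketch below illustrates the arithmetic under stated assumptions: it reuses the `floor(year / 5) * 5` binning from the code above, while `toy_days` and its counts are invented purely to make the calculation easy to follow and are not taken from the dataset.

```r
library(dplyr)

# Five-year binning: each year maps to the first year of its bin
years <- c(1980, 1984, 1985, 2019, 2020)
bins <- floor(years / 5) * 5
print(bins)  # 2019 falls in the 2015 bin; 2020 starts a bin of its own

# Toy counts, invented for illustration; NOT taken from the dataset.
toy_days <- data.frame(
  decade = c(1985, 1985, 2020, 2020),
  severity = c("Fatal", "Non-Fatal", "Fatal", "Non-Fatal"),
  n = c(200, 800, 20, 80)
)

# Total accidents and fatal share per bin
bin_summary <- toy_days %>%
  group_by(decade) %>%
  summarise(total = sum(n),
            fatal_share = round(100 * sum(n[severity == "Fatal"]) / total, 1),
            .groups = "drop")

# Percent change in total accidents between the two bins
first_total <- bin_summary$total[bin_summary$decade == 1985]
last_total <- bin_summary$total[bin_summary$decade == 2020]
pct_change <- 100 * (last_total - first_total) / first_total

print(bin_summary)
print(pct_change)
```

Because the last bin can contain fewer years than the others, comparing per-year averages instead of bin totals would be one way to make the endpoints comparable.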


Which Aircraft Category has the Most Accidents?

The small-multiple pie charts show the share of accidents by year and aircraft category, with every category other than airplane and helicopter grouped as “Other.” From 2014 to 2022, airplanes dominated the distribution compared with helicopters and other categories. While the helicopter share remained roughly constant, the share of accidents in the “Other” category has gradually increased since 2020.

top_category <- count(df, Aircraft.Category)
top_category <- top_category[order(-top_category$n),]


category_df <- df %>%
  select(Aircraft.Category, Event.Date) %>%
  mutate(year = year(ymd(Event.Date)),
         myCategory = ifelse(Aircraft.Category=="Airplane", "Airplane", ifelse(Aircraft.Category=="Helicopter", "Helicopter", "Other"))) %>%
  group_by(year, myCategory) %>%
  summarise(n = length(Aircraft.Category), .groups='keep') %>%
  group_by(year) %>%
  mutate(percent_of_total = round(100*n/sum(n), 1)) %>%
  ungroup() %>%
  filter(year <= 2022) %>%
  filter(year %in% sort(unique(year), decreasing = TRUE)[1:9]) %>%
  data.frame()

ggplot(data = category_df, aes(x="", y = n, fill = myCategory)) +
  geom_bar(stat="identity", position="fill") +
  coord_polar(theta="y", start=0) +
  labs(fill = "Aircraft Category", x = NULL, y = NULL, 
       title = "Accidents by Year and Aircraft Category (2014-2022)",
       caption = "Slices under 5% are not labeled.") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5, family="Times New Roman", size=28, face="bold"),
        plot.caption = element_text(family = "Times New Roman", size = 20),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank(),
        strip.text = element_text(family = "Times New Roman", size = 24, face="bold"),
        legend.text = element_text(family = "Times New Roman", size = 20),   
        legend.title = element_text(family = "Times New Roman", size = 24, face="bold")) +
  facet_wrap(~year, ncol=3, nrow=3) +
  scale_fill_brewer(palette = "Greens") +
  geom_text(aes(x=1.7, label=ifelse(percent_of_total>5, paste0(percent_of_total, "%"), "")),
            size=8, family="Times New Roman",
            position=position_fill(vjust = 0.5))


Conclusion

Overall, while aviation is the safest mode of transportation, this report shows that accidents still occur, whether due to human error, mechanical failures, or environmental conditions. Looking at accident trends by phase of flight, injury severity, and month of occurrence shows that certain phases account for a disproportionate share: landing and takeoff alone make up 45.2% of recorded accidents. Both the heat map (2000 to 2022) and the line plot (1980 to 2020) show gradual decreases in accidents over time, which may reflect advances in technology and stronger aviation safety standards.