Project Sources

Project Information

In response to the Malaysia Airlines Flight 17 aviation incident, Nate Silver, the author of the original article, decided to identify if there existed any correlation between crash rates and a selection of major commercial airlines. Analysis was done by comparing crash rates for the airlines across two 15-year time periods (1985-1999 and 2000-2014) based on data from the Aviation Safety Network Database. One of the graphs in his article—Fatalities by Airline Are Highly Unpredictable—was used to highlight the correlation between the number of fatalities for the airlines from one period to the next. In this project I attempt to replicate this plot.

General Project Workflow

  1. Load packages and libraries.
  2. Load dataset.
  3. Observe dataset characteristics.
  4. Perform necessary feature engineering on the dataset.
  5. Plot graph.
  6. Add descriptive features to the graph. Adjust aesthetic elements.

Graph Recreation + Code Breakdown

Uncomment to install the needed packages.

#install.packages(c("fivethirtyeight","ggplot2","dplyr","scales"))

Important Libraries necessary for Graph Recreation are loaded. The “airline_safety” dataset from fivethirtyeight package is loaded and put into a new variable “airsafety” for easier typing.

library(fivethirtyeight)
library(ggplot2)
library(dplyr)
library(scales)

data(airline_safety)
airsafety <- airline_safety

An examination of the airsafety dataset is done to understand the characteristics of the unmodified dataset. Dataset dimensions (dim()), number of NA values (sum(is.na())), and the column names (colnames()) were observed.

paste("Size of data: ", dim(airsafety)[1], "x", dim(airsafety)[2])
## [1] "Size of data:  56 x 9"
paste("Number of null values: ", sum(is.na(airsafety)))
## [1] "Number of null values:  0"
colnames(airsafety)
## [1] "airline"                "incl_reg_subsidiaries"  "avail_seat_km_per_week"
## [4] "incidents_85_99"        "fatal_accidents_85_99"  "fatalities_85_99"      
## [7] "incidents_00_14"        "fatal_accidents_00_14"  "fatalities_00_14"

A head() of airsafety helps with further understanding the type of data contained in each column.

head(airsafety, 5)
## # A tibble: 5 × 9
##   airline        incl_…¹ avail…² incid…³ fatal…⁴ fatal…⁵ incid…⁶ fatal…⁷ fatal…⁸
##   <chr>          <lgl>     <dbl>   <int>   <int>   <int>   <int>   <int>   <int>
## 1 Aer Lingus     FALSE    3.21e8       2       0       0       0       0       0
## 2 Aeroflot       TRUE     1.20e9      76      14     128       6       1      88
## 3 Aerolineas Ar… FALSE    3.86e8       6       0       0       1       0       0
## 4 Aeromexico     TRUE     5.97e8       3       1      64       5       0       0
## 5 Air Canada     FALSE    1.87e9       2       0       0       2       0       0
## # … with abbreviated variable names ¹​incl_reg_subsidiaries,
## #   ²​avail_seat_km_per_week, ³​incidents_85_99, ⁴​fatal_accidents_85_99,
## #   ⁵​fatalities_85_99, ⁶​incidents_00_14, ⁷​fatal_accidents_00_14,
## #   ⁸​fatalities_00_14

According to the article, fatal crash rates were based on “the number of available seat kilometers (ASKs) … the number of seats multiplied by the number of kilometers the airline flies”. Our graph of interest specifically looks at Fatalities and ASKs of airlines. Therefore columns “fatalities_00_14”, “fatalities_85_99”, “airline”, and “avail_seat_km_per_week” are identified as the columns of interest. Per the subtitle in the graph, we should adjust fatalities by ASK for 1 trillion kilometers. The following equation is used: fatalities * 1e12 / (avail_seat_km_per_week * 52 * 15). The “avail_seat_km_per_week” is multiplied by 52 (weeks) and 15 (years) to change the values from ASKs weekly to the amount of ASKs over the 15 year periods. Dividing “fatalities_00_14” and “fatalities_85_99” by ASKs over 15 years and multiplying by 1 trillion gives the adjusted fatalities per trillion kilometers over the 15 year periods. The mutate() function is used to add these adjusted data columns to the airsafety dataset as new columns “adjust_85_99” and “adjust_00_14”.

trillion_seat_km <- (1*10^12) / (airsafety$avail_seat_km_per_week * 52 * 15)
airsafety <- mutate(airsafety, adjust_85_99=airsafety$fatalities_85_99 * trillion_seat_km)
airsafety <- mutate(airsafety, adjust_00_14=airsafety$fatalities_00_14 * trillion_seat_km)

To match the labeling format in the original plot, a new line (“\n”) replaces ” ” (spaces) in airline names with more than 1 word using the base R function sub(). For airline “Avianca”, the only one word airline shown in the original plot, a newline is manually added to the end using indexing. These modified airline names are added to airsafety as a new column “air_label” using mutate().

airsafety <- mutate(airsafety, air_label=sub(" ", "\n", airsafety$airline))
airsafety[airsafety$air_label=="Avianca","air_label"] <- "Avianca\n"

A base plot is made using ggplot(). The newly generated adjusted_85_99 (x-axis) and adjusted_00_14 (y-axis) columns are plotted using geom_point() and adjusted for shape, size, and opacity to match the original graph. geom_label() adds airline labels from the new “air_label” column to the right of a select few points based on their position using filter(). geom_smooth() creates a red correlation line with no confidence intervals by fitting a linear model with the general formula y~x (adjust_00_14~adjust_85_99). scale_x_continuous() and scale_y_continuous() both take the exact same arguments that set the axis limits (0 to 1500), tick mark intervals (every 250), and axis label format (use commas -> X,XXX). Finally, geom_vline() and geom_hline() plot lines at x = 0 and y = 0 to thicken the appearance of the axis lines.

bare_Plot <- ggplot(data=airsafety, mapping = aes(adjust_85_99,adjust_00_14)) + 
  geom_point(alpha=0.6, size = 3, shape = 16) + 
  geom_label(data = filter(airsafety, adjust_85_99 > 750 | adjust_00_14 > 650), 
             mapping = aes(label=air_label), hjust="left", nudge_x = 25, label.size = NA, fill=NA) +
  geom_smooth(method ="lm",formula=y~x,col="red",se = FALSE) +
  scale_x_continuous(labels=comma,breaks = seq(0,1500,250),limits = c(0,1500),minor_breaks = NULL) +
  scale_y_continuous(labels=comma,breaks = seq(0,1500,250),limits = c(0,1500),minor_breaks = NULL) +
  geom_vline(xintercept = 0, lwd = 0.35) + geom_hline(yintercept = 0, lwd = 0.35)

Aesthetics and important label features are added to the base plot above. The title, subtitle, and axis labels are added using labs(). Aesthetics including font size, font type, plot background color, grid line color, and axes tick color are adjusted with theme() to match the original plot as closely as possible. An online Hex/HTML color code site was used to generate the colors (https://htmlcolorcodes.com). Finally, the plot size and position is set using R code chunk options.

  bare_Plot + 
  labs(title = "Fatalities by Airline Are Highly Unpredictable", 
       subtitle = "Fatalities adjusted for seats available and distance traveled\n(deaths per 1 trillion seat kilometers)", 
       x = "1985-99", y = "2000-14") + 
  theme(plot.title = element_text(size=20,face="bold"), #change title font size + type(bold)
        plot.subtitle = element_text(size = 14), #change subtitle size
        plot.background = element_rect(fill="#f0f2f5"), #change plot background color
        panel.background = element_rect(fill="#f0f2f5"), #change color of the entire panel frame
        panel.grid.major = element_line(color="#DADBDC"), #change grid line color
        axis.ticks = element_line(color="#DADBDC"), #change axes tick color
        axis.text = element_text(size = 11), #change axes label size
        axis.title = element_text(size=15,face="bold")) #change axes title size

Conclusion

As seen in my replicated plot as well as the original in Nate Silver’s article, there exists no correlation in rate of fatalities from one period to the next for these airlines.