library(readr)
library(dplyr)
library(knitr)
library(tidyr)
library(ggplot2)
library(DATA606)
library(psych)
library(Hmisc)
library(corrplot)
library(PerformanceAnalytics)

1. Introduction

Flying is considering one of the safest modes of transportation. There are more car accidents than airplane accidents, despite the facts, there are people who are afraid of flying. It could be the spectacular nature of airplane accidents or the very nature of flying that creates irrational fear in some folks. But should potential travellers avoid airlines with less than stellar records? Are airlines with a previous history of incidents indicative of future incidents?

2. Data

The data was collected from five thirty eight, which itself was an aggregate of data from the Aviation Safety Network. These were observational studies that were documented and had their data stored in AVN database. The dataset being used has 56 airlines with 8 variables.

The dataset can be found online here: https://github.com/fivethirtyeight/data/blob/master/airline-safety/airline-safety.csv

The Aviation Safety Network database can be found here: https://aviation-safety.net/database/

Dependent Variable

In this particular dataset, and based on our question, the dependent variable is the number of recent incidents (from 2000 to 2014). The variable is numerical.

Independent Variable

The independent variable is the past (historical) number of incidents (quantitative) and a qualitative variable that can be taken into consideration is whether the airline is from a “first-world country” vs. a “non-first world country”.

3. Exploratory Data

Below is the raw data downloaded from the github

airline_df <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/airline-safety/airline-safety.csv")

#Let's look at the columns and the different values, making sure there are not N/A values
glimpse(airline_df)
## Observations: 56
## Variables: 8
## $ airline                <fct> Aer Lingus, Aeroflot*, Aerolineas Argentinas...
## $ avail_seat_km_per_week <dbl> 320906734, 1197672318, 385803648, 596871813,...
## $ incidents_85_99        <int> 2, 76, 6, 3, 2, 14, 2, 3, 5, 7, 3, 21, 1, 5,...
## $ fatal_accidents_85_99  <int> 0, 14, 0, 1, 0, 4, 1, 0, 0, 2, 1, 5, 0, 3, 0...
## $ fatalities_85_99       <int> 0, 128, 0, 64, 0, 79, 329, 0, 0, 50, 1, 101,...
## $ incidents_00_14        <int> 0, 6, 1, 5, 2, 6, 4, 5, 5, 4, 7, 17, 1, 0, 6...
## $ fatal_accidents_00_14  <int> 0, 1, 0, 0, 0, 2, 1, 1, 1, 0, 0, 3, 0, 0, 0,...
## $ fatalities_00_14       <int> 0, 88, 0, 0, 0, 337, 158, 7, 88, 0, 0, 416, ...
#Overall summary data of the raw data
summary(airline_df)
##                   airline   avail_seat_km_per_week incidents_85_99 
##  Aer Lingus           : 1   Min.   :2.594e+08      Min.   : 0.000  
##  Aeroflot*            : 1   1st Qu.:4.740e+08      1st Qu.: 2.000  
##  Aerolineas Argentinas: 1   Median :8.029e+08      Median : 4.000  
##  Aeromexico*          : 1   Mean   :1.385e+09      Mean   : 7.179  
##  Air Canada           : 1   3rd Qu.:1.847e+09      3rd Qu.: 8.000  
##  Air France           : 1   Max.   :7.139e+09      Max.   :76.000  
##  (Other)              :50                                          
##  fatal_accidents_85_99 fatalities_85_99 incidents_00_14  fatal_accidents_00_14
##  Min.   : 0.000        Min.   :  0.0    Min.   : 0.000   Min.   :0.0000       
##  1st Qu.: 0.000        1st Qu.:  0.0    1st Qu.: 1.000   1st Qu.:0.0000       
##  Median : 1.000        Median : 48.5    Median : 3.000   Median :0.0000       
##  Mean   : 2.179        Mean   :112.4    Mean   : 4.125   Mean   :0.6607       
##  3rd Qu.: 3.000        3rd Qu.:184.2    3rd Qu.: 5.250   3rd Qu.:1.0000       
##  Max.   :14.000        Max.   :535.0    Max.   :24.000   Max.   :3.0000       
##                                                                               
##  fatalities_00_14
##  Min.   :  0.00  
##  1st Qu.:  0.00  
##  Median :  0.00  
##  Mean   : 55.52  
##  3rd Qu.: 83.25  
##  Max.   :537.00  
## 
airline_df[-c(1)] %>%
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(airline_df, aes(x = incidents_85_99, y = incidents_00_14)) + geom_point()

ggplot(airline_df, aes(x = fatal_accidents_85_99, y = fatal_accidents_00_14)) + geom_point()

ggplot(airline_df, aes(x = fatalities_85_99, y = fatalities_00_14)) + geom_point()

first_world_rows <- c(1,5,6,8,9,10,11,12,13,15,16,18,20,24,27,28,29,31,34,38,40,42,44,46,55,53,52)

first_world_df <- airline_df[first_world_rows,]

non_first_df <- airline_df[-c(first_world_rows),]

Based on the plots, there appears seems to be some correlation between incidents in the past and future incidents. But, not much correlation between the other variables.

4. Inference/Correrlation

Is there a correlation between past incidents and future incidents?

flattenCorrMatrix <- function(cormat, pmat) {
  ut <- upper.tri(cormat)
  data.frame(
    row = rownames(cormat)[row(cormat)[ut]],
    column = rownames(cormat)[col(cormat)[ut]],
    cor  =(cormat)[ut],
    p = pmat[ut]
    )
}

##Above formula was taken from Reference 1

Let’s take an overall look at all the variables and see their cor and p values.

airlines <- airline_df[ ,2:length(airline_df)]

airlines.cor = cor(airlines, method = c("spearman"))

newCorr <- rcorr(as.matrix(airlines))

flattenCorrMatrix(newCorr$r, newCorr$P)

We can investigate and silo the incidents variables to investigate their relationship closer.

incidents <- lm(incidents_85_99 ~ incidents_00_14, data = airline_df)
summary(incidents)
## 
## Call:
## lm(formula = incidents_85_99 ~ incidents_00_14, data = airline_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.970 -4.046 -2.131  1.863 66.987 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)       3.1421     1.8470   1.701  0.09466 . 
## incidents_00_14   0.9785     0.3024   3.236  0.00207 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.19 on 54 degrees of freedom
## Multiple R-squared:  0.1624, Adjusted R-squared:  0.1469 
## F-statistic: 10.47 on 1 and 54 DF,  p-value: 0.002073
cor(airline_df$incidents_85_99, airline_df$incidents_00_14)
## [1] 0.4030088
#Graph the plot showing the linear regession for incidents
plot(airline_df$incidents_85_99 ~ airline_df$incidents_00_14)
abline(incidents)

There is little correlation between past incidents and future incidents. The high correlation is between avail_seat_per_week and incidents_00_14. This makes sense as those that fly the most on average, will most likely have the more incidents. There is a .57 correlation between past fatal_accidents_85_99 and incidents_00_14. However, the .403 correlation between past incidents is signifcant as the p-value is 0.002 which is less than .005.

corrplot(newCorr$r, type="upper", order="hclust", 
         p.mat = newCorr$P, sig.level = 0.05, insig = "blank")

chart.Correlation(airlines, histograme=TRUE, pch=19)

palette = colorRampPalette(c("green", "white", "red")) (20)
heatmap(x = airlines.cor, col = palette, symm = TRUE)

Following the methodology on Five Thirty Eight, we can create safety scores using this procedure:

  1. In each category, subtract the airline’s crash rate from the total average for all airlines since 1985. (safer airlines have positive scores and less safe airlines have negative scores)

  2. Multiple the results by the square root of the total seat kilometers flown. (Benefiting those airlines that fly a lot but maintain steady safety records)

  3. Standardize the scores in each category, calculating how many SD’s above or below the mean an airline’s record lies. Then average the scores from all categoies.

The scores show, that there are “safer” airlines from higher gpd nations versus those that lower income. However, we have to keep in mind that some airlines were subjected to terrorist attacks, highjackings, etc. This blemishes records. As a whole, flying is a safe way to travel.

library(robustHD)
## Warning: package 'robustHD' was built under R version 3.6.3
## Loading required package: perry
## Warning: package 'perry' was built under R version 3.6.3
## Loading required package: parallel
## Loading required package: robustbase
## Warning: package 'robustbase' was built under R version 3.6.3
## 
## Attaching package: 'robustbase'
## The following object is masked from 'package:survival':
## 
##     heart
airline_df_no_km <- airline_df[-c(1:2)]


avgCol <- as.data.frame(colMeans(airline_df_no_km[sapply(airline_df_no_km, is.numeric)]))


colnames(avgCol) <-"Averages"

new_airline_df <- airline_df

new_airline_df$incidents_85_99 <- avgCol$Averages[1] - new_airline_df$incidents_85_99
new_airline_df$fatal_accidents_85_99 <- avgCol$Averages[2] - new_airline_df$fatal_accidents_85_99
new_airline_df$fatalities_85_99 <- avgCol$Averages[3] - new_airline_df$fatalities_85_99

new_airline_df$incidents_85_99 <- new_airline_df$incidents_85_99 * sqrt(new_airline_df$avail_seat_km_per_week)
new_airline_df$fatal_accidents_85_99 <- new_airline_df$fatal_accidents_85_99 * sqrt(new_airline_df$avail_seat_km_per_week)
new_airline_df$fatalities_85_99 <- new_airline_df$fatalities_85_99 * sqrt(new_airline_df$avail_seat_km_per_week)

new_airline_df$incidents_00_14 <- avgCol$Averages[4] - new_airline_df$incidents_00_14
new_airline_df$fatal_accidents_00_14 <- avgCol$Averages[5] - new_airline_df$fatal_accidents_00_14
new_airline_df$fatalities_00_14 <- avgCol$Averages[6] - new_airline_df$fatalities_00_14

new_airline_df$incidents_00_14 <- new_airline_df$incidents_00_14 * sqrt(new_airline_df$avail_seat_km_per_week)
new_airline_df$fatal_accidents_00_14 <- new_airline_df$fatal_accidents_00_14 * sqrt(new_airline_df$avail_seat_km_per_week)
new_airline_df$fatalities_00_14 <- new_airline_df$fatalities_00_14 * sqrt(new_airline_df$avail_seat_km_per_week)

scaled_data <- new_airline_df

scaled_data$incidents_85_99 <- standardize(scaled_data$incidents_85_99)
scaled_data$fatal_accidents_85_99 <- standardize(scaled_data$fatal_accidents_85_99)
scaled_data$fatalities_85_99 <- standardize(scaled_data$fatalities_85_99)

scaled_data$incidents_00_14 <- standardize(scaled_data$incidents_00_14)
scaled_data$fatal_accidents_00_14 <- standardize(scaled_data$fatal_accidents_00_14)
scaled_data$fatalities_00_14 <- standardize(scaled_data$fatalities_00_14)


scaled_data$score_85_99 <- (scaled_data$incidents_85_99 + scaled_data$fatal_accidents_85_99 + scaled_data$fatalities_85_99)/3
scaled_data$score_00_14 <- (scaled_data$incidents_00_14 + scaled_data$fatal_accidents_00_14 + scaled_data$fatalities_00_14)/3

scaled_data$final_score <- (scaled_data$score_85_99 + scaled_data$score_00_14) / 2


scores_df <- scaled_data[c(1,9:11)]


first_scores_df <- scaled_data[c(1,5,6,8,9,10,11,12,13,15,16,18,20,24,27,28,29,31,34,38,40,42,44,46,55,53,52),]
first_scores_df <- first_scores_df[c(1,9:11)]

other_scores_df <- scaled_data[-c(1,5,6,8,9,10,11,12,13,15,16,18,20,24,27,28,29,31,34,38,40,42,44,46,55,53,52),]
other_scores_df <- other_scores_df[c(1,9:11)]
library(reactable)
## Warning: package 'reactable' was built under R version 3.6.3
reactable(scores_df, columns = list(
                                score_85_99 = colDef(name = "Safety Score 85-99"),
                                score_00_14 = colDef(name = "Saftey Score 00-14"),
                                final_score = colDef(name = "Combined Score"),
                                final_score = colDef(style = function(value) {
                                                                        if (value > 0) {
                                                                          color <- "green"
                                                                        } else if (value < 0) {
                                                                          color <- "red"
                                                                        } else {
                                                                          color <- "#777"
                                                                        }
                                                                        list(color = color, fontWeight = "bold")
                                                                  }) ),
                                 highlight = TRUE,
                                 striped = TRUE,
                                 defaultColDef = colDef(
                                                 footer = function(values, name) htmltools::div(name, style = list(fontWeight = 600))
                                                    ),
                    
            )
reactable(first_scores_df, columns = list(
                                score_85_99 = colDef(name = "Safety Score 85-99"),
                                score_00_14 = colDef(name = "Saftey Score 00-14"),
                                final_score = colDef(name = "Combined Score"),
                                final_score = colDef(style = function(value) {
                                                                        if (value > 0) {
                                                                          color <- "green"
                                                                        } else if (value < 0) {
                                                                          color <- "red"
                                                                        } else {
                                                                          color <- "#777"
                                                                        }
                                                                        list(color = color, fontWeight = "bold")
                                                                  }) ),
                                 highlight = TRUE,
                                 striped = TRUE,
                                 defaultColDef = colDef(
                                                 footer = function(values, name) htmltools::div(name, style = list(fontWeight = 600))
                                                    ),
                    
            )
reactable(other_scores_df, columns = list(
                                score_85_99 = colDef(name = "Safety Score 85-99"),
                                score_00_14 = colDef(name = "Saftey Score 00-14"),
                                final_score = colDef(name = "Combined Score"),
                                final_score = colDef(style = function(value) {
                                                                        if (value > 0) {
                                                                          color <- "green"
                                                                        } else if (value < 0) {
                                                                          color <- "red"
                                                                        } else {
                                                                          color <- "#777"
                                                                        }
                                                                        list(color = color, fontWeight = "bold")
                                                                  }) ),
                                 highlight = TRUE,
                                 striped = TRUE,
                                 defaultColDef = colDef(
                                                 footer = function(values, name) htmltools::div(name, style = list(fontWeight = 600))
                                                    ),
                    
            )

## 5. Conclusion

Based on the data, an airline’s previous history is not a strong indicator of its future safety record. However, you’re probably safer on an airline from a higher income country. But it cannot be stressed enough that flying is still a safe method of transportation.

References

  1. http://www.sthda.com/english/wiki/correlation-matrix-a-quick-start-guide-to-analyze-format-and-visualize-a-correlation-matrix-using-r-software
  2. https://fivethirtyeight.com/features/should-travelers-avoid-flying-airlines-that-have-had-crashes-in-the-past/