FiveThirtyEight - Data : Airline-Safety

TidyRecipe

Create an Example Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset.

Airline Data Selection & Import

When I was in High School, the Television show that everyone could not stop talking about was “Lost”. This famous show, created by J.J. Abrams, depicted a fictional plane crash and what happened subsequently to the passengers and crew. In real life, and it seems a lot more in recent history than ever, planes have been crashing or vanishing at sea at an alarming rate. Personally, my father travels about 25 weeks out of the year. Coming from a computer science background and not an engineering one. How can I give my father and the rest of the traveling community, some sort of “advise” for flying safe within the means of my area of expertise. Here is where this project can come into play. I found this dataset off of Kaggle’s GitHub page that was the data behind a very powerful article, Should Travelers Avoid Flying Airlines That Have Had Crashes in the Past?. For now, let’s create a “Tidy Recipe” for this dataset.

Kaggle’s GitHub Repo for Airline Saftey Data

Airline-Saftey_GitHub

Airline Saftey CSV RAW

CSV

## Parsed with column specification:
## cols(
##   airline = col_character(),
##   avail_seat_km_per_week = col_double(),
##   incidents_85_99 = col_double(),
##   fatal_accidents_85_99 = col_double(),
##   fatalities_85_99 = col_double(),
##   incidents_00_14 = col_double(),
##   fatal_accidents_00_14 = col_double(),
##   fatalities_00_14 = col_double()
## )
airline avail_seat_km_per_week incidents_85_99 fatal_accidents_85_99 fatalities_85_99 incidents_00_14 fatal_accidents_00_14 fatalities_00_14
Aer Lingus 320906734 2 0 0 0 0 0
Aeroflot* 1197672318 76 14 128 6 1 88
Aerolineas Argentinas 385803648 6 0 0 1 0 0
Aeromexico* 596871813 3 1 64 5 0 0
Air Canada 1865253802 2 0 0 2 0 0
Air France 3004002661 14 4 79 6 2 337
Air India* 869253552 2 1 329 4 1 158
Air New Zealand* 710174817 3 0 0 5 1 7
Alaska Airlines* 965346773 5 0 0 5 1 88
Alitalia 698012498 7 2 50 4 0 0
All Nippon Airways 1841234177 3 1 1 7 0 0
American* 5228357340 21 5 101 17 3 416
Austrian Airlines 358239823 1 0 0 1 0 0
Avianca 396922563 5 3 323 0 0 0
British Airways* 3179760952 4 0 0 6 0 0
Cathay Pacific* 2582459303 0 0 0 2 0 0
China Airlines 813216487 12 6 535 2 1 225
Condor 417982610 2 1 16 0 0 0
COPA 550491507 3 1 47 0 0 0
Delta / Northwest* 6525658894 24 12 407 24 2 51
Egyptair 557699891 8 3 282 4 1 14
El Al 335448023 1 1 4 1 0 0
Ethiopian Airlines 488560643 25 5 167 5 2 92
Finnair 506464950 1 0 0 0 0 0
Garuda Indonesia 613356665 10 3 260 4 2 22

First Look

We can obserse above, that this dataset provides both the date-ranges : 85-99 and 00-14 in a row.

First we will want to convert this dataset into a format in which, for each airline there will be 2 rows: one for 85-99 and another row for 00-14.

From here, this will allow for each date range within an airline, where 1 row store the “count” which are accrued from - “incidents”, “fatal_incidents” and “fatalities”.

By doing so, this will create a new working data-frame that will be clean and easier to perform analysis on.

airline avail_seat_km_per_week incidents_85_99 incidents_00_14 fatal_accidents_85_99 fatal_accidents_00_14 fatalities_85_99 fatalities_00_14
Aer Lingus 320906734 2 0 0 0 0 0
Aeroflot* 1197672318 76 6 14 1 128 88
Aerolineas Argentinas 385803648 6 1 0 0 0 0
Aeromexico* 596871813 3 5 1 0 64 0
Air Canada 1865253802 2 2 0 0 0 0
Air France 3004002661 14 6 4 2 79 337
Air India* 869253552 2 4 1 1 329 158
Air New Zealand* 710174817 3 5 0 1 0 7
Alaska Airlines* 965346773 5 5 0 1 0 88
Alitalia 698012498 7 4 2 0 50 0
All Nippon Airways 1841234177 3 7 1 0 1 0
American* 5228357340 21 17 5 3 101 416
Austrian Airlines 358239823 1 1 0 0 0 0
Avianca 396922563 5 0 3 0 323 0
British Airways* 3179760952 4 6 0 0 0 0
Cathay Pacific* 2582459303 0 2 0 0 0 0
China Airlines 813216487 12 2 6 1 535 225
Condor 417982610 2 0 1 0 16 0
COPA 550491507 3 0 1 0 47 0
Delta / Northwest* 6525658894 24 24 12 2 407 51
Egyptair 557699891 8 4 3 1 282 14
El Al 335448023 1 1 1 0 4 0
Ethiopian Airlines 488560643 25 5 5 2 167 92
Finnair 506464950 1 0 0 0 0 0
Garuda Indonesia 613356665 10 4 3 2 260 22

Tidying -r

Using the \(tidyr\) package, we will convert this data into a long dataset using the \(gather\) function.

This will count variables “type” and “date_range” in a column.

In addition, the value of this count will be placed into another column.

airline avail_seat_km_per_week date_range_type count
Aer Lingus 320906734 incidents_85_99 2
Aeroflot* 1197672318 incidents_85_99 76
Aerolineas Argentinas 385803648 incidents_85_99 6
Aeromexico* 596871813 incidents_85_99 3
Air Canada 1865253802 incidents_85_99 2
Air France 3004002661 incidents_85_99 14
Air India* 869253552 incidents_85_99 2
Air New Zealand* 710174817 incidents_85_99 3
Alaska Airlines* 965346773 incidents_85_99 5
Alitalia 698012498 incidents_85_99 7
All Nippon Airways 1841234177 incidents_85_99 3
American* 5228357340 incidents_85_99 21
Austrian Airlines 358239823 incidents_85_99 1
Avianca 396922563 incidents_85_99 5
British Airways* 3179760952 incidents_85_99 4
Cathay Pacific* 2582459303 incidents_85_99 0
China Airlines 813216487 incidents_85_99 12
Condor 417982610 incidents_85_99 2
COPA 550491507 incidents_85_99 3
Delta / Northwest* 6525658894 incidents_85_99 24
Egyptair 557699891 incidents_85_99 8
El Al 335448023 incidents_85_99 1
Ethiopian Airlines 488560643 incidents_85_99 25
Finnair 506464950 incidents_85_99 1
Garuda Indonesia 613356665 incidents_85_99 10

From here, we split the date range and type into two different columns

Furthermore, eliminating columns which has them both together:

Splitting the Range

airline avail_seat_km_per_week date_range type count
Aer Lingus 320906734 85_99 incidents 2
Aeroflot* 1197672318 85_99 incidents 76
Aerolineas Argentinas 385803648 85_99 incidents 6
Aeromexico* 596871813 85_99 incidents 3
Air Canada 1865253802 85_99 incidents 2
Air France 3004002661 85_99 incidents 14
Air India* 869253552 85_99 incidents 2
Air New Zealand* 710174817 85_99 incidents 3
Alaska Airlines* 965346773 85_99 incidents 5
Alitalia 698012498 85_99 incidents 7
All Nippon Airways 1841234177 85_99 incidents 3
American* 5228357340 85_99 incidents 21
Austrian Airlines 358239823 85_99 incidents 1
Avianca 396922563 85_99 incidents 5
British Airways* 3179760952 85_99 incidents 4
Cathay Pacific* 2582459303 85_99 incidents 0
China Airlines 813216487 85_99 incidents 12
Condor 417982610 85_99 incidents 2
COPA 550491507 85_99 incidents 3
Delta / Northwest* 6525658894 85_99 incidents 24
Egyptair 557699891 85_99 incidents 8
El Al 335448023 85_99 incidents 1
Ethiopian Airlines 488560643 85_99 incidents 25
Finnair 506464950 85_99 incidents 1
Garuda Indonesia 613356665 85_99 incidents 10

Using the \(spread\) function and using the variables, “incidents”, “fatal_incidents” and “fatalities”, while counting the occurences to create a value.

This is to convert back into a dataset, however, creating a date range wise split in the process.

Spread

airline avail_seat_km_per_week date_range incidents fatal_accidents fatalities
Aer Lingus 320906734 00_14 0 0 0
Aer Lingus 320906734 85_99 2 0 0
Aeroflot* 1197672318 00_14 6 1 88
Aeroflot* 1197672318 85_99 76 14 128
Aerolineas Argentinas 385803648 00_14 1 0 0
Aerolineas Argentinas 385803648 85_99 6 0 0
Aeromexico* 596871813 00_14 5 0 0
Aeromexico* 596871813 85_99 3 1 64
Air Canada 1865253802 00_14 2 0 0
Air Canada 1865253802 85_99 2 0 0
Air France 3004002661 00_14 6 2 337
Air France 3004002661 85_99 14 4 79
Air India* 869253552 00_14 4 1 158
Air India* 869253552 85_99 2 1 329
Air New Zealand* 710174817 00_14 5 1 7
Air New Zealand* 710174817 85_99 3 0 0
Alaska Airlines* 965346773 00_14 5 1 88
Alaska Airlines* 965346773 85_99 5 0 0
Alitalia 698012498 00_14 4 0 0
Alitalia 698012498 85_99 7 2 50

Now the data is fully ready with each airline and a date range having 1 row each

Sorting Quick

airline avail_seat_km_per_week date_range incidents fatal_accidents fatalities
Aeroflot* 1197672318 85_99 76 14 128
Ethiopian Airlines 488560643 85_99 25 5 167
Delta / Northwest* 6525658894 00_14 24 2 51
Delta / Northwest* 6525658894 85_99 24 12 407
American* 5228357340 85_99 21 5 101
United / Continental* 7139291291 85_99 19 8 319
American* 5228357340 00_14 17 3 416
US Airways / America West* 2455687887 85_99 16 7 224
Air France 3004002661 85_99 14 4 79
United / Continental* 7139291291 00_14 14 2 109

Expansion by M. Kollontai

In order to determine whether or not some of the airlines with poorer safety records improved over the time periods within our dataset, we can look at the differences in the values by plotting them alongside one another.

To begin, let’s create a new dataframe that lists the change in the incidents, fatal accidents and fatalities for each airline. To do so, use the summarise function after grouping by airline.

Summarise allows us to combine data based on a certain grouping (in this case by \(airline\)) and create new calculated values. In this case what we want is the difference between the values within the three columns \(incidents\), \(fatal accidents\) and \(fatalities\).

Now let’s create ne dataframes showing the most improved airline (we should note that this doesn’t take into account the starting point, so an airline that improved from 200 to 100 will appear better than one that stayed at 0). We will create 3 frames - one for each of the variables.

Finally, let’s plot and see which airlines have become much more dangerous. For this example, we will take a look at the change in fatalities numbers. We will use barplot to create two plots: one of the 10 airlines with the highest increase of fatalities and one of the 10 most improved arlines in terms of number of fatalities.